XML and SAX--New Capabilities for C++ Programmers

By: Kenn Scribner for Visual C++ Developer


Download the code

Background: As I was writing the XML series, Microsoft released a new version of MSXML with SAX support--very cool stuff if you're into XML, as you can do things with SAX that are lightning fast when compared to the DOM (for some applications...both SAX and DOM are important and have their respective places in the XML processing tool lineup). This article introduced SAX and offered my attempt at SAX ATL support (which surprisingly even seems to work).


If you haven't already downloaded the latest in Microsoft XML parser technology, visit msdn.microsoft.com/downloads/webtechnology/xml/msxml.asp. Here you can download the new parser and its SDK (header files, libraries, documentation, and so on). It's free for the cost of the bandwidth...

The May 2000 XML parser release
The May 2000 XML parser release extends the March 2000 parser release by incorporating four additions:

• Bug fixes (a complete list of these can be found at msdn.microsoft.com/workshop/xml/general/msxml_buglist.asp)

• Enhanced XSLT/XPath support

• Additional DOM (document object model) interfaces, primarily for XSLT, XPath, and schemas

• SAX support

The references to XPath and XSLT can be thought of collectively as "XSL," if you read my April article (Part 2 in this series). XSL itself has been decomposed into several constituent technologies, and this version of the DOM parser better supports those individual pieces. XPath is the part of the XSL language that deals with data manipulation, document traversal, and node matching expression evaluation. XSLT is the piece of XSL that transforms the XML document into another form, usually a different XML representation or into HTML for display using a style sheet you provide (you might remember the transformNode() method from April's article). This edition of the parser strengthens the separation of these two technologies, mirroring the direction of XML itself, while at the same time adding functionality to both to the benefit of each. You can read more about these technologies specifically at www.w3.org/Style/XSL.

The XSLT and XPath additions are terrific, and if you're into Web development, you'll no doubt be impressed with the new capabilities. And we all applaud bug fixes! But I find the most exciting addition by far to be the new SAX support that's built into the parser. Let's see what SAX is all about, and then I'll provide an ATL-based sample for you to try on your own.

The simple API for XML
If you step back for a moment and look at an XML document, you can easily see the hierarchical nature expressed in the XML itself. That is, XML documents can easily be represented in the form of a tree. I described this in my April article introducing you to the XML DOM.

In April, to reiterate, I used this example XML document to show the hierarchical nodal relationships:

<?xml version="1.0"?>
 <myxmldoc>
    <title>Basic XML Markup</title>
    <greeting>Hello, World!</greeting>
 </myxmldoc>
 

Graphically, this XML document (albeit a simple one) can be represented as you see in Figure 1.

This natural document structure gives rise to the XML DOM, which formalizes the arrangement of nodes and node types. If you'd like to brush up on the DOM and how to parse an XML document as DOM nodes, please refer to my April article.

But there's another way to view this XML document, at least if you're a finite state machine. Imagine for the moment that your task in life is to read text streams for XML information, and anytime you come across something you recognize, you send a note to your boss. In this case, assume you were handed the XML document I just presented, character by character.

The first character you'd see is the opening tag bracket (<). You recognize this as an opening tag, so you interpret characters as a tag until you see the closing bracket (>)—at least for purposes of this illustration. In this case, you first interpret the initial processing instruction <?xml version="1.0"?>. Since you recognize this as an XML processing instruction, you send a note to your boss to this effect. Then you interpret the document element <myxmldoc>, followed by the rest of the document's information.

The point of this illustration is two-fold. First, you don't see the data as hierarchical. The data, from your point of view, is merely a stream of characters to which you apply XML rules. You look for opening tags, closing tags, and general XML data (you can also validate the XML data as it comes to you by using a provided schema or DTD, although not with this particular release of the SAX parser). The second point is that you send notifications to some interested party when you discern items of interest, such as XML processing instructions, tags, and so on. In more formal terms, you're implementing (or at least imagining) an event-driven parser, also known as a push parser.

This is what SAX provides—a push parser for XML documents. Stop thinking like the parser for a minute and imagine the code you're going to write—that's the "interested party" the parser sends notifications to. If you're interested in all of the XML document's nodes, then you add an event handler for XML document nodes. Or, if you'd rather quickly scan an XML document for processing instructions, you add a handler for those instead of, or in addition to, normal XML nodes.
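
To make the push model concrete, here's a minimal sketch of the "read characters, send notes to the boss" idea in portable C++. It is not the MSXML parser or its API: ToyHandler, toyParse(), EventLog, and toyEvents() are names invented for this illustration, and the scanner deliberately ignores attributes, comments, CDATA sections, entities, and error reporting.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical handler interface for this sketch (not the MSXML SAX API).
// Every event has a do-nothing default, so you override only what you need.
struct ToyHandler {
    virtual ~ToyHandler() = default;
    virtual void processingInstruction(const std::string& /*body*/) {}
    virtual void startElement(const std::string& /*name*/) {}
    virtual void endElement(const std::string& /*name*/) {}
    virtual void characters(const std::string& /*text*/) {}
};

// Walks the input one character at a time and pushes events to the handler.
// Assumption: tags carry no attributes and the input is well formed.
inline void toyParse(const std::string& xml, ToyHandler& handler) {
    std::string text;
    std::size_t i = 0;
    while (i < xml.size()) {
        if (xml[i] != '<') {                 // ordinary character data
            text += xml[i++];
            continue;
        }
        // A tag is starting: report any pending non-whitespace text first.
        if (text.find_first_not_of(" \t\r\n") != std::string::npos)
            handler.characters(text);
        text.clear();

        std::size_t close = xml.find('>', i);
        if (close == std::string::npos) break;   // unterminated tag: stop
        std::string tag = xml.substr(i + 1, close - i - 1);

        if (!tag.empty() && tag[0] == '?') {     // <?target data?>
            std::string body = tag.substr(1);
            if (!body.empty() && body.back() == '?') body.pop_back();
            handler.processingInstruction(body);
        } else if (!tag.empty() && tag[0] == '/') {
            handler.endElement(tag.substr(1));   // </name>
        } else {
            handler.startElement(tag);           // <name>
        }
        i = close + 1;
    }
}

// Example "interested party": records each notification as a line of text.
struct EventLog : ToyHandler {
    std::vector<std::string> events;
    void processingInstruction(const std::string& body) override {
        events.push_back("PI " + body);
    }
    void startElement(const std::string& name) override {
        events.push_back("start " + name);
    }
    void endElement(const std::string& name) override {
        events.push_back("end " + name);
    }
    void characters(const std::string& text) override {
        events.push_back("text " + text);
    }
};

// Convenience wrapper: parse a string and return the recorded events.
inline std::vector<std::string> toyEvents(const std::string& xml) {
    EventLog log;
    toyParse(xml, log);
    return log.events;
}
```

Feeding the sample document shown earlier to toyEvents() yields one processing-instruction event followed by start, text, and end events in document order. At no point does the scanner build a tree.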

The key here is quickly. You can very quickly scan large XML documents for specific information using SAX. Unlike a DOM parser, you don't need to load the entire document into memory and process all of the nodal relationships before you extract information from the document. Instead, you stream through the document for the information you require, then discard the document without having loaded the entire thing into memory. You save both time and resources, and in many situations that's critical.

You can also implement your own document data structure, for those situations where a tree isn't as appropriate as some other data structure. For example, as your SAX node handler accepts events from the SAX parser, indicating that a new XML node has been interpreted, you could push the data contained within the node onto a stack or add it to a data queue. How you process the data from that point is completely up to you. You aren't forced to deal with your document's data in tree form.
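
As a rough sketch of that idea, the handler below keeps the currently open elements on a stack and records each piece of text alongside its path. PathTracker and its method names are hypothetical; they mirror SAX-style events but aren't the MSXML interfaces, and you'd call them from your real event handlers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical handler for illustration: it stores the document's data as
// flat "path = text" records instead of a tree.
class PathTracker {
public:
    // A new element opened: push its name onto the stack.
    void startElement(const std::string& name) { open_.push_back(name); }

    // An element closed: pop the most recently opened name.
    void endElement(const std::string& /*name*/) {
        if (!open_.empty()) open_.pop_back();
    }

    // Text arrived: record it together with the path of open elements.
    void characters(const std::string& text) {
        records_.push_back(currentPath() + " = " + text);
    }

    // Builds "/outer/inner" from the stack contents, bottom to top.
    std::string currentPath() const {
        std::string path;
        for (const std::string& name : open_) path += "/" + name;
        return path;
    }

    std::size_t depth() const { return open_.size(); }
    const std::vector<std::string>& records() const { return records_; }

private:
    std::vector<std::string> open_;     // used as a stack of open elements
    std::vector<std::string> records_;  // the flat data structure you keep
};
```

The stack never grows beyond the document's nesting depth, so memory use tracks how deep the document is rather than how long it is (apart from whatever records you choose to keep).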

There are a few areas in which the SAX parser falls short of the DOM, the most glaring of which is that SAX doesn't allow you to randomly access information in your XML document. Instead, you start at the beginning and run through the document, much like searching for a video clip on a videotape. Accessing the same video clip from a DVD-ROM is faster in this particular case. Another shortcoming is that you can create documents using the DOM, whereas the SAX parser's view of the document is read-only. The document must exist to be parsed! You can't start with a blank sheet of paper, as you can with the DOM. And finally, complex searching can be problematic, as you need to store intermediate search results while running through the remainder of the XML document. For each of these three cases, the DOM might be a better alternative. It pays to know how to use the tools!
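
To show what "storing intermediate search results" looks like in practice, here's a small sketch. Assume a made-up catalog format in which each book element contains a title followed by a price, and you want the titles of books at or below some price. The title arrives before the price, so the handler has to hold on to it until it knows whether the book qualifies. CheapBookFinder and the element names are invented for this example.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical search handler driven by SAX-style events.
class CheapBookFinder {
public:
    explicit CheapBookFinder(double maxPrice) : maxPrice_(maxPrice) {}

    void startElement(const std::string& name) {
        current_ = name;
        if (name == "book") {           // new book: reset the pending state
            pendingTitle_.clear();
            pendingPrice_ = -1.0;       // -1 means "no price seen yet"
        }
    }

    void characters(const std::string& text) {
        if (current_ == "title")
            pendingTitle_ += text;      // text may arrive in several chunks
        else if (current_ == "price")
            pendingPrice_ = std::strtod(text.c_str(), nullptr);
    }

    void endElement(const std::string& name) {
        current_.clear();               // ignore text between elements
        // Only at </book> do we know both the title and the price.
        if (name == "book" && pendingPrice_ >= 0.0 &&
            pendingPrice_ <= maxPrice_)
            matches_.push_back(pendingTitle_);
    }

    const std::vector<std::string>& matches() const { return matches_; }

private:
    double maxPrice_;
    std::string current_;               // name of the element being read
    std::string pendingTitle_;          // intermediate result
    double pendingPrice_ = -1.0;        // intermediate result
    std::vector<std::string> matches_;
};
```

With the DOM you'd simply walk to each book node and inspect its children in any order. Here the bookkeeping is yours, and it grows with the complexity of the query.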

A closer look at the SAX parser
Unlike the XML DOM, which is a specification under the control of the W3C (the World Wide Web Consortium, the organization responsible for overseeing many of the Web's standards), SAX is more the brainchild of the XML-DEV mailing list at large. There's a SAX home page, however, at www.megginson.com/SAX (that's the best place to start searching for any specific information you require).

In this article, I'll direct my discussion to the Microsoft SAX2 implementation. General-purpose SAX parsers operate in much the same way as the Microsoft parser, with the exception that the Microsoft parser is COM-compatible (not too surprising!). The important point to note is that, as with the DOM parser, the SAX parser is available to you as a set of COM interfaces. Other SAX implementations don't use COM (SAX began life as a Java API), so the interface details you see here apply only if you're using the Microsoft implementation.

With that in mind, let's look at the set of SAX interfaces the parser provides for your use. Table 1 shows you these with a brief description of each.

Table 1. XML SAX interfaces and their uses.

SAX Interface        Purpose
-------------------  ----------------------------------------------------------------------
ISAXAttributes       Provides access to an XML node's attributes
ISAXContentHandler   Provides notifications of and access to basic XML document information
ISAXDTDHandler       Provides notifications of DTD-related events
ISAXEntityResolver   Provides notifications of references to external entities
ISAXErrorHandler     Provides for customized error handling
ISAXLocator          Associates a SAX event with a location in the XML document
ISAXXMLFilter        Allows you to filter (modify) the XML stream before event notification
ISAXXMLReader        The base parser interface

As you can see, there aren't many SAX interfaces to memorize. Furthermore, some of them either aren't implemented in the current version (ISAXEntityResolver), or they contain event handler methods that aren't called by the parser at this time (ISAXErrorHandler).

The most important interfaces are ISAXXMLReader and ISAXContentHandler. You initialize the parser using ISAXXMLReader by providing it pointers to objects designed to handle event notifications. A good example of this is an object that implements ISAXContentHandler, as the parser sends the notifications to this object when it encounters nodes, processing instructions, and so on. In the specific case of the content handler, you might follow this approach:

1. Create the parser object using CoCreateInstance().

2. Create your object that implements ISAXContentHandler.

3. Bind your content handler to the parser using ISAXXMLReader::PutContentHandler().

4. Handle event notifications as required.

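You'd follow a similar line of reasoning for other handlers, such as those for errors (ISAXErrorHandler) and DTD information (ISAXDTDHandler). Attributes are a little different: rather than binding a handler, you receive an ISAXAttributes pointer from the parser with each StartElement() notification (you'll see the parameter in the method signatures later in this article). If you don't want to handle a particular event, you simply don't bind a handler to the parser. In those cases, the parser simply doesn't flag the event, and processing continues.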

Microsoft has provided a simple SAX demonstration, which I've included in the accompanying Download file. Here's the (abbreviated) code to invoke the parser:

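ISAXXMLReader* pRdr = NULL; 
 
 // Create the SAX reader (parser) object
 HRESULT hr = CoCreateInstance(CLSID_SAXXMLReader,
                               NULL, 
                               CLSCTX_ALL, 
                               IID_ISAXXMLReader, 
                               (void **)&pRdr); 
 
 ...
 
 // Create the content handler and bind it to the reader
 MyContent * pMc = new MyContent(); 
 hr = pRdr->PutContentHandler(pMc); 
 
 ...
 
 // Parse the document; the handler receives the events
 hr = pRdr->ParseURL(URL,-1); 
 printf("\nParse result code: %08x",hr); 
 
 pRdr->Release();
 
 // Not in the original sample (see the discussion below)
 delete pMc; 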
  

In this case, the example's author creates an instance of the SAX reader object as well as an instance of the content handler (note that it isn't a true COM object), marries the two using PutContentHandler(), and parses the XML document (the URL of which is contained in the string URL, passed as a parameter to ParseURL()). I noticed that the example application didn't delete pMc, so I added that line of code myself. I didn't see in the documentation where the parser would take responsibility for managing handler memory, so I assume it doesn't and the omission of the delete is an oversight on the example author's part.

A more interesting example, in my opinion, is one that uses a COM object for parse events. But to do this, you need to implement every method of every SAX object interface, which is painful. But then, that's why I'm here!

Since I'm enamored of ATL, I decided to implement the parser interfaces in ATL templates. That way, you could use as little or as much as you'd like. My implementations do very little, but if you're not interested in a specific event, that isn't too bothersome. If you're interested in a specific event, you simply provide an implementation in your derived class, and the parser uses that instead of my templated versions. Perhaps some code (with explanation) will make this more clear. Open the SaxObj1 project if you want to follow along.

The ATL SAX interface implementations can be found in atlsax.h, included in the Download file. Although there are implementations for all of the SAX interfaces, I'll only show you ISAXContentHandler:

//////////////////////////////////////////////////////
 // ISAXContentHandlerImpl
 template <class T>
 class ATL_NO_VTABLE ISAXContentHandlerImpl 
                   : public ISAXContentHandler
 {
 public:
     STDMETHOD(PutDocumentLocator)
                      (ISAXLocator __RPC_FAR *pLocator)
     {
         return S_OK;
     }
 
     STDMETHOD(StartDocument)()
     {
         return S_OK;
     }
 
     STDMETHOD(EndDocument)()
     {
         return S_OK;
     }
          
     STDMETHOD(StartPrefixMapping)
                  (const wchar_t __RPC_FAR *pwchPrefix,
                   int cchPrefix, 
                   const wchar_t __RPC_FAR *pwchUri, 
                   int cchUri)
     {
         return S_OK;
     }
          
     STDMETHOD(EndPrefixMapping)
                 (const wchar_t __RPC_FAR *pwchPrefix, 
                  int cchPrefix)
     {
         return S_OK;
     }
          
     STDMETHOD(StartElement)
           (const wchar_t __RPC_FAR *pwchNamespaceUri, 
            int cchNamespaceUri, 
            const wchar_t __RPC_FAR *pwchLocalName, 
            int cchLocalName, 
            const wchar_t __RPC_FAR *pwchRawName, 
            int cchRawName, 
            ISAXAttributes __RPC_FAR *pAttributes)
     {
         return S_OK;
     }
          
     STDMETHOD(EndElement)
           (const wchar_t __RPC_FAR *pwchNamespaceUri, 
            int cchNamespaceUri, 
            const wchar_t __RPC_FAR *pwchLocalName, 
            int cchLocalName, 
            const wchar_t __RPC_FAR *pwchRawName, 
            int cchRawName)
     {
         return S_OK;
     }
          
     STDMETHOD(Characters)
                  (const wchar_t __RPC_FAR *pwchChars, 
                   int cchChars)
     {
         return S_OK;
     }
          
     STDMETHOD(IgnorableWhitespace)
          (const wchar_t __RPC_FAR *pwchChars, 
           int cchChars)
     {
         return S_OK;
     }
         
     STDMETHOD(ProcessingInstruction)
                  (const wchar_t __RPC_FAR *pwchTarget,
                   int cchTarget, 
                   const wchar_t __RPC_FAR *pwchData, 
                   int cchData)
     {
         return S_OK;
     }
          
     STDMETHOD(SkippedEntity)
                    (const wchar_t __RPC_FAR *pwchName,
                     int cchName)
     {
         return S_OK;
     }
 }; 
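
If you'd like to see the idea behind these do-nothing defaults without the COM and ATL machinery, here's a minimal portable sketch. Everything in it is a stand-in invented for illustration (ToyResult, IToyContentHandler, ToyContentHandlerImpl, ElementCounter, and fireSampleEvents() aren't part of atlsax.h or the Microsoft SDK), and the interface is trimmed to four methods to keep it short.

```cpp
#include <cassert>

// Stand-ins so the sketch compiles without the Windows SDK (assumptions).
typedef long ToyResult;
const ToyResult TOY_OK = 0;

// Hypothetical pure interface, playing the role of a handler interface.
struct IToyContentHandler {
    virtual ~IToyContentHandler() {}
    virtual ToyResult StartDocument() = 0;
    virtual ToyResult StartElement(const wchar_t* pwchName, int cchName) = 0;
    virtual ToyResult EndElement(const wchar_t* pwchName, int cchName) = 0;
    virtual ToyResult EndDocument() = 0;
};

// Template base that implements every method as "succeed and do nothing".
// As in the listing above, the template parameter isn't used by the
// defaults themselves.
template <class T>
class ToyContentHandlerImpl : public IToyContentHandler {
public:
    ToyResult StartDocument() override { return TOY_OK; }
    ToyResult StartElement(const wchar_t*, int) override { return TOY_OK; }
    ToyResult EndElement(const wchar_t*, int) override { return TOY_OK; }
    ToyResult EndDocument() override { return TOY_OK; }
};

// A handler that cares about one event only: it overrides StartElement
// and inherits the defaults for everything else.
class ElementCounter : public ToyContentHandlerImpl<ElementCounter> {
public:
    ToyResult StartElement(const wchar_t*, int) override {
        ++count;
        return TOY_OK;
    }
    int count = 0;
};

// Plays the part of the parser: it sees only the interface and fires the
// events for the small sample document used earlier in the article.
inline void fireSampleEvents(IToyContentHandler& handler) {
    handler.StartDocument();
    handler.StartElement(L"myxmldoc", 8);
    handler.StartElement(L"title", 5);
    handler.EndElement(L"title", 5);
    handler.StartElement(L"greeting", 8);
    handler.EndElement(L"greeting", 8);
    handler.EndElement(L"myxmldoc", 8);
    handler.EndDocument();
}
```

Because the caller works through the interface, it neither knows nor cares which methods the derived class chose to override. Virtual dispatch picks the override where one exists and the default otherwise.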
  

Note that atlsax.h assumes you have xmlsax.h (from the Microsoft XML SDK) on your include file path somewhere! If not, you'll need to download the SDK from the Microsoft MSDN site (I provided the URL earlier in the article).

If you want to create a COM object that handles SAX content events, you include the content handler implementation in your C++ (ATL) object inheritance list (from SAXObject1.h):

class ATL_NO_VTABLE CSAXObject1 : 
     public CComObjectRootEx<CComSingleThreadModel>,
     public CComCoClass<CSAXObject1, 
                   &CLSID_SAXObject1>,
     public ISAXContentHandlerImpl<ISAXObject1>,
     public ISAXErrorHandlerImpl<ISAXObject1>,
     public IDispatchImpl<ISAXObject1, 
                   &IID_ISAXObject1, &LIBID_SAXOBJ1Lib>
 {
     ...
 };
 

This provides your object with the base implementation for ISAXContentHandler, which you'll tailor later. (Note that this example handles SAX errors as well.) As with any COM interface that external clients can query for, you then add ISAXContentHandler to the ATL COM interface map:

BEGIN_COM_MAP(CSAXObject1)
     COM_INTERFACE_ENTRY(ISAXObject1)
     COM_INTERFACE_ENTRY(ISAXContentHandler)
     COM_INTERFACE_ENTRY(ISAXErrorHandler)
     COM_INTERFACE_ENTRY(IDispatch)
 END_COM_MAP()
 

At some point, you need to bind your handler to the parser. That means you must follow the steps I outlined previously, which I do in my COM object's FinalConstruct() method:

HRESULT CSAXObject1::FinalConstruct()
 {
     // Create the SAX parser
     HRESULT hr = S_OK;
     try {
         // Create the reader
         hr = m_pReader.CoCreateInstance(
                          __uuidof(SAXXMLReader));
         if ( FAILED(hr) ) throw hr;
 
         // Hook up the content handler
         hr = m_pReader->PutContentHandler(this);
         if ( FAILED(hr) ) throw hr;
 
         // Hook up the error object
         hr = m_pReader->PutErrorHandler(this);
         if ( FAILED(hr) ) throw hr;
     } // try
     catch(HRESULT hrErr) {
         // Some COM error...
         hr = hrErr;
     } // catch
     catch(...) {
         // Some error
         hr = E_UNEXPECTED;
     } // catch
 
     return hr;
 }
 

As you might recall, the first step is to create the parser object:

hr = m_pReader.CoCreateInstance(
                           __uuidof(SAXXMLReader));
 

The second step is to create the handler object, which has been done by COM and ATL (the C++ class constructor has executed, so we have a valid this pointer). Therefore, we simply bind ourselves to the parser:

hr = m_pReader->PutContentHandler(this);
 

Now comes the really cool part. The base implementation for ISAXContentHandler does nothing more than return S_OK for all of the interface methods, and perhaps this is appropriate for the majority of them. But let's say you're interested in the XML document's nodes and want an event whenever the parser determines a new node is coming through. In that case, you'd want to implement ISAXContentHandler::StartElement(). To do this, you first add the method to the class definition:

// ISAXContentHandler overrides
 STDMETHOD(StartElement)(
             const wchar_t __RPC_FAR *pwchNamespaceUri,
             int cchNamespaceUri, 
             const wchar_t __RPC_FAR *pwchLocalName, 
             int cchLocalName, 
             const wchar_t __RPC_FAR *pwchRawName, 
             int cchRawName, 
             ISAXAttributes __RPC_FAR *pAttributes);
 

Then you add the meat of the method (from SAXObject1.cpp):

STDMETHODIMP CSAXObject1::StartElement(
             const wchar_t __RPC_FAR *pwchNamespaceUri,
             int cchNamespaceUri, 
             const wchar_t __RPC_FAR *pwchLocalName, 
             int cchLocalName, 
             const wchar_t __RPC_FAR *pwchRawName, 
             int cchRawName, 
             ISAXAttributes __RPC_FAR *pAttributes)
 { 
     prt("\n<%s>",pwchLocalName,cchLocalName); 
     return S_OK; 
 } 
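
Notice that each string parameter arrives with an explicit character count. That suggests you shouldn't count on the buffers being null-terminated, so it's safest to copy exactly the number of characters you were given before treating the data as an ordinary string. Here's a small sketch of that; fromCounted() and formatStartTag() are hypothetical helpers of my own naming, not part of the sample or the SDK.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Copies exactly cch characters from a counted buffer. It never reads
// past the count, so it doesn't depend on a terminating L'\0'.
inline std::wstring fromCounted(const wchar_t* pwch, int cch) {
    if (pwch == nullptr || cch <= 0) return std::wstring();
    return std::wstring(pwch, static_cast<std::size_t>(cch));
}

// Builds the text a StartElement handler might print for an element name.
inline std::wstring formatStartTag(const wchar_t* pwchLocalName,
                                   int cchLocalName) {
    return L"<" + fromCounted(pwchLocalName, cchLocalName) + L">";
}
```

For example, given a buffer holding title>Basic and a count of 5, formatStartTag() returns the string <title> and ignores the characters that follow.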
 

For this example, I used the same prt() method that the Microsoft author created for the SAX example application. In this way, my ATL example and Microsoft's command line example provide similar functionality, allowing you to concentrate on the COM-specific aspects of my sample code. If you run my ATL COM example, you'll find it produces the same results as the Microsoft example application (or at least it should!). I wrote a console-based test application called TestSaxObj1 and the interesting portion of main() is:

CoInitialize(NULL);
 
 HRESULT hr = S_OK;
 try {
     CComPtr<ISAXObject1> pSAXObj1;
     hr = 
       pSAXObj1.CoCreateInstance(__uuidof(SAXObject1));
     if ( FAILED(hr) ) {
         cout << "Error creating object" << endl;
         throw hr;
     } // if
 
     CComBSTR bstrFile(argv[1]);
     hr = pSAXObj1->LoadAndGo(bstrFile);
     if ( FAILED(hr) ) {
         cout << "Error parsing document" << endl;
         throw hr;
     } // if
 } // try
 catch(...) {
     // Check for an error record
     //(Error handling code removed for brevity.)
 } // catch
 
 CoUninitialize();
 

That's probably what you'd expect—it creates an instance of my homebrew COM object, which encapsulates the SAX parsing capability. The application then interprets the command line and passes the filename to the COM object for parsing. The COM object then parses the XML information contained within the file and prints the results to the console.

I'm sure you'll find many uses for the SAX-based XML parser. One I can easily imagine is to use the SAX parser to extract method information from SOAP packets (fault or valid method data). (Be sure to see my July article regarding SOAP for more information about that XML-based technology!) In any case, I'm sure you'll agree this adds an exciting new dimension to your XML processing capabilities!

Comments? Questions? Find a bug? Please send me a note!

