The eXtensible Markup Language--A C++ Developer's Primer

By: Kenn Scribner for Visual C++ Developer

If there is one technology I believe developers should invest in, it is XML. XML is exploding, and considering the propensity to develop thin-client applications these days (where XML shines), that only makes good sense. Here is the first of three articles I wrote to introduce people to XML. Because this was intended for publication, some descriptions are much briefer than I would have liked, but there should be enough here to get you started.

Note the examples are given in Windows using Visual C++ 6.0, and also remember the target audience was C++ developers. Therefore, I use MFC and the MSXML parser. If you're a Linux fan, I'd bet you can still follow along with little trouble. Visual Basic or other language developers should also be able to glean a lot from these articles, even if the example source code might appear cryptic in places. I hope not, so definitely read the articles and see how you do.  :)


Background: I started looking into XML and XSL in 1998 when the initial betas of IE 5 were coming out (in fact, Microsoft released to us a beta of MSXML even before our IE 5 beta). We were developing a client-server application with a thick client that would send processed data to the server for analysis. (In fact, it was our own version of SOAP, though we didn't call it that at the time.) When Kate (Kate Gregory, the editor of VCD) contacted me and asked for an article on XML, I invited myself in for a series of 4! At that time I was completing the SOAP book and wanted to lead the VCD readers through an introduction to XML right to SOAP and Biztalk. That was the goal, anyway...give these articles a read and see if I hit the mark.


Part I, XML: A C++ Developer's Primer
(Part II, The DOM and XSL)
(Part III, SOAP)
(Part IV, Biztalk)

How do we share information? How do we communicate? We use any number of mechanisms, such as conversation and body language. We use music, or art. We use the written word. I could even beat on a tree trunk if that would transmit the information I wish to share to someone else.

If we limit ourselves to discussing the written word, what makes it so compelling? I'd argue it's our ability to render complex thoughts into a format others can read and hopefully understand. When we write, we follow specific rules. We begin with the selection of a character set. This character set identifies the low-level idiomatic symbols we'll use to represent our thoughts. We might use the ANSI character set, as I am here, or we might choose to use Kanji and be much more expressive. We could even use an arbitrary character set we create for ourselves and hope the reader will intuit the meaning.

Given a character set, we select a language (which might play into our selection of the original character set). We select a language because it imposes a set of rules we're bound to follow; then others who know the language can correctly interpret our thoughts. These rules might be syntactical, such as punctuation and spelling, or they may be semantical, such as grammar and sentence construction. Usually the trick is to select a language that most of our audience will understand and properly process. Sometimes we're interested in a particular language because it's easier to express our thoughts using that language.

Finally, we collect our thoughts and apply the character set and language as we record those thoughts for others to access. Typically, this also involves a vocabulary. We have to use words and phrases related to our thoughts. You expect me to use a computing-related vocabulary in this article. You don't expect me to use words such as trans-1,3-dithiane-1,3-dioxide, or worse, discuss how this stuff reacts with aromatic aldehyde to form one of several possible diastereoisomers.

Why XML?

By now you're probably wondering what all of this has to do with XML. I could have been more concise and simply said something like "XML is tagged text stored in a text file." But that's like saying "Music is a series of timed tones." We all understand music to be much more than a series of timed tones. In the same manner, XML is more than simple text in a text file. For one thing, it's not always in a file: The data can be encapsulated in a variety of containers. You could store the text in a text file, or you could ship it around your network in a TCP/IP packet. There's also no requirement that the entire XML document must exist in one container. It could be spread over several files, for example. And other people have written handy XML parsers that you can use to extract the meaning from this file without having to write the code yourself.

If all I wanted to do was to share text between a few systems, I could simply use a tab-delimited text file, or perhaps a DIF file, if you remember that format. Or, just as likely, I would create my own interchange format. Why choose XML over these other formats? A greater audience will ultimately understand XML as a language, and the information I consider important will reach more people than if I chose a proprietary format, which I alone understand, or another format that has a more limited audience, such as tabbed-text.

I also have to consider how the language is interpreted, or in this case, parsed. If I use a tab-delimited text file, I have to hope the programmer who added the tabbed-text parser to each application did a good job and can faithfully render my file. There must be dozens, if not hundreds, of tabbed-text parsers out there. There are far fewer XML parsers, and some of them have gained very wide acceptance. It's more likely my XML information will be parsed by a similar parser, if not the same parser, on a wide variety of systems. This lessens the likelihood my data will be misinterpreted or even completely uninterpretable.

I also have the ability to create my own rules (semantics) that the other systems will be able to use and apply to my XML information. If I don't apply my own rules to the information, the XML parsers will follow a well-defined course of action and parse the information in a standardized fashion. I'm able to create entirely new vocabularies with relative ease. I can even select a character encoding, although most often you'll find people using UTF-8 (a Unicode encoding that is byte-for-byte identical to ASCII for plain English text). The beauty of this is that I can easily change the nature of the XML information, and the XML parser can still sift through my information, present it to the target system, and know the information will be faithfully interpreted.

Don't think of XML as merely a way to record information in a text file. Rather, you should think of XML as a new way to exploit information exchange. Changes to your data or format should be easily manageable in the future by today's XML-friendly architectures. If you want to change how your information is interpreted, simply publish a new set of rules, and the XML parsers out there will apply the new rules and extract the information.

Now, let's get more concrete. XML is actually a specification, or to be more precise, a series of related specifications published by the World Wide Web Consortium (W3C), whose site holds a wealth of XML information. The specifications allow for disparate implementations to follow a single set of guidelines, with the goal that any system that does adhere to the specifications will handle XML information in the same manner as other systems, increasing interoperability.

XML tagging

The basic XML specification, Extensible Markup Language (XML) 1.0, was released in February 1998. Other related technologies followed. The specification describes XML as "a subset of SGML". SGML, or the Standard Generalized Markup Language, features information between textual tags. The tags identify the structure of the document rather than its formatting or style (for presentation). The markup language itself adheres to rigid rules to allow for both human and computational interpretation. You're probably familiar with another markup language based on SGML, the HyperText Markup Language, or HTML, which is used to present information over the World Wide Web. If you're familiar with HTML, the concept of basing a tagging language upon SGML should not be completely foreign to you.

HTML and XML might appear to be the same, yet they're actually vastly different. Here's an HTML document that displays "Hello, World!":

<html>
    <head>
       <title>Basic HTML Markup</title>
    </head>
    <body bgcolor="#FFFFFF">
       <h1>Hello, World!</h1>
    </body>
</html>

A similar XML file might look like this:

<?xml version="1.0"?>
    <title>Basic XML Markup</title>
    <greeting>Hello, World!</greeting>

There are plenty of similarities here: < and > characters around tag names, / to indicate the end of a tag, nesting of tags (overlapping isn't allowed in either markup language), and a freeform structure that ignores whitespace. The differences become apparent when you look at the tags themselves, which I've indented for clarity. In the XML file, I specified which version of XML I'm using (version 1.0) and I incorporated "Hello, World!" between the <greeting></greeting> tags, referred to collectively as an element. At no time did I say anything about how the phrase should be displayed! I merely gave you the phrase. The HTML file, on the other hand, did tell you how to display the phrase. See the <h1></h1> tag pair? While individual browsers are free to interpret more precisely what the HEADING1 style implies (bold, large font, or otherwise), the document's author very clearly dictated the phrase was to be in this formatting style. To XML, the content is more important than the style. To HTML, the style is more important than the content.
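
To make the point about content versus style concrete, here is a minimal sketch in Python, chosen only because its standard library ships a non-validating XML parser. The <document> root element is an assumption, added so that the title and greeting sit inside a single enclosing element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the <document> root is an assumption made
# so that the title and greeting share one enclosing element.
HELLO_XML = """<?xml version="1.0"?>
<document>
    <title>Basic XML Markup</title>
    <greeting>Hello, World!</greeting>
</document>"""

root = ET.fromstring(HELLO_XML)

# The parser hands back content keyed by tag name; nothing in the
# document says how the greeting should be displayed.
title = root.findtext("title")
greeting = root.findtext("greeting")
print(title, "->", greeting)
```

What to do with the greeting (print it, render it as a heading, store it) is left entirely to the program that reads it.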

The HTML tags have precise meaning, which is specified by the DTD that the document's <!DOCTYPE> tag identifies.

For example, the HTML file encapsulates the marked-up data between the <html></html> tag pair. If you have a title, it will be between the <title></title> tag pair, which falls within the <head></head> tag pair, and so on. You know this because the DTD, or Document Type Definition, tells you so. The DTD lays out the framework for an HTML document and gives meaning to all of the HTML tags. It also tells you which tags must be paired, and, if a tag doesn't require an end tag, what conditions terminate the scope of the tag. A good example is the HTML PARAGRAPH tag, <p>. Normally, you don't see people use the </p>—it's far more common to see an author use another <p> tag. The mere existence of a second <p> tag implies the termination (</p>) of the scope of the first tag.

XML is much more formal about tagging rules. All XML tags have a beginning and ending tag, though there's a shortcut if there's no data. For example, if I have a title for my XML document, I could wrap it in the <title></title> tag pair. If I want to emphasize I have no title, rather than omit the tags entirely, I could write them both in this shorthand notation: <title/>. Another rule is that nested tags can't overlap. The following is incorrect XML:

<title>Malformed XML<greeting>Don't do this!</title>
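
A quick way to see these rules enforced is to hand both forms to a parser. The sketch below uses Python's standard library parser as a stand-in for any non-validating XML parser; the helper name is_well_formed is my own.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if a non-validating parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# The overlapping example from the text, and a properly nested variant.
overlapping = "<title>Malformed XML<greeting>Don't do this!</title>"
nested = "<title>Well-formed XML<greeting>Do this</greeting></title>"

overlapping_ok = is_well_formed(overlapping)
nested_ok = is_well_formed(nested)

# The <title/> shorthand parses to the same empty element as <title></title>.
short_form = ET.fromstring("<title/>")
long_form = ET.fromstring("<title></title>")
same_element = (short_form.tag, short_form.text) == (long_form.tag, long_form.text)

print(overlapping_ok, nested_ok, same_element)
```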

Here's something to keep in mind: XML itself doesn't specify what the tags represent or what text should be embedded within the tag. The tags I used in my XML example were completely arbitrary, with the exception of the processing instruction <?xml version="1.0" ?> (more about this shortly). The existence of the tags is what's important, as well as the fact that they have a beginning and an ending tag to delineate scope. You can, if you want, mandate a tag architecture by creating your own DTD, which I'll also address later in this article.

XML and HTML are similar in that you can specify attributes within the tag. In this HTML tag, the background body (document) color is specified to be white:

<body bgcolor="#FFFFFF">

How the attributes and tags are related is also specified in the DTD. The syntax for an attribute is always the same, though: attrib="value". In this case, attrib is the attribute's name, and value is the value assigned to it for the element in which it appears. XML is stricter than HTML here: the value must always be quoted.
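
As a small illustration of the attrib="value" syntax, this Python sketch reads the bgcolor attribute back from a parsed tag. The tag is written as an empty element so that it forms a complete XML document on its own. The second string omits the quotes around the value, which HTML browsers tolerate but an XML parser rejects.

```python
import xml.etree.ElementTree as ET

# The tag from the text, written as an empty element.
body = ET.fromstring('<body bgcolor="#FFFFFF"/>')
color = body.get("bgcolor")      # value of a present attribute
missing = body.get("text")       # None: the attribute was never given

# Unquoted attribute values are not well-formed XML.
try:
    ET.fromstring("<body bgcolor=#FFFFFF/>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False

print(color, missing, unquoted_accepted)
```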

XML and Data

You might be tempted to argue that HTML could be used to represent information content as well as to present it. The problem with HTML is twofold. First, HTML loses the structural association between data items. For example, is the information recorded between <div></div> tags and <li></li> tags in this example hierarchical?

<div>
    <ul>
       <li>Bullet 1.1</li>
       <li>Bullet 1.2</li>
       <li>Bullet 1.3</li>
    </ul>
    <div>
       <ul>
          <li>Bullet 2.1</li>
          <li>Bullet 2.2</li>
          <li>Bullet 2.3</li>
       </ul>
    </div>
</div>

Without some insight, we can't tell. HTML only tells us we have two bulleted lists. When displayed in a browser, the lists will indent from the left edge of the browser window the same distance, so the fact one <div></div> pair is embedded within another isn't displayed to the user. You can see this in Figure 1.

The second problem is that the HTML tags allow for ambiguous interpretation of the data between the tags. Consider this example—how would you interpret the data?

    <h2>John and Marsha Doe</h2>
    <h3>123 Elm Lane</h3>
    <h3>Anytown, NY</h3>
    <h2>(123) 456-7890</h2>

In this case, the tags merely represent how the text should be rendered, not what the text represents. Clearly you see the data as an address, but to an automated system there's no syntactical difference between the street address and the postal ZIP code. Even if you applied semantical meaning to the ordering of tag tuples, you still have to account for differences in addresses:

    <h2>John's VaporWare, Inc.</h2>
    <h3>123 Elm Lane</h3>
    <h3>Suite 103, far desk on left</h3>
    <h3>Anytown, NY</h3>
    <h2>(123) 555-9876</h2>

The fact that there's an additional modifier to the street address (John's in a suite) complicates automated parsing matters tremendously.

However, take an XML approach and wrap the data in meaningful tags (note I added a comment tag, which is syntactically the same as an HTML comment):

<voter>
    <!-- Registered voter information -->
    <name>John and Marsha Doe</name>
    <address>
       <street1>123 Elm Lane</street1>
    </address>
    <phone>(123) 456-7890</phone>
</voter>

Here, there's a clear distinction between the voter's name, address, and phone number. You can also see the address grouping and how a secondary street address modifier is disambiguated. This neatly handles the two limitations I mentioned with using HTML to express data content. The data is very clearly organized, and the tags leave no ambiguity as to their data content. What's even better, I'll discuss a mechanism in the next article you can use to convert this XML to any other format you like, including HTML, RTF, or even tab-delimited text! That way, you can keep your data content separate from the presentation of the content.
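
To show how meaningful tags remove the guesswork for an automated system, here is a minimal Python sketch that pulls the fields out by name. The <voter> and <address> wrappers, the <street2>, <city>, and <state> elements, and their values are assumptions pieced together from the examples in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the wrapper elements and the street2, city,
# and state fields are assumptions based on the article's examples.
VOTER_XML = """
<voter>
    <!-- Registered voter information -->
    <name>John's VaporWare, Inc.</name>
    <address>
        <street1>123 Elm Lane</street1>
        <street2>Suite 103, far desk on left</street2>
        <city>Anytown</city>
        <state>NY</state>
    </address>
    <phone>(123) 555-9876</phone>
</voter>"""

voter = ET.fromstring(VOTER_XML)

# Each field is addressed by its tag, so an extra address line
# cannot be mistaken for the city or the phone number.
record = {
    "name": voter.findtext("name"),
    "street1": voter.findtext("address/street1"),
    "street2": voter.findtext("address/street2"),
    "city": voter.findtext("address/city"),
    "phone": voter.findtext("phone"),
}
print(record)
```

Contrast this with the <h2>/<h3> version, where the only way to find the city is to count lines and hope.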

XML and the XML document

XML documents, like HTML documents, follow a standard layout. That is, you expect to see certain things in certain places. XML documents are made up of declarations, comments, elements, and processing instructions. Some are required, while others are optional. The usual logical arrangement is to have a prolog followed by a document element.

The prolog introduces the document to the parser and provides it with some basic information, all optional. It consists of the XML declaration, which is actually a processing instruction, and the document type declaration. You've seen the XML declaration:

<?xml version="1.0"?>

This simply tells the parser which version of the XML specification this document conforms to. The xml keyword is always in lower case, just as you see here. Note the question marks: they denote a processing instruction. The version in use is indicated by the version attribute. The document type declaration can be more complicated, depending upon what information is included there. It typically follows this form:

<!DOCTYPE decl>

Here, decl tells the parser what rules to use when parsing the remaining XML information. In a nutshell, it identifies the DTD to use. I'll cover this in a bit more detail in the next section, where I discuss well-formed vs. validated XML.
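
The prolog can be inspected programmatically. This Python sketch uses the expat bindings in the standard library, which report the XML declaration and the DOCTYPE through callbacks. The document and the voters.dtd file name are hypothetical; expat is non-validating, so the external DTD is named but never loaded.

```python
import xml.parsers.expat

# Hypothetical document; the voters.dtd file name is an assumption.
PROLOG_XML = """<?xml version="1.0"?>
<!DOCTYPE voters SYSTEM "voters.dtd">
<voters/>"""

def read_prolog(xml_text):
    """Return what the XML declaration and DOCTYPE tell the parser."""
    info = {}

    def on_xml_decl(version, encoding, standalone):
        info["version"] = version

    def on_doctype(name, system_id, public_id, has_internal_subset):
        info["root"] = name       # element the DTD describes
        info["dtd"] = system_id   # where the external DTD lives

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return info

prolog = read_prolog(PROLOG_XML)
print(prolog)
```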

Following the prolog, if there is one, you'll always find the document element. Think of it as the root node of a tree data structure. The document element is represented by a tag pair, and all XML information your document contains must be embedded (nested) between this tag pair, often referred to as the root element or root tag. Feel free to nest as much or as little information within the root element as you like. You could create a frivolous document such as this:


Or you could tap into the powerful features XML provides you. Just remember that the whole document must reduce to a single element: If you want your file to hold 10 voter records, then you need another tag, such as <voters>, to surround them all.
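
The single-root rule is easy to demonstrate. In this Python sketch (the voter names A and B are placeholders), two sibling records with no enclosing element are rejected, while the same records wrapped in <voters> parse cleanly.

```python
import xml.etree.ElementTree as ET

# Two records side by side, with no enclosing document element.
two_roots = "<voter><name>A</name></voter><voter><name>B</name></voter>"

try:
    ET.fromstring(two_roots)
    unwrapped_accepted = True
except ET.ParseError:
    unwrapped_accepted = False

# Wrapping the records in one root element makes a legal document.
wrapped = ET.fromstring("<voters>" + two_roots + "</voters>")
voter_count = len(wrapped.findall("voter"))

print(unwrapped_accepted, voter_count)
```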

Well-formed vs. validated XML

The XML examples I've shown you so far, though simplistic, are examples of well-formed XML. That is, for each beginning tag there was an ending tag. No tags overlapped improperly, which is to say they all followed proper nesting rules. Any contemporary non-validating XML parser would be able to read the XML I've presented and make sense of it. The key is non-validating. Let's see what all of this means.

When people refer to well-formed XML, they're referring to XML that follows the specification. You can provide a DTD with your XML document to apply more rigid syntax and semantics to your tags. If you provide a DTD, a validating XML parser will read the DTD to discover what rules should be applied to the XML document, then it will parse the document according to those rules. Change the DTD and you change the rules, but the parser will still read the XML data (assuming no other error). A document that meets all of the rules of its DTD is said to be valid XML.

DTDs, as you might have guessed, aren't written in XML; they predate it and are also used by older technologies such as HTML. They're written in their own language, and naturally they have their own vocabulary. They can come in two parts: an internal DTD and an external DTD. You can have either one, both, or neither. An internal DTD is incorporated into your XML document. The external DTD is a reference to a file that contains the DTD information the parser should locate, read, and apply to the XML document. Either type of DTD is specified using a special XML tag, DOCTYPE (the same as is used for HTML). I'll give an example of an internal DTD. (The application of an external DTD is similar. Only the specification within the DOCTYPE tag differs: the internal DTD actually lists the rules in this declaration, while the external DTD is merely referenced here.)

External DTDs make sharing data much simpler. You and the other application developer agree on the rules for the data you'll share, build a DTD that describes those rules, and put the DTD somewhere both applications can reach—for example, on a public Web server. All the XML parsers can use this same external DTD—and you're sure you're all using the same rules.

Let's go back to my address example. To a non-validating XML parser, the following XML:

<street1>1122 Maple Street</street1>
 <street2>Apt 12J</street2>

is equivalent to this XML:

<street2>Apt 12J</street2>
 <street1>1122 Maple Street</street1>

The XML violates no inherent syntax rules (such as overlapping tags), so it's well-formed and therefore legal. However, to you and me the two examples have very different meanings. The first example makes sense, but the second example is backwards. You typically don't specify the apartment number before you specify the street address of the apartment building to which you refer. The information here requires semantics as well as proper syntax.
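
The two orderings can be fed to a non-validating parser to confirm that neither raises an error. In this Python sketch the <address> wrapper is an assumption, added so that each fragment has a single root element.

```python
import xml.etree.ElementTree as ET

# The <address> wrapper is an assumption; the street lines are from the text.
first = ET.fromstring(
    "<address><street1>1122 Maple Street</street1>"
    "<street2>Apt 12J</street2></address>")
second = ET.fromstring(
    "<address><street2>Apt 12J</street2>"
    "<street1>1122 Maple Street</street1></address>")

# Both parse; only the order of the children differs.
first_order = [child.tag for child in first]
second_order = [child.tag for child in second]
same_tags = sorted(first_order) == sorted(second_order)

print(first_order, second_order, same_tags)
```

The parser reports the children in document order and nothing more; deciding that street2 before street1 is a mistake takes extra rules.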

So, let's create a DTD that mandates the tags and their order. If you run this XML file through a validating XML parser, the file should parse correctly:

<?xml version="1.0" ?>
    <!-- Voter registration DTD -->
    <!-- Next follow zero or more voter records -->
       <!-- Registered voter information -->
       <name>John and Marsha Doe</name>
          <street1>123 Elm Lane</street1>
       <phone>(123) 456-7890</phone>

In fact, I created an XML file and typed in the code you just saw. As it happens, Internet Explorer 5.0 (IE5) is capable of parsing XML. I opened the file using IE5, and you can see the results in Figure 2. One interesting feature about IE5 is that it displays the XML in a tree form, and you can expand and contract the XML "nodes" to see more or less information at will.

However, if you swapped the order of the street addresses, you should receive an error to the effect of "Invalid element content." That's because the DTD I created says that an address element consists of a street1 element, a street2 element, and so on in that order, as indicated by the comma-separated list. If order didn't matter, you would write the DTD like so:


The #PCDATA keyword indicates the given element, such as name, consists of parsed character data rather than a literal value. Parsed character data tells the parser to examine the content for markup and apply the rules there also. If the content may contain XML-like tags that the parser should leave alone, wrap that content in a CDATA section, <![CDATA[ ... ]]>, in the document; within a DTD, the CDATA keyword itself applies to attribute values, not to element content.
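
The difference between text the parser examines and text it leaves alone can be sketched in Python. The <note> element and its content are hypothetical. The first document lets the parser see a tag inside the text, the second escapes the angle brackets, and the third protects the text with a CDATA section in the document itself.

```python
import xml.etree.ElementTree as ET

# Parsed character data: the parser finds and processes the nested tag.
parsed = ET.fromstring("<note>Use <title/> for headings</note>")
parsed_children = [child.tag for child in parsed]

# Escaped markup: the angle brackets arrive as ordinary characters.
escaped = ET.fromstring("<note>Use &lt;title&gt; for headings</note>")

# CDATA section: everything inside is passed through untouched.
protected = ET.fromstring("<note><![CDATA[Use <title> for headings]]></note>")

print(parsed.text, parsed_children)
print(escaped.text)
print(protected.text)
```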

There's a great deal more to writing DTDs than I've mentioned here, and since I'd rather concentrate on XML, I'll leave my coverage of DTDs at this point. If you gather from this you can have well-formed vs. validated XML, and that to validate XML you must create a DTD and use a validating parser, that's enough for now. You'll find more regarding DTDs in the XML 1.0 specification itself.
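
The ordering rule described in this section can be sketched in Python. The internal DTD below is an assumption, written to be consistent with the element names used in this article, and the zip value is a placeholder. Python's bundled expat parser is non-validating, so the sketch reads the comma-separated content models out of the DTD and compares child order by hand. It checks only plain sequences, which is far less than a real validating parser does.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat as expat

# Hypothetical internal DTD and record; the zip value is a placeholder.
DOC_TEMPLATE = """<!DOCTYPE voters [
    <!ELEMENT voters  (voter)*>
    <!ELEMENT voter   (name, address, phone)>
    <!ELEMENT name    (#PCDATA)>
    <!ELEMENT address (street1, street2, city, state, zip)>
    <!ELEMENT street1 (#PCDATA)>
    <!ELEMENT street2 (#PCDATA)>
    <!ELEMENT city    (#PCDATA)>
    <!ELEMENT state   (#PCDATA)>
    <!ELEMENT zip     (#PCDATA)>
    <!ELEMENT phone   (#PCDATA)>
]>
<voters>
    <voter>
        <name>John and Marsha Doe</name>
        <address>%s<city>Anytown</city><state>NY</state><zip>00000</zip></address>
        <phone>(123) 456-7890</phone>
    </voter>
</voters>"""

STREET1 = "<street1>123 Elm Lane</street1>"
STREET2 = "<street2/>"

def declared_sequences(xml_text):
    """Collect unquantified comma-separated content models from the DTD."""
    sequences = {}

    def on_element_decl(name, model):
        kind, quantifier, _unused, children = model
        if (kind == expat.model.XML_CTYPE_SEQ
                and quantifier == expat.model.XML_CQUANT_NONE):
            sequences[name] = [child[2] for child in children]

    parser = expat.ParserCreate()
    parser.ElementDeclHandler = on_element_decl
    parser.Parse(xml_text, True)
    return sequences

def order_errors(xml_text):
    """List elements whose children differ from the declared sequence."""
    sequences = declared_sequences(xml_text)
    errors = []
    for element in ET.fromstring(xml_text).iter():
        expected = sequences.get(element.tag)
        actual = [child.tag for child in element]
        if expected is not None and actual != expected:
            errors.append(element.tag)
    return errors

in_order = order_errors(DOC_TEMPLATE % (STREET1 + STREET2))
swapped = order_errors(DOC_TEMPLATE % (STREET2 + STREET1))
print(in_order, swapped)
```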

XML namespaces

If you're familiar with namespaces in C++, you'll be pleased to know XML provides for the same concept. The XML namespace, as defined by the January 1999 W3C recommendation "Namespaces in XML," is described as follows:

"An XML namespace is a collection of names, identified by a URI reference [RFC2396], which are used in XML documents as element types and attribute names. XML namespaces differ from the 'namespaces' conventionally used in computing disciplines in that the XML version has internal structure and is not, mathematically speaking, a set."

Perhaps an example will help. Consider the XML document in Listing 1. In this case, the tag tuple <wavelength></wavelength> is used twice. Each use is qualified using a namespace. Assuming the URLs I invented actually exist (and I hope they don't!), and that they specified the semantics of the namespace, the wavelength information is completely different for potato chips vs. light. Here, I'm saying the potato chip has 15 waves per inch, whereas I'm talking about visible (red) light in the case of the physics namespace. Clearly there is no meaningful association between waves in a potato chip and the wavelength of visible light, yet if I omitted the namespaces, could you tell?

<?xml version="1.0" ?>
    <wavelength>15 wpi</wavelength >
    <wavelength>700 nm</wavelength >

By using namespaces in my example, I now know there are 15 waves per inch in my potato chips and that visible red light is also important. Without the namespaces, I can't tell in what context wavelength is being used, or that there are even multiple uses (semantics) for the tag. This feels a lot like applying a data type to a given XML tag, which is where schemas come in.
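
Here is how a parser keeps the two wavelength tags apart, sketched in Python. The <chip> root, the food and physics prefixes, and both example.com URIs are assumptions that stand in for the invented URLs mentioned above.

```python
import xml.etree.ElementTree as ET

# Hypothetical root element, prefixes, and namespace URIs.
SNACK_XML = """<?xml version="1.0"?>
<chip xmlns:food="http://example.com/food"
      xmlns:physics="http://example.com/physics">
    <food:wavelength>15 wpi</food:wavelength>
    <physics:wavelength>700 nm</physics:wavelength>
</chip>"""

root = ET.fromstring(SNACK_XML)

# ElementTree expands each prefix to {namespace-uri}local-name, so the
# two tags no longer collide even though both end in "wavelength".
qualified_tags = [child.tag for child in root]

# Lookups name the namespace explicitly; the short prefixes used here
# are local to this lookup and need not match those in the document.
ns = {"f": "http://example.com/food", "p": "http://example.com/physics"}
ripples = root.findtext("f:wavelength", namespaces=ns)
light = root.findtext("p:wavelength", namespaces=ns)

print(qualified_tags, ripples, light)
```

The prefix is only shorthand; what distinguishes the two tags is the URI each prefix is bound to.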

XML and data types

DTDs are the parsing rules for the XML document. The DTD will tell you what tags are required, what tags can be used multiple times, and in what order you should expect to see the tags. What DTDs don't tell you is what the tags mean. That's the purpose of the schema. The schema defines an XML vocabulary, which is another way of looking at data types. Schemas are written in XML Data, which is a dialect of XML, rather than another language. Unlike DTDs, this should make them somewhat easier to read and understand if you have to use them. (Not all parsers support schemas yet—keep that in mind when choosing a parser.) The schema is a replacement for the DTD: easier to read, easier to write, and with more functionality.

You can create your own schema and declare your own data types, although Microsoft has done some of that for you. Before I actually write a schema, let me show you one method you can use to apply data types from a predefined schema to your XML document. You might even recognize the syntax—XML namespaces.

Listing 2 shows a brief example of a typed field, using the Microsoft basic data type schema.

In this case, the name data is of the string type, while the employeenum data consists of integer information. You can see I also used an alternate form for specifying the namespace. I used the XML keyword xmlns associated with a URN. The namespace is also scoped to be valid only within the type_example tag pair. This example also shows how you apply a namespace to an attribute. In this case, the attribute type has the dt namespace applied. The namespace dt is defined in a standard Microsoft data type schema.
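
To illustrate the idea of a namespace-qualified type attribute, here is a Python sketch. The document is hypothetical, modelled on the description above rather than copied from Listing 2, and the field values are invented. The standard library parser knows nothing about the Microsoft data types, so the conversion table is my own stand-in; check your parser's documentation for the exact attribute name it recognizes.

```python
import xml.etree.ElementTree as ET

DT_NS = "urn:schemas-microsoft-com:datatypes"

# Hypothetical document following the description in the text; the
# dt:type attribute name and the field values are assumptions.
TYPED_XML = """<type_example xmlns:dt="urn:schemas-microsoft-com:datatypes">
    <name dt:type="string">John Doe</name>
    <employeenum dt:type="integer">1234</employeenum>
</type_example>"""

# Stand-in conversion table; a schema-aware parser would do this itself.
CONVERTERS = {"string": str, "integer": int}

def typed_values(xml_text):
    """Convert each field's text according to its dt:type attribute."""
    values = {}
    for field in ET.fromstring(xml_text):
        # A namespaced attribute is keyed as {namespace-uri}local-name.
        type_name = field.get("{%s}type" % DT_NS, "string")
        values[field.tag] = CONVERTERS[type_name](field.text)
    return values

values = typed_values(TYPED_XML)
print(values)
```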

If you want to create your own data types, you'll need to write a schema yourself. One benefit of this is you won't require a DTD, as the schema functionally replaces the DTD. A drawback is you need to learn XML Data. Luckily, understanding XML Data isn't too difficult, since the keywords it uses are relatively self-explanatory. Be sure to look elsewhere for more information regarding schemas, as I won't be able to cover more than the basics here.

Like DTDs, schemas refer to tag pairs as elements. An element may have attributes, and any data typing we do can be applied to the entire element or just an attribute. In our schema, we'd not only want to apply the data typing rules, but we'd also want to specify the semantical rules we formerly defined in the DTD (name followed by address, which consists of this and that, and so on). Listing 3 shows just such a schema.

Essentially, this schema lays out the voter record much as we did with the DTD. The difference is the data type is also specified. Here we have elements, as noted by the ElementType tags, and with each element, we have a data type specified by content. In the cases I show here, the content is textOnly, which tells you the element might contain text but not other elements. In other schemas, the content could be mixed, which indicates the value in the XML element can be purely textual or contain other XML elements.

And as you might expect, you can create your own complex data types, as I've done with the address element. The address element is something like a structure definition, in that I dictate the address element is composed of street1, street2, city, state, and zip elements. I further state that there must be one and only one of each sub-element. Of course, the entire document object (voters) is similarly described as a set of zero or more voter elements, each of which in turn consists of a name, an address, and a phone element.

I also elected to apply a data type to the terminal ElementType elements:

<ElementType name="name" content="textOnly" 
 model="closed" dt:type="string"/>

Here, I'm saying the name element must be a string value. Other data types you can select from include, but certainly aren't limited to, number, integer, and Boolean. These come from the Microsoft data type schema identified by the URN urn:schemas-microsoft-com:datatypes. Note I marked the ElementType's model attribute as closed (vs. open, which is the default). The model attribute controls the ability of its recipients to add undeclared (via your schema) attributes and sub-elements. The open model is very flexible, but it allows end users to add things to your XML data you might not have intended. If you use a closed model, the XML data must strictly conform to your schema.

If you take this schema code and type it into a file called ExampleSchema.xml, you should be able to reference the schema information when parsing XML data. In this case, I've modified the address example's XML code to match the following:

<?xml version="1.0"?>
 <vtr:voters xmlns:vtr="x-schema:ExampleSchema.xml">
    <!-- Next follow zero or more voter records -->
       <!-- Registered voter information -->
       <vtr:name>John and Marsha Doe</vtr:name>
          <vtr:street1>123 Elm Lane</vtr:street1>
          <vtr:city>Anytown </vtr:city>
       <vtr:phone>(123) 456-7890</vtr:phone>

I've used a namespace (vtr), and I've indicated the schema file is located in the same directory as the XML data file itself. Now if I view the XML data in IE5, I see the output shown in Figure 3.


In my next article, I'll discuss the XML Document Object Model (DOM), and the eXtensible Stylesheet Language, or XSL, which you can use to convert raw XML into other formats for exchange with non-XML systems or for presentation.

The goal of the XML DOM is to break XML down into constituent objects. For example, consider the document element, which acts as the root node of the XML data tree. The DOM considers that an object (NODE_DOCUMENT). When you think of the DOM, you can think of it in two ways. On the one hand, you have a standardized, logical structure. This object contains those objects, which are composed of these other objects, and so on. But you also allow for an automated approach to accessing portions of the document and lay the foundation for a programming interface people could implement and use to actually automate document processing.
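
The object view of a document can be previewed with the DOM implementation in Python's standard library, used here as a stand-in for the MSXML parser. The sketch walks a small hypothetical document and records each node's depth, type, and name. The outline helper is my own; Node.DOCUMENT_NODE is the W3C DOM constant that plays the role of NODE_DOCUMENT.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# Hypothetical document with a single voter record.
doc = parseString(
    "<voters><voter><name>John and Marsha Doe</name></voter></voters>")

def outline(node, depth=0):
    """Return (depth, nodeType, nodeName) for node and its descendants."""
    rows = [(depth, node.nodeType, node.nodeName)]
    for child in node.childNodes:
        rows.extend(outline(child, depth + 1))
    return rows

rows = outline(doc)
for depth, node_type, node_name in rows:
    print("  " * depth + node_name, node_type)

# The document object sits above the root element, not in place of it.
root_name = doc.documentElement.tagName
```

Every part of the document, including the text inside <name>, becomes a node that a program can reach by walking from the document object.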

XSL is another XML-based language, like XML Data for schemas, that allows the XML parser to convert the base XML document information into another format, such as HTML, for presentation. While XML Data allowed you to statically specify, in the schema, how the document's elements were to be arranged and what data types they must contain, XSL is more like a programming language. XSL allows you to rummage through the XML information contained in your XML document and do things with the information, based upon your style sheet design and what the XML parser finds in the document at execution time.
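
Python's standard library has no XSL processor, so the sketch below does by hand the kind of conversion a style sheet would describe: it walks the voter records and emits either an HTML table or tab-delimited text. The sample records reuse names from this article, and the function names are my own. It is a stand-in for the idea of separating data from presentation, not an example of XSL itself.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical data file; the records reuse names from the article.
VOTERS_XML = """<voters>
    <voter><name>John and Marsha Doe</name><phone>(123) 456-7890</phone></voter>
    <voter><name>John's VaporWare, Inc.</name><phone>(123) 555-9876</phone></voter>
</voters>"""

def voter_rows(xml_text):
    """Extract (name, phone) pairs; the presentation is decided later."""
    return [(voter.findtext("name", ""), voter.findtext("phone", ""))
            for voter in ET.fromstring(xml_text).findall("voter")]

def to_html(xml_text):
    """Render the records as an HTML table."""
    cells = ["<tr><td>%s</td><td>%s</td></tr>"
             % (escape(name, quote=False), escape(phone, quote=False))
             for name, phone in voter_rows(xml_text)]
    return "<table>\n" + "\n".join(cells) + "\n</table>"

def to_tabbed(xml_text):
    """Render the same records as tab-delimited text."""
    return "\n".join(name + "\t" + phone for name, phone in voter_rows(xml_text))

html_output = to_html(VOTERS_XML)
tabbed_output = to_tabbed(VOTERS_XML)
print(html_output)
print(tabbed_output)
```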

Wrapping up

I've barely scratched the surface of XML and the XML-related technologies. That's because there's a lot to XML. While the sheer amount of available XML information might seem daunting at first, in reality XML is a compelling technology that really does make things easier in many areas of computing today. A good place to start your research is (you guessed it) the W3C site. Similarly, you'll find a lot of information available at many corporate Internet sites, such as Microsoft's and IBM's.

I've mentioned I'll be discussing the XML DOM and XSL in the next article. I'll also show you some C++ code that munches through an XML document using an easily obtained XML parser—the one that ships with Internet Explorer 5.0. It's a fabulous tool, and it's well worth the effort to examine it more closely. See you on the Web!

Comments? Questions? Find a bug? Please send me a note!
