XML Document Layout and Syntax

An XML document contains a hierarchy of data values wrapped inside XML elements and includes various other entries to assist in processing this information. The data structure and processing entries must conform to exacting rules.

Standards and specifications for designing XML documents are maintained by the World Wide Web Consortium (W3C) and are published on the W3C website. These recommendations give the structural and syntactical rules for creating XML documents and standards for testing their validity.

An XML document contains two major sections -- the prolog and data structure sections. The prolog identifies the document as an XML document and includes other optional entries for specialized processing. The data structure section includes the hierarchy of data elements representing the document's information content.

The Prolog

The prolog section of an XML document contains three optional elements:

XML declaration
Processing instruction
Document type definition

The XML Declaration

The XML declaration, if included, must appear as the first line of the document. The declaration consists of a mandatory version number, an optional encoding declaration specifying the character set used, and an optional indication of the stand-alone nature of the document.

<?xml version="1.0" [encoding="UTF-8"] [standalone="yes"] ?>

The version number is mandatory; version 1.0 is the one in general use. An encoding declaration is needed only when the document is stored in an encoding other than UTF-8 or UTF-16. UTF-8 is the default and does not need to be coded. The standalone attribute indicates whether the document depends on declarations held in an external file (see "Document Type Definition" below). Code standalone="no" when external declarations, such as default attribute values or entity definitions, affect the document's content. When no external declarations are referenced the attribute has no effect and can be omitted; when they are referenced and the attribute is omitted, the parser assumes standalone="no".

An XML declaration is optional, and a document can be well formed without one. It is still a good idea to include it, since the W3C recommendation states that XML documents should begin with an XML declaration that specifies the version in use.
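As a minimal sketch of how a parser treats the declaration, the following Python example uses the standard-library xml.dom.minidom module. Python is an assumption made here for illustration, and the sample note document and the parses helper are hypothetical. The sketch reads the three declaration values and checks that a declaration placed anywhere other than the very start of the document is rejected.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample document carrying all three parts of the declaration.
DOC = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><note>hello</note>'

doc = minidom.parseString(DOC)
# minidom exposes the declaration values as attributes of the document object.
declaration = (doc.version, doc.encoding, doc.standalone)


def parses(data: bytes) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        minidom.parseString(data)
        return True
    except ExpatError:
        return False


# A declaration preceded by anything, even a newline, is a parsing error.
misplaced_ok = parses(b"\n" + DOC)

# Omitting the declaration entirely is still acceptable to the parser.
omitted_ok = parses(b"<note>hello</note>")
```

The declaration is optional as far as the parser is concerned, but when present it has to come first.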

Processing Instructions

A processing instruction passes instructions relevant to the XML document to the application that processes it. Most likely, a processing instruction refers to an external style sheet used to transform the document in some fashion, normally to format the XML data content for Web page display. A processing instruction that links to an external style sheet appears as shown below.

<?xml-stylesheet type="text/xsl" href="url"?>

Working with XML style sheets to perform transformations of XML documents into Web pages and into other XML documents is considered later in these tutorials.
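To illustrate how a program sees a processing instruction, here is a small Python sketch using the standard-library xml.dom.minidom module. The style.xsl file name and the Personnel root element are placeholders chosen for the example. The parser does not act on the instruction itself; it simply hands the target name and the data string to the application.

```python
from xml.dom import minidom, Node

# Hypothetical document; "style.xsl" stands in for a real style sheet URL.
DOC = (
    b'<?xml version="1.0"?>'
    b'<?xml-stylesheet type="text/xsl" href="style.xsl"?>'
    b"<Personnel/>"
)

doc = minidom.parseString(DOC)

# Processing instructions in the prolog are children of the document node.
instructions = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE
]
```

Note that the XML declaration is not reported as a processing instruction, even though it uses similar syntax.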

Document Type Definition

A document type (DOCTYPE) declaration is required if the XML document is to be validated against a Document Type Definition (DTD). A DTD is a set of rules for interpreting the data elements and their relationships. For example, DTD entries can indicate proper names and valid data types for data elements in the document. The general formats for DOCTYPE declarations are shown below.

<!DOCTYPE root node [ <!ELEMENT declaration> ... ]>
<!DOCTYPE root node SYSTEM "url">

An internal DTD can be coded as part of the XML document, or an external DTD can be coded in a separate document. If DTD specifications are in an external document linked by the url, code standalone="no" in the XML declaration whenever those external declarations affect the document's content. An XML document that satisfies DTD specifications is said to be "valid" in addition to being well-formed. DTD validation is considered later in these tutorials.
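The sketch below shows, under stated assumptions, how a DOCTYPE declaration surfaces to a program. It uses Python's standard-library xml.dom.minidom, which records the declaration but does not validate the document against the DTD, so this is not a validation example. The personnel.dtd file name is hypothetical and is never fetched.

```python
from xml.dom import minidom

# Hypothetical document that points at an external DTD by system identifier.
DOC = (
    b'<?xml version="1.0" standalone="no"?>'
    b'<!DOCTYPE Personnel SYSTEM "personnel.dtd">'
    b"<Personnel/>"
)

doc = minidom.parseString(DOC)

# The DOCTYPE names the root element and carries the DTD location.
root_name = doc.doctype.name
dtd_url = doc.doctype.systemId

# The name in the DOCTYPE is expected to match the actual root element.
names_match = doc.documentElement.tagName == root_name
```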

The Data Section

Following the prolog is the XML data structure for the information content of the document. Appearing here are the XML tags signifying the data nodes and data elements that compose the hierarchy of information. Although there is great flexibility in naming and organizing XML tags, there are fixed rules for doing so.

Opening/Closing Tags

One of these syntactical rules is that all XML elements must have opening and closing tags. This rule can be followed in two ways.

For XML elements that enclose data or enclose child elements there must be an opening tag along with a paired closing tag. These tags are in the following format,

<tagname>

...

</tagname>

where the closing tag includes the forward slash character (/) preceding the tag name. You are probably familiar with this syntax from XHTML where, for example, a paragraph is enclosed inside <p>...</p> tags.

For XML elements that do not enclose data or child elements a single start/end tag can be used. These empty tags are in the format,

<tagname/>

where the single tag includes a forward slash following the tag name. Again, you are familiar with this format in XHTML through the <br/> tag and other similar non-paired tags. Empty tags are often used as "markers" to indicate a point in the data structure where special processing is to take place. It is unlikely, though, that you will use empty tags in the normal course of XML data structuring, where elements nearly always enclose other elements.
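The two tag forms can be compared with a short Python sketch based on the standard-library xml.etree.ElementTree module. The marker and tagname element names are placeholders and the is_well_formed helper is hypothetical. The sketch checks that an empty tag and an empty open/close pair produce the same element, and that a missing closing tag is rejected.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# An empty element can be written as a pair of tags or as a single tag.
paired = ET.fromstring("<marker></marker>")
empty = ET.fromstring("<marker/>")
same = (paired.tag, paired.text, len(paired)) == (empty.tag, empty.text, len(empty))

# An opening tag with no matching closing tag is a parsing error.
unclosed_ok = is_well_formed("<tagname>data")
closed_ok = is_well_formed("<tagname>data</tagname>")
```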

Case Sensitivity

XML tag names are case sensitive. You are free to use upper- or lower-case characters in composing your tags; however, you must be consistent in their use. For example, <MyTag> is an entirely different identifier from <mytag> or <MYTAG>. By convention, tag names use lower-case characters.
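Case sensitivity affects both parsing and lookups, as the following Python sketch suggests. It uses the standard-library xml.etree.ElementTree module; the is_well_formed helper is hypothetical and the Employee fragment is abbreviated for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The closing tag must match the opening tag exactly, including case.
consistent = is_well_formed("<MyTag>data</MyTag>")
mismatched = is_well_formed("<MyTag>data</mytag>")

# Lookups by tag name are case sensitive as well.
root = ET.fromstring("<Employee><SSN>111-11-1111</SSN></Employee>")
exact = root.find("SSN")
wrong_case = root.find("ssn")
```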

Character Set

XML tag names can begin with an alphabetic character or the underscore ( _ ) character and be composed of alphabetic, numeric, and other characters. They cannot begin with the characters "xml" nor can they include blank spaces. You should not use mathematical symbols or decimal points (periods) in the names. Most often you will use common, meaningful names for your tags, names that are suggestive of the data content.

At the lowest levels of the XML hierarchy, tags enclose and identify data values. On some occasions the data themselves may include the special "<" and ">" characters that signify tags. For example, if a data value includes an XHTML tag as in the following element,

<paragraph>An XHTML page begins with the <html> tag....</paragraph>

you cannot directly code "<" and ">" characters surrounding "html". These characters are interpreted as XML code by the parser and their use results in an XML document that is not well formed. A parsing error results. Instead, "<" and ">" symbols used as data values must be replaced by their entity references &lt; and &gt;:

<paragraph>An XHTML page begins with the &lt;html&gt; tag....</paragraph>

The "&" sign also has special meaning in XML. If it appears as data inside XML tags it must be replaced by its entity reference &amp;.
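When documents are generated by a program, the replacement of special characters is usually left to a library routine. As a minimal sketch, Python's standard-library xml.sax.saxutils.escape function converts the three characters discussed above, and parsing the result restores the original text. The sample sentence is made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Data containing all three special characters.
raw = "An XHTML page begins with the <html> tag & others"

# escape() replaces "&", "<", and ">" with their entity references.
escaped = escape(raw)

# Once escaped, the text can be placed safely between tags.
element = ET.fromstring(f"<paragraph>{escaped}</paragraph>")
round_trip = element.text
```

The parser converts the entity references back, so the application sees the original characters as data.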

CDATA Sections

If there is an extensive amount of character data that should not be interpreted as XML code, this text can be enclosed inside a CDATA section of the XML document. This section is identified and enclosed by the special strings <![CDATA[ and ]]>.

<![CDATA[

     ...text containing special characters

]]>

For example, the following XML <code> element includes XHTML code whose "<" and ">" characters should not be interpreted as XML symbols.

<code>
    The following tags appear at the beginning of XHTML pages:
    <![CDATA[
        <html>
        <head>
            <title>...</title>
        </head>
        <body>
    ]]>
</code>
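The effect of a CDATA section can be seen in a short Python sketch using the standard-library xml.etree.ElementTree module. The abbreviated code element below is a stand-in for the fuller example above. The parser treats the enclosed characters as plain data and merges them with the surrounding text.

```python
import xml.etree.ElementTree as ET

# Abbreviated version of a <code> element holding XHTML tags as data.
DOC = "<code>Opening tags: <![CDATA[<html><head>]]></code>"

element = ET.fromstring(DOC)

# The CDATA content arrives as ordinary text; no child elements are created.
text = element.text
child_count = len(element)
```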

Quoted Attributes

XML tags can have attributes, used in much the same way as XHTML tag attributes. Attributes are normally considered to be "identity" information, not primary data values like those enclosed inside named tags. Attribute values must be enclosed in quotes, either double quotes ("") or single quotes (''). Opposite quotation marks must be used if the attribute value itself includes a double quote or apostrophe.

Whether a data value is coded as an attribute or a data element is left to the discretion of the data designer. Consider, for instance, the following <Employee> node taken from the preceding <Personnel> data structure.

<Employee>
    <SSN>111-11-1111</SSN>
    <FirstName>Ann</FirstName>
    <LastName>Adams</LastName>
    <Salary>65000.00</Salary>
    <Department>Accounting</Department>
</Employee>

An alternative way to code this structure is to use the SSN value as an attribute of the <Employee> element rather than as a separate data element. Of course, all other <Employee> nodes in the document would need to include this same attribute to remain consistent.

<Employee SSN="111-11-1111">
    <FirstName>Ann</FirstName>
    <LastName>Adams</LastName>
    <Salary>65000.00</Salary>
    <Department>Accounting</Department>
</Employee>

Both of the node examples are equivalent in terms of the data values they represent. The only difference is that the SSN attribute requires additional processing to extract its value whereas the SSN child node can be processed in the same manner as the other child nodes.

As a general rule it is best to avoid use of attributes unless there is no better way to designate or assign values to data elements. If a data value is important enough to be included in the data structure, then it is probably deserving of its own data element.
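The difference in processing between the two designs can be sketched in Python with the standard-library xml.etree.ElementTree module. The Employee fragments are shortened versions of the examples above. A child element is read like any other child, while an attribute is read through a separate call.

```python
import xml.etree.ElementTree as ET

# SSN coded as a child element.
as_element = ET.fromstring(
    "<Employee><SSN>111-11-1111</SSN><LastName>Adams</LastName></Employee>"
)

# SSN coded as an attribute of <Employee>.
as_attribute = ET.fromstring(
    '<Employee SSN="111-11-1111"><LastName>Adams</LastName></Employee>'
)

# Child elements are all read the same way...
ssn_from_element = as_element.findtext("SSN")
last_name = as_element.findtext("LastName")

# ...whereas an attribute needs its own access method.
ssn_from_attribute = as_attribute.get("SSN")
```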

Proper Nesting

XML data are enclosed inside opening and closing tag pairs which must be properly nested one inside another. That is, inner tags must be closed before the enclosing set of outer tags is closed.

<Employee>
    <SSN>111-11-1111</SSN>
    <Name>
        <FirstName>Ann</FirstName>
        <LastName>Adams</LastName>
    </Name>
    <Salary>65000.00</Salary>
    <Department>Accounting</Department>
</Employee>

The above example points out that you can choose to arrange your data within any preferred parent/child relationships. Here, the FirstName and LastName elements have been nested inside a Name parent element. The reason for doing this is to make it more convenient to extract both values through the single Name element rather than having to extract them individually. The way in which you arrange your data will depend to a large extent on how you intend to access it. There are no formal rules for when or when not to enclose elements inside a parent identifier element so long as opening and closing tags are properly paired.
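As a sketch of the convenience described above, the following Python example (standard-library xml.etree.ElementTree) retrieves both name parts through the single Name element, and shows that overlapping tags are rejected. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

EMPLOYEE = """
<Employee>
    <SSN>111-11-1111</SSN>
    <Name>
        <FirstName>Ann</FirstName>
        <LastName>Adams</LastName>
    </Name>
</Employee>
"""

employee = ET.fromstring(EMPLOYEE)

# Both name parts are reachable through the one <Name> parent.
name = employee.find("Name")
full_name = " ".join(child.text for child in name)

# A path expression reaches a nested element directly.
first_name = employee.findtext("Name/FirstName")

# Tags that overlap instead of nesting are a parsing error.
try:
    ET.fromstring("<Name><FirstName>Ann</Name></FirstName>")
    overlapping_rejected = False
except ET.ParseError:
    overlapping_rejected = True
```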

Comments

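Comments can be placed almost anywhere in the XML document to provide commentary about the document and its sections. They cannot appear before the XML declaration or inside a tag, and the comment text cannot contain a double hyphen (--). Comment tags also can appear around document sections to disable them during debugging. XML comments use the same syntax as XHTML comments.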

<!-- comment -->
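Comments are not part of the data, but a DOM parser still records them as nodes. The Python sketch below uses the standard-library xml.dom.minidom module to collect the comments inside a hypothetical Employee element.

```python
from xml.dom import minidom, Node

# Hypothetical fragment with a comment between the tags.
DOC = "<Employee><!-- temporary record --><SSN>111-11-1111</SSN></Employee>"

doc = minidom.parseString(DOC)

# Comment nodes sit alongside element nodes in the tree.
comments = [
    node.data
    for node in doc.documentElement.childNodes
    if node.nodeType == Node.COMMENT_NODE
]
```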

Well Formed and Valid Documents

XML documents that meet the above syntactical rules are said to be well formed, meaning that the XML parser will not find error with their arrangement and coding. If a document is not well formed, the parser reports the error and ceases its evaluation of the document.

A document that is well formed, however, may not be valid. For instance, you may have inadvertently entered non-numeric characters into an element that is supposed to contain only numbers; or, perhaps you inadvertently left out a child element in one of the parent nodes. The way in which you catch an invalid structure or invalid data is by providing a Document Type Definition (DTD) against which the document can be validated. Validating XML documents is reserved for discussion later in these tutorials.

An XML document must be well formed in order to be evaluated by the parser. The document may be, but is not required to be, valid. An invalid document can be evaluated correctly by the parser although incorrect processing results may be produced. Throughout most of these tutorials consideration is given only to well-formed documents, saving the complications of data validation for a later topic.
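The distinction can be sketched in Python. The standard-library parsers check well-formedness only and do not validate against a DTD, so the salary_is_numeric function below is a hand-written, hypothetical stand-in for a validation rule, not DTD validation itself. The sample document is well formed yet breaks the rule.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def salary_is_numeric(text: str) -> bool:
    """Hypothetical application rule: <Salary> must hold a number."""
    salary = ET.fromstring(text).findtext("Salary")
    try:
        float(salary)
        return True
    except (TypeError, ValueError):
        # TypeError covers a missing <Salary>; ValueError covers bad content.
        return False


DOC = "<Employee><Salary>sixty-five thousand</Salary></Employee>"

well_formed = is_well_formed(DOC)       # the syntax is fine
passes_rule = salary_is_numeric(DOC)    # the data is not
```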

The Document Object Model

There are two primary ways to access and process data residing in an XML file. Stream-based methods input the XML structure as a sequence of elements and data values arriving one at a time. It is up to the processing program to bring order and structure to the individual elements and to handle them as components of a larger hierarchy. Stream-based methods are often used for fast access to individual components of interest, to extract particular data elements or to duplicate a structure through simple read/write methods. Stream-based methods are applicable only to server processing of XML documents; there are no equivalent browser methods owing primarily to the fact that browsers do not permit writing files to the local computer.

Memory-based methods input and retain the entire XML document inside computer memory, making the full document accessible at one time. These methods provide direct access to any parent or child element, or their combinations, without having to read sequentially through the structure. Most XML processing is through these memory-resident methods which have equivalents in the browser and on the server.
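The two approaches can be contrasted in a short Python sketch using the standard-library xml.etree.ElementTree module. Python is used here only for illustration; the second employee record is invented sample data. The stream-based loop handles elements as they arrive, while the memory-based version loads the whole tree and then jumps straight to the element of interest.

```python
import io
import xml.etree.ElementTree as ET

DOC = (
    b"<Personnel>"
    b"<Employee><LastName>Adams</LastName></Employee>"
    b"<Employee><LastName>Baker</LastName></Employee>"
    b"</Personnel>"
)

# Stream-based: elements are delivered one at a time as they are completed.
streamed = []
for event, element in ET.iterparse(io.BytesIO(DOC), events=("end",)):
    if element.tag == "LastName":
        streamed.append(element.text)
    elif element.tag == "Employee":
        element.clear()  # discard the finished record to limit memory use

# Memory-based: the full tree is loaded, then accessed directly.
root = ET.parse(io.BytesIO(DOC)).getroot()
second_last_name = root.findall("Employee")[1].findtext("LastName")
```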

Access to the in-memory document is through the Document Object Model (DOM). The Document Object Model is the application programming interface (API) to an XML document. The DOM provides a rich set of properties and methods for locating, extracting, and processing data elements based upon the overall structure of the document. Standards for DOM processing are maintained by the W3C and are published on its website.
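As a rough illustration of DOM properties and methods, the Python sketch below uses the standard-library xml.dom.minidom module, which implements a subset of the W3C DOM. It is not the browser or .NET DOM discussed in this tutorial, although the property and method names shown follow the same W3C interface. The Personnel fragment is abbreviated sample data.

```python
from xml.dom import minidom

DOC = (
    "<Personnel><Employee>"
    "<FirstName>Ann</FirstName><LastName>Adams</LastName>"
    "</Employee></Personnel>"
)

doc = minidom.parseString(DOC)

# documentElement is the root of the tree.
root_name = doc.documentElement.tagName

# getElementsByTagName locates elements anywhere in the document.
last = doc.getElementsByTagName("LastName")[0]

# The data value is held in a text node beneath the element.
last_name = last.firstChild.data

# parentNode moves back up the hierarchy.
parent_name = last.parentNode.tagName
```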

Browser DOM

A Document Object Model is available in modern browsers for browser-side processing of XML documents. As mentioned previously, the MSXML parser is built into Internet Explorer. This parser recognizes XML files linked through a URL and builds a memory-resident document tree from the identified data elements. By default, this tree is displayed in the browser as an expandable and collapsible list of data elements. However, the document also can be processed in one of two ways. The document can be associated with a style sheet in order to select and format elements for display as a Web page using the XHTML markup language. In addition, the document can be processed by a local script, written in the JavaScript language, to traverse the tree and apply specialized processing to data elements. These methods are used to offload XML processing from the server to the browser. Both, though, make use of the properties and methods of the local Document Object Model to effect their processing.

Server DOM

A comparable DOM is available for server-side processing. Through various .NET software classes an XML document is loaded into memory, with DOM properties and methods accessible through the Visual Basic language employed on ASP.NET pages. In this case, XML data can be read from and written to server files, or the document tree can be traversed with server scripts to perform all manner of information processing, similar in scope to processing methods used for standard files and databases. In fact, XML data structures become viable alternatives to files and databases as primary data stores for organizational information. Not only that, but XML data files can be easily exchanged between diverse computer systems owing to their standard formats and transmission protocols.

Most of the effort in learning XML is in learning the properties and methods of the XML DOM to manipulate data values contained in XML documents. The focus of these tutorials is on the XML Document Object Model and its application programming interface for both browser and server processing of XML documents.
