XML Document Layout and Syntax
An XML document contains a hierarchy of data values wrapped inside XML elements
and includes various other entries to assist in processing this information. The
data structure and processing entries must conform to exacting rules.
Standards and specifications for designing XML documents are maintained by the
World Wide Web Consortium (W3C) and are published on the W3C website. These
recommendations give the structural and syntactical rules for creating XML
documents and standards for testing their validity.
An XML document contains two major sections -- the prolog
and data structure sections. The prolog identifies the document
as an XML document and includes other optional entries for specialized processing.
The data structure section includes the hierarchy of data elements representing
the document's information content.
The Prolog
The prolog section of an XML document contains three optional elements:
- XML declaration
- Processing instruction
- Document type definition
The XML Declaration
The XML declaration, if included, must appear as the first line
of the document. The declaration consists of a mandatory version number, an
optional encoding declaration specifying the character set used, and an optional
indication of the stand-alone nature of the document.
<?xml version="1.0" [encoding="UTF-8"] [standalone="yes"] ?>
The version number must be included whenever the declaration is present; version
1.0 is the one in general use. An encoding declaration is needed only when the
document is stored in an encoding other than UTF-8 or UTF-16, which every XML
parser is required to read without being told. The standalone attribute indicates
whether the document depends on declarations held in an external file (see
"Document Type Definition" below). When the document refers to external
declarations and the attribute is omitted, the parser assumes standalone="no";
when there are no external declarations the attribute has no effect and can be
omitted.
An XML declaration is optional, but it is always a good idea to include it since
the W3C recommendation states that XML documents should begin with one. Its
absence does not, by itself, prevent a document from being well formed.
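As a minimal sketch of how a declaration is produced in practice, the following
fragment serializes a small element with an XML declaration. It assumes Python
and its standard xml.etree.ElementTree module, neither of which is used elsewhere
in these tutorials, and the element content is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical data used only for this illustration.
root = ET.Element("Personnel")
ET.SubElement(root, "Employee").text = "Ann Adams"

# xml_declaration=True (available in Python 3.8 and later) asks the serializer
# to emit the declaration; with a named encoding the result is a bytes object.
document = ET.tostring(root, encoding="UTF-8", xml_declaration=True)

# The declaration is the first thing in the serialized document.
first_line = document.decode("UTF-8").splitlines()[0]
print(first_line)
```

The serializer places the declaration ahead of the root element, which is the
position the rule above requires.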
Processing Instructions
A processing instruction passes information to the application that processes
the XML document, usually in the form of a reference to a supporting document.
Most likely, this processing instruction refers to an external style sheet
used to transform the document in some fashion, normally to format the XML
data content for Web page display. A processing instruction that links to an
external style sheet appears as shown below.
<?xml-stylesheet type="text/xsl" href="url"?>
Working with XML style sheets to perform transformations of XML documents
into Web pages and into other XML documents is considered later in these tutorials.
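To see how a program might discover the style sheet link, here is a small
sketch that lists the processing instructions in a document's prolog. It assumes
Python's standard xml.dom.minidom module; the file name staff.xsl and the
<Personnel/> root are hypothetical.

```python
from xml.dom import minidom

# Hypothetical document; "staff.xsl" is a made-up style sheet name.
source = (
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/xsl" href="staff.xsl"?>'
    '<Personnel/>'
)
doc = minidom.parseString(source)

# Collect (target, data) pairs for each processing instruction found at the
# top level of the document. The XML declaration is not reported as one.
instructions = [
    (node.target, node.data)
    for node in doc.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]
print(instructions)
```

The parser hands the instruction over as a target name plus an uninterpreted
string; it is up to the application to pick out the href value.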
Document Type Definition
A document type (DOCTYPE) declaration is required if the XML document is to be
validated against a Document Type Definition (DTD).
A DTD is a set of rules for interpreting the data elements and their relationships.
For example, DTD entries can indicate proper names and valid data types for
data elements in the document. The general formats for DOCTYPE declarations are
shown below.
<!DOCTYPE root-node [ <!ELEMENT declaration> ... ]>
<!DOCTYPE root-node SYSTEM "url">
An internal DTD can be coded as part of the XML document, or an external
DTD can be coded in a separate document. If DTD specifications are in an external
document linked by the url, the document is not stand-alone: either code
standalone="no" in the XML declaration or omit the attribute, in which case "no"
is assumed. An XML document that satisfies DTD specifications
is said to be "valid" in addition to being well-formed. DTD validation is considered
later in these tutorials.
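The following sketch shows how a program can read the DOCTYPE declaration of a
document. It assumes Python's standard xml.dom.minidom module, and the name
personnel.dtd is hypothetical. The parser behind this module is non-validating:
it records the declaration but does not check the document against the DTD.

```python
from xml.dom import minidom

# Hypothetical document that points at an external DTD named "personnel.dtd".
# The file is never opened by this sketch.
source = '<!DOCTYPE Personnel SYSTEM "personnel.dtd"><Personnel/>'
doc = minidom.parseString(source)

# The DOCTYPE declaration is exposed as the document's doctype node.
doctype = doc.doctype
print(doctype.name, doctype.systemId)
```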
The Data Section
Following the prolog is the XML data structure for the information content of the
document. Appearing here are the XML tags signifying the data nodes and data
elements that compose the hierarchy of information. Although there is great
flexibility in naming and organizing XML tags, there are fixed rules for doing
so.
Opening/Closing Tags
One of these syntactical rules is that all XML elements must have opening and
closing tags. This rule can be followed in two ways.
For XML elements that enclose data or enclose child elements there must be an
opening tag along with a paired closing tag. These tags are in the
following format,
<tagname> ... </tagname>
where the closing tag includes the forward slash character (/) preceding the
tag name. You are probably familiar with this syntax from XHTML where, for
example, a paragraph is enclosed inside <p>...</p> tags.
For XML elements that do not enclose data or child elements a single start/end
tag can be used. These empty tags are in the format,
<tagname/>
where the single tag includes a forward slash following the tag name.
Again, you are familiar with this format in XHTML through the <br/> tag and
other similar non-paired tags. Empty tags
are often used as "markers" to indicate a point in the data structure where
special processing is to take place. It is unlikely, though, that you will use
empty tags in the normal course of XML data structuring, where elements nearly
always enclose other elements.
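A short sketch can make the two tag forms concrete. It assumes Python's standard
xml.etree.ElementTree module; the <marker> element name is hypothetical.

```python
import xml.etree.ElementTree as ET

# An empty tag and a paired tag with nothing inside describe the same element.
empty = ET.fromstring("<marker/>")
paired = ET.fromstring("<marker></marker>")
same = (
    empty.tag == paired.tag
    and empty.text is None and paired.text is None
    and len(empty) == 0 and len(paired) == 0
)

# An element whose closing tag is missing is rejected by the parser.
try:
    ET.fromstring("<Employee><SSN>111-11-1111</Employee>")
    unclosed_accepted = True
except ET.ParseError as error:
    unclosed_accepted = False
    print("Not well formed:", error)

print(same, unclosed_accepted)
```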
Case Sensitivity
XML tag names are case sensitive. You are free to use upper- or lower-case
characters in composing your tags; however, you must be consistent in their use.
For example, <MyTag> is an entirely different identifier from <mytag> or
<MYTAG>. By convention, tag names use lower-case characters.
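The effect of case sensitivity can be sketched as follows, assuming Python's
standard xml.etree.ElementTree module. The <Personnel> wrapper and the helper
function parses are introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# Two elements whose names differ only in case are distinct elements.
root = ET.fromstring(
    "<Personnel><MyTag>a</MyTag><mytag>b</mytag></Personnel>"
)
upper_value = root.findtext("MyTag")
lower_value = root.findtext("mytag")
missing = root.find("MYTAG")   # no element carries this exact name


def parses(text):
    """Return True if the text is accepted by the parser (hypothetical helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Opening and closing tags must match in case as well as in spelling.
mismatch_ok = parses("<MyTag>data</mytag>")
print(upper_value, lower_value, missing, mismatch_ok)
```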
Character Set
XML tag names must begin with an alphabetic character or the underscore ( _ )
character (the colon is also permitted but is reserved for namespaces) and can
be composed of alphabetic, numeric, and certain other characters.
Names beginning with the characters "xml", in any mix of upper and lower case,
are reserved and should not be used, and names cannot include blank spaces.
You should not use mathematical symbols or decimal points (periods) in the names.
Most often you will use common, meaningful names for your tags, names that are
suggestive of the data content.
At the lowest levels of the XML hierarchy, tags enclose and identify data values.
On some occasions the data themselves may include the special "<" and ">"
characters that signify tags. For example, if a data value includes an XHTML tag
as in the following element,
<paragraph>An XHTML page begins with the <html> tag....</paragraph>
you cannot directly code "<" and ">" characters surrounding "html". These
characters are interpreted as XML code by the parser and their use results in an
XML document that is not well formed. A parsing error results. Instead,
"<" and ">" symbols used as data values must be replaced by their character codes
&lt; and &gt;:
<paragraph>An XHTML page begins with the &lt;html&gt; tag....</paragraph>
The "&" sign also has special meaning in XML. If it appears as data inside XML
tags it must be replaced by its character code &amp;.
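Replacing these characters by hand is error prone, so programs usually let a
library do it. The sketch below assumes Python's standard xml.sax.saxutils and
xml.etree.ElementTree modules; the sample sentence extends the one above with a
made-up "&" to show all three replacements.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Raw data containing all three special characters (the "& more" part is
# invented for this illustration).
raw = "An XHTML page begins with the <html> tag & more...."

# escape() replaces "&", "<", and ">" with their character codes.
encoded = escape(raw)
element_source = "<paragraph>" + encoded + "</paragraph>"

# The parser converts the codes back to the original characters.
decoded = ET.fromstring(element_source).text
print(encoded)
print(decoded == raw)
```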
CDATA Sections
If there is an extensive amount of character data that should not be interpreted as
XML code, this text can be enclosed inside a CDATA section of the
XML document. This section is identified and enclosed by the special strings
<![CDATA[ and ]]>.
<![CDATA[
...text containing special characters
]]>
For example, the following XML <code> element includes XHTML code
whose "<" and ">" characters should not be interpreted as XML symbols.
<code>
The following tags appear at the beginning of XHTML pages:
<![CDATA[
<html>
<head>
<title>...</title>
</head>
<body>
]]>
</code>
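The following sketch, which assumes Python's standard xml.etree.ElementTree
module, parses a <code> element like the one above and confirms that the tags
inside the CDATA section arrive as ordinary text rather than as child elements.

```python
import xml.etree.ElementTree as ET

source = """<code>
The following tags appear at the beginning of XHTML pages:
<![CDATA[
<html>
<head>
<title>...</title>
</head>
<body>
]]>
</code>"""

element = ET.fromstring(source)

# No child elements are created from the markup inside the CDATA section.
child_count = len(element)

# The markup is delivered unchanged as part of the element's text.
has_literal_tag = "<html>" in element.text
print(child_count, has_literal_tag)
```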
Quoted Attributes
XML tags can have attributes, used in much the same way as XHTML
tag attributes. Attributes are normally considered to be "identity" information,
not primary data values like those enclosed inside named tags. Attribute values
must be enclosed in quotes, either double quotes ("") or single quotes ('').
Opposite quotation marks must be used if the attribute value itself includes a
double quote or apostrophe.
Whether a data value is coded as an attribute or a data element is left to the
discretion of the data designer. Consider, for instance, the following
<Employee> node taken from the preceding <Personnel> data structure.
<Employee>
<SSN>111-11-1111</SSN>
<FirstName>Ann</FirstName>
<LastName>Adams</LastName>
<Salary>65000.00</Salary>
<Department>Accounting</Department>
</Employee>
An alternative way to code this structure is to use the SSN value as an
attribute of the <Employee> element rather than as a separate data element.
Of course, all other <Employee> nodes in the document would need to include
this same attribute to remain consistent.
<Employee SSN="111-11-1111">
<FirstName>Ann</FirstName>
<LastName>Adams</LastName>
<Salary>65000.00</Salary>
<Department>Accounting</Department>
</Employee>
Both of the node examples are equivalent in terms of the data values they
represent. The only difference is that the SSN attribute requires additional
processing to extract its value whereas the SSN child node can be processed in
the same manner as the other child nodes.
As a general rule it is best to avoid use of attributes unless there is no better
way to designate or assign values to data elements. If a data value is important
enough to be included in the data structure, then it is probably deserving of its
own data element.
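The difference in how the two codings are read can be sketched as follows,
assuming Python's standard xml.etree.ElementTree module. The <Employee> nodes
are shortened versions of the ones shown above.

```python
import xml.etree.ElementTree as ET

# SSN coded as a child element.
as_child = ET.fromstring(
    "<Employee><SSN>111-11-1111</SSN><FirstName>Ann</FirstName></Employee>"
)
# SSN coded as an attribute of the element.
as_attribute = ET.fromstring(
    '<Employee SSN="111-11-1111"><FirstName>Ann</FirstName></Employee>'
)

# A child element is read the same way as any other child...
ssn_from_child = as_child.findtext("SSN")
# ...while an attribute is read through a separate call on the element itself.
ssn_from_attribute = as_attribute.get("SSN")

print(ssn_from_child == ssn_from_attribute)
```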
Proper Nesting
XML data are enclosed inside opening and closing tag pairs which must be properly
nested one inside another. That is, inner tags must be closed before the enclosing
set of outer tags is closed.
<Employee>
<SSN>
111-11-1111
</SSN>
<Name>
<FirstName>
Ann
</FirstName>
<LastName>
Adams
</LastName>
</Name>
<Salary>
65000.00
</Salary>
<Department>
Accounting
</Department>
</Employee>
The above example points out that you can choose to arrange your data within any
preferred parent/child relationships. Here, the FirstName and LastName elements
have been nested inside a Name parent element. The reason for doing this is to make
it more convenient to extract both values through the single Name element rather
than having to extract them individually. The way in which you arrange your data will
depend to a large extent on how you intend to access it. There are no formal rules
for when or when not to enclose elements inside a parent identifier element so long
as opening and closing tags are properly paired.
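As a sketch of the convenience described above, the fragment below retrieves
both name values through the single Name element. It assumes Python's standard
xml.etree.ElementTree module. Because each value in the example sits on its own
line, the surrounding line breaks are part of the data and are stripped here.

```python
import xml.etree.ElementTree as ET

source = """<Employee>
<SSN>
111-11-1111
</SSN>
<Name>
<FirstName>
Ann
</FirstName>
<LastName>
Adams
</LastName>
</Name>
</Employee>"""

root = ET.fromstring(source)

# Reach both values through their shared parent element.
name = root.find("Name")
full_name = " ".join(child.text.strip() for child in name)

# Tags closed in the wrong order are rejected by the parser.
try:
    ET.fromstring("<Name><FirstName>Ann</Name></FirstName>")
    bad_nesting_accepted = True
except ET.ParseError:
    bad_nesting_accepted = False

print(full_name, bad_nesting_accepted)
```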
Comments
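Comments can be placed almost anywhere in the XML document, other than inside a
tag or ahead of the XML declaration, to provide commentary about
the document and its sections. Comment tags also can appear around document
sections to disable them during debugging, provided the enclosed text does not
itself contain the string "--". XML comments use the same syntax as
XHTML comments.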
<!-- comment -->
Well Formed and Valid Documents
XML documents that meet the above syntactical rules are said to be
well formed, meaning that the XML parser will not find error with
their arrangement and coding. If a document is not well formed, the parser reports
the error and ceases its evaluation of the document.
A document that is well formed, however, may not be valid. For
instance, you may have inadvertently entered non-numeric characters into an
element that is supposed to contain only numbers; or, perhaps you inadvertently
left out a child element in one of the parent nodes. The way in which you catch
an invalid structure or invalid data is by providing a Document Type Definition
(DTD) against which the document can be validated. Validating XML documents is
reserved for discussion later in these tutorials.
An XML document must be well formed in order to be evaluated by the parser.
The document may be, but is not required to be, valid. An invalid document
can be evaluated correctly by the parser although incorrect processing results may
be produced. Throughout most of these tutorials consideration is given only to
well-formed documents, saving the complications of data validation for a later
topic.
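The distinction between the two conditions can be sketched in code. The example
assumes Python's standard xml.etree.ElementTree module, whose parser checks only
for well-formedness. The function classify and its single rule (a <Salary>
child must be present and numeric) are hypothetical stand-ins for a real
validation step.

```python
import xml.etree.ElementTree as ET


def classify(text):
    """Classify a document using one hand-coded, hypothetical validity rule."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # The parser stops at the first syntax error.
        return "not well formed"

    salary = root.findtext("Salary")
    if salary is None:
        return "well formed, invalid"      # required child element is missing
    try:
        float(salary)
    except ValueError:
        return "well formed, invalid"      # non-numeric characters in the value
    return "well formed, valid"


print(classify("<Employee><Salary>65000.00</Salary></Employee>"))
print(classify("<Employee><Salary>sixty-five</Salary></Employee>"))
print(classify("<Employee><Salary>65000.00</Employee>"))
```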
The Document Object Model
There are two primary ways to access and process data residing in an XML file.
Stream-based methods input the XML structure as a sequence of
elements and data values arriving one at a time. It is up to the processing
program to bring order and structure to the individual elements and to handle
them as components of a larger hierarchy. Stream-based methods are often used
for fast access to individual components of interest, to extract particular
data elements or to duplicate a structure through simple read/write methods.
Stream-based methods are applicable only to server processing of XML documents;
there are no equivalent browser methods owing primarily to the fact that browsers
do not permit writing files to the local computer.
Memory-based methods input and retain the entire XML document
inside computer memory, making the full document accessible at one time. These
methods provide direct access to any parent or child element, or their combinations,
without having to read sequentially through the structure. Most XML processing is
through these memory-resident methods which have equivalents in the browser and on
the server.
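The two approaches can be compared in a short sketch. It assumes Python's
standard xml.etree.ElementTree module, with iterparse standing in for a
stream-based reader and parse for a memory-based one. The second employee,
Baker, is made up for this illustration.

```python
import io
import xml.etree.ElementTree as ET

source = (
    b"<Personnel>"
    b"<Employee><LastName>Adams</LastName></Employee>"
    b"<Employee><LastName>Baker</LastName></Employee>"
    b"</Personnel>"
)

# Stream-based: elements are reported one at a time, in document order,
# as each closing tag is read.
streamed = []
for event, element in ET.iterparse(io.BytesIO(source), events=("end",)):
    if element.tag == "LastName":
        streamed.append(element.text)

# Memory-based: the whole tree is loaded first, after which any element
# can be reached directly without reading through the ones before it.
root = ET.parse(io.BytesIO(source)).getroot()
second_last_name = root[1].findtext("LastName")

print(streamed, second_last_name)
```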
Access to the in-memory document is through the
Document Object Model (DOM). The Document Object Model is the
application programming interface (API) to an XML document. The DOM provides a
rich set of properties and methods for locating, extracting, and processing data
elements based upon the overall structure of the document. Standards for DOM
processing are maintained by the W3C and are published on its website as the
DOM specifications.
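A few of the DOM properties and methods can be sketched as follows. The example
assumes Python's standard xml.dom.minidom module, which is one implementation
of the W3C DOM interface; the browser and server implementations discussed in
these tutorials expose comparable names.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<Personnel><Employee>"
    "<FirstName>Ann</FirstName><LastName>Adams</LastName>"
    "</Employee></Personnel>"
)

# Locate an element anywhere in the tree by its tag name.
last_name = doc.getElementsByTagName("LastName")[0]

# The data value is held in a text node that is a child of the element.
value = last_name.firstChild.data

# Move from the element to its parent and to the root of the document.
parent_name = last_name.parentNode.tagName
root_name = doc.documentElement.tagName

print(value, parent_name, root_name)
```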
Browser DOM
A Document Object Model is available in modern browsers for browser-side processing
of XML documents. As mentioned previously, the MSXML parser is built into Internet
Explorer. This parser recognizes XML files linked through a URL and builds a
memory-resident document tree from the identified data elements. By default, this
tree is displayed in the browser as an expandable-contractable list of data elements.
However, the document also can be processed in one of two ways. The document can be
associated with a style sheet in order to select and format elements for display as
a Web page using the XHTML markup language. In addition, the document can be
processed by a local script, written in the JavaScript language, to traverse the
tree and apply specialized processing to data elements. These methods are used to
offload XML processing from the server to the browser. Both, though, make use of
the properties and methods of the local Document Object Model to effect their
processing.
Server DOM
A comparable DOM is available for server-side processing. Through various .NET
software classes an XML document is loaded into memory, with DOM properties
and methods accessible through the Visual Basic language employed on ASP.NET
pages. In this case, XML data can be read from and written to server files, or
the document tree can be traversed with server scripts to perform all manner of
information processing, similar in scope to processing methods used for standard
files and databases. In fact, XML data structures become viable alternatives to
files and databases as primary data stores for organizational information. Not
only that, but XML data files can be easily exchanged between diverse computer
systems owing to their standard formats and transmission protocols.
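The read, change, and write cycle described above can be sketched as follows.
Python's standard xml.etree.ElementTree module stands in here for the .NET
classes, which are covered later in these tutorials. The file name
personnel.xml and the new salary value are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<Personnel><Employee><Salary>65000.00</Salary></Employee></Personnel>"
)

# Change a data value in the in-memory tree (hypothetical new salary).
root.find("Employee/Salary").text = "70000.00"

# Write the tree to a file in a temporary folder, then read it back.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "personnel.xml")
    ET.ElementTree(root).write(path, encoding="UTF-8", xml_declaration=True)
    reloaded = ET.parse(path).getroot()
    saved_salary = reloaded.findtext("Employee/Salary")

print(saved_salary)
```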
Most of the effort in learning XML is in learning the properties and methods of
the XML DOM to manipulate data values contained in XML documents. The focus of
these tutorials is on the XML Document Object Model and its application programming
interface for both browser and server processing of XML documents.