Using %XML.TextReader
The %XML.TextReader class offers a simple way to read arbitrary XML documents that may or may not map directly to InterSystems IRIS® data platform objects. Specifically, this class provides ways to navigate a well-formed XML document and view the information in it (elements, attributes, comments, namespace URIs, and so on). This class also provides complete document validation, based on either a DTD or an XML schema. Unlike %XML.Reader, however, %XML.TextReader does not provide a way to return a DOM. If you require a DOM, see Importing XML into Objects.
The XML declaration of any XML document that you use should indicate the character encoding of that document, and the document should be encoded as declared. If the character encoding is not declared, InterSystems IRIS uses the defaults described in Character Encoding of Input and Output. If these defaults are not correct, modify the XML declaration so that it specifies the character set actually used.
Reading Arbitrary XML
To read an arbitrary XML document that does not necessarily have any relationship to an InterSystems IRIS object class, you invoke methods of the %XML.TextReader class, which opens the document and loads it into temporary storage as a text reader object. The text reader object contains a navigable tree of nodes, each of which contains information about the source document. Your method can then navigate the document and find out information about it. Properties of the object give you information about the document that depends on your current location within the document. If there are validation errors, those errors are also available as nodes in the tree.
Overall Structure
Your method should do some or all of the following:
-
Specify a document source, via the first argument of one of the following methods:
Method | First Argument |
---|---|
ParseFile() | A file name, with complete path. Note that the filename and path must contain only ASCII characters. |
ParseStream() | A stream |
ParseString() | A string |
ParseURL() | A URL |
In any case, the source document must be a well-formed XML document; that is, it must obey the basic rules of XML syntax. Each of these methods returns a status ($$$OK or a failure code) to indicate whether the result was successful. You can test the status with the usual mechanisms; in particular, you can use $System.Status.DisplayError(status) to see the text of the error message.
For each of these methods, if the method returns $$$OK, it returns by reference (its second argument) the text reader object that contains the information in the XML document.
Additional arguments let you control entity resolution, validation, which items are found, and so on. See Argument Lists for the Parse Methods.
-
Check the status returned by the parse method and quit if appropriate.
If the parse method returned $$$OK, you have a text reader object that corresponds to the source XML document. You can navigate this object.
Your document is likely to contain nodes such as "element", "endelement", "startprefixmapping", and so on. The node types are listed in Node Types.
Important: In the case of any validation errors, your document contains "error" or "warning" nodes. Your code should check for such nodes. See Performing Validation.
-
Use one of the following instance methods to start reading the document.
-
Use Read() to navigate to the first node of the document.
-
Use ReadStartElement() to navigate to the first element of a specific type.
-
Use MoveToContent() to navigate to the first node of type "chars".
-
Get the values of the properties of interest for this node, if any. Available properties include Name, Value, Depth, and so on. See Node Properties.
-
Continue to navigate through the document as needed and get property values.
If the current node is an element, you can use the MoveToAttributeIndex() or MoveToAttributeName() methods to move the focus to attributes of that element. To return to the element, if applicable, use MoveToElement().
-
If needed, use the Rewind() method to return to the start of the document (before the first node). This is the only method that can go backward in the source.
After your method runs, the text reader object is destroyed and all related temporary storage is cleaned up.
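The attribute navigation described in the steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: ShowAttributes is a hypothetical method name, and the sketch assumes that MoveToAttributeIndex(), MoveToElement(), HasAttributes, and AttributeCount behave as described in this topic.

```objectscript
ClassMethod ShowAttributes(myfile As %String)
{
    set status=##class(%XML.TextReader).ParseFile(myfile,.textreader)
    //check status
    if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
    //iterate through document, node by node
    while textreader.Read() {
        if ((textreader.NodeType="element") && textreader.HasAttributes) {
            write !,"Element ",textreader.Name
            //attributes are numbered starting at 1
            set count=textreader.AttributeCount
            for i=1:1:count {
                //move the focus to attribute i of the current element
                do textreader.MoveToAttributeIndex(i)
                write !,"   ",textreader.Name,"=",textreader.Value
            }
            //return the focus to the element that owns the attributes
            do textreader.MoveToElement()
        }
    }
}
```

The attribute count is copied into a local variable before the loop so that the loop bound does not depend on where the focus currently is.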
Example 1
Here is a simple method that reads any XML file and shows the sequence number, type, name, and value of every node:
ClassMethod WriteNodes(myfile As %String)
{
    set status=##class(%XML.TextReader).ParseFile(myfile,.textreader)
    //check status
    if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
    //iterate through document, node by node
    while textreader.Read()
    {
        Write !, "Node ", textreader.seq, " is a(n) "
        Write textreader.NodeType," "
        If textreader.Name'=""
        {
            Write "named: ", textreader.Name
        }
        Else
        {
            Write "and has no name"
        }
        Write !, " path: ",textreader.Path
        If textreader.Value'=""
        {
            Write !, " value: ", textreader.Value
        }
    }
}
This example does the following:
-
It calls the ParseFile() class method. This reads the source file, creates a text reader object, and returns that object by reference in the variable textreader.
-
If ParseFile() is successful, the method then invokes the Read() method to find each successive node within the document.
-
For each node, the method writes output lines that contain the sequence number of the node, the node type, the node name (if any), the node path, and the node value (if any). Output is written to the current device.
Consider the following example source document:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="mystyles.css"?>
<Root>
<s01:Person xmlns:s01="http://www.root.org">
<Name attr="xyz">Willeke,Clint B.</Name>
<DOB>1925-10-01</DOB>
</s01:Person>
</Root>
For this source document, the preceding method generates the following output:
Node 1 is a(n) processinginstruction named: xml-stylesheet
path:
value: type="text/css" href="mystyles.css"
Node 2 is a(n) element named: Root
path: /Root
Node 3 is a(n) startprefixmapping named: s01
path: /Root
value: s01 http://www.root.org
Node 4 is a(n) element named: s01:Person
path: /Root/s01:Person
Node 5 is a(n) element named: Name
path: /Root/s01:Person/Name
Node 6 is a(n) chars and has no name
path: /Root/s01:Person/Name
value: Willeke,Clint B.
Node 7 is a(n) endelement named: Name
path: /Root/s01:Person/Name
Node 8 is a(n) element named: DOB
path: /Root/s01:Person/DOB
Node 9 is a(n) chars and has no name
path: /Root/s01:Person/DOB
value: 1925-10-01
Node 10 is a(n) endelement named: DOB
path: /Root/s01:Person/DOB
Node 11 is a(n) endelement named: s01:Person
path: /Root/s01:Person
Node 12 is a(n) endprefixmapping named: s01
path: /Root
value: s01
Node 13 is a(n) endelement named: Root
path: /Root
Notice that the output contains no comment nodes. By default, the %XML.TextReader class ignores comments, so any comment in the source document is skipped. For information on changing this, see Argument Lists for the Parse Methods.
Example 2
The following example reads an XML file and lists every element in it:
ClassMethod ShowElements(myfile As %String)
{
    set status = ##class(%XML.TextReader).ParseFile(myfile,.textreader)
    //check status
    if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
    //iterate through document, node by node
    while textreader.Read()
    {
        if (textreader.NodeType = "element")
        {
            write textreader.Name,!
        }
    }
}
This method checks the type of each node, by using the NodeType property. If the node is an element, the method prints its name to the current device. For the XML source document shown earlier, this method generates the following output:
Root
s01:Person
Name
DOB
Node Types
Each node of a document is one of the following types:
Type | Description |
---|---|
"attribute" | An XML attribute. |
"chars" | A set of characters (such as content of an element). The %XML.TextReader class recognizes other node types ("CDATA", "EntityReference", and "EndEntity") but automatically converts them to "chars". |
"comment" | An XML comment. |
"element" | The start of an XML element. |
"endelement" | The end of an XML element. |
"endprefixmapping" | End of the context where a namespace is declared. |
"entity" | An XML entity. |
"error" | A validation error found by the parser. See Performing Validation. |
"ignorablewhitespace" | The white space between markup in a mixed content model. |
"processinginstruction" | An XML processing instruction. |
"startprefixmapping" | An XML namespace declaration, which may or may not include a namespace prefix. |
"warning" | A validation warning found by the parser. See Performing Validation. |
Notice that an XML element consists of multiple nodes. For example, consider the following XML fragment:
<Person>
<Name>Willeke,Clint B.</Name>
<DOB>1925-10-01</DOB>
</Person>
The SAX parser views this XML as the following set of nodes:
Node Number | Type of Node | Name of Node, If Any | Value of Node, If Any |
---|---|---|---|
1 | element | Person | |
2 | element | Name | |
3 | chars | | Willeke,Clint B. |
4 | endelement | Name | |
5 | element | DOB | |
6 | chars | | 1925-10-01 |
7 | endelement | DOB | |
8 | endelement | Person | |
For example, notice that the <DOB> element is considered to be three nodes: an element node, a chars node, and an endelement node. Also notice that the contents of this element are available only as the value of the chars node.
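Because the content of an element is available only on its chars node, code that wants to pair element names with their text has to remember the most recent element. The following is a minimal sketch of that idea, assuming the node sequence shown in the table above. ShowElementText is a hypothetical method name, and the fragment is passed to ParseString() as a literal string purely for illustration.

```objectscript
ClassMethod ShowElementText()
{
    //the sample fragment from this section, supplied as a string
    set xml="<Person><Name>Willeke,Clint B.</Name><DOB>1925-10-01</DOB></Person>"
    set status=##class(%XML.TextReader).ParseString(xml,.textreader)
    //check status
    if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
    set current=""
    while textreader.Read() {
        if (textreader.NodeType="element") {
            //remember the element that was most recently opened
            set current=textreader.Name
        }
        elseif (textreader.NodeType="chars") {
            //ignore chars nodes that contain only white space
            if ($zstrip(textreader.Value,"<>W")'="") {
                write !,current,": ",textreader.Value
            }
        }
    }
}
```

This simple approach is adequate only when each element contains either child elements or text, as in the fragment above; mixed content would need a stack of element names rather than a single variable.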
Node Properties
The %XML.TextReader class parses an XML document and creates a text reader object that consists of a set of nodes that correspond to the components of the document; the node types are described in Node Types.
When you change focus to a different node, the properties of the text reader object are updated to contain information about the node that you are currently examining. This section describes all the properties of the %XML.TextReader class.
AttributeCount
If the current node is an element or an attribute, this property indicates the number of attributes of the element. Within a given element, the first attribute is numbered 1.
For any other type of node, this property is 0.
Depth
Indicates the depth of the current node within the document. The root element is at depth 1; items outside the root element are at depth 0. Note that an attribute is at the same depth as the element to which it belongs. Similarly, an error or warning is at the same depth as the item that caused the error or warning.
EOF
True if the reader has reached the end of the source document; false otherwise.
HasAttributes
If the current node is an element, this property is true if that element has attributes (or false if it does not). If the current node is an attribute, this property is true.
For any other type of node, this property is false.
HasValue
True if the current node is a type of node that has a value (even if that value is null). Otherwise this property is false. Specifically, this property is true for the following types of nodes:
-
attribute
-
chars
-
comment
-
entity
-
ignorablewhitespace
-
processinginstruction
-
startprefixmapping
Note that HasValue is false for nodes of type error and warning, even though those node types have values.
IsEmptyElement
True if the current node is an element and is empty. Otherwise this property is false.
LocalName
For nodes of type attribute, element, or endelement, this is the name of the current element or attribute, without the namespace prefix. For all other types of nodes, this property is null.
Name
Fully qualified name of the current node, as appropriate for the type of node. The following table gives the details:
Node Type | Name and Example |
---|---|
attribute | The name of the attribute. For example, if an attribute is: groupID="GX078" then Name is: groupID |
element or endelement | The name of the element. For example, if an element is: <s01:Person groupID="GX078">...</s01:Person> then Name is: s01:Person |
entity | The name of the entity. |
startprefixmapping or endprefixmapping | The prefix, if any. For example, if a namespace declaration is: xmlns:s01="http://www.root.org" then Name is: s01 For another example, if a namespace declaration is: xmlns="http://www.root.org" then Name is null. |
processinginstruction | The target of the processing instruction. For example, if a processing instruction is: <?xml-stylesheet type="text/css" href="mystyles.css"?> then Name is: xml-stylesheet |
all other types | null |
NamespaceUri
For nodes of type attribute, element, or endelement, this is the namespace to which the attribute or element belongs, if any. For all other types of nodes, this property is null.
NodeType
Type of the current node. See Node Types.
Path
Path to the element. For example, consider the following XML document:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="mystyles.css"?>
<s01:Root xmlns:s01="http://www.root.org" xmlns="www.default.org">
<Person>
<Name>Willeke,Clint B.</Name>
<DOB>1925-10-01</DOB>
<GroupID>U3577</GroupID>
<Address xmlns="www.address.org">
<City>Newton</City>
<Zip>56762</Zip>
</Address>
</Person>
</s01:Root>
For the City element, the Path property is /s01:Root/Person/Address/City. Other elements are treated similarly.
ReadState
Indicates the overall state of the text reader object, one of the following:
-
"Initial" means that the Read() method has not yet been called.
-
"Interactive" means that the Read() method has been called at least once.
-
"EndOfFile" means that the end of the file has been reached.
Value
Value, if any, of the current node, as appropriate for the type of node. The following table gives the details:
Node Type | Value and Example |
---|---|
attribute | The value of the attribute. For example, if an attribute is: groupID="GX078" then Value is: GX078 |
chars | The content of the text node. For example, if an element is: <DOB>1925-10-01</DOB> then for the chars node, Value is: 1925-10-01 |
comment | The content of the comment. For example, if a comment is: <!--Comment here--> then Value is: Comment here |
entity | The definition of the entity. |
error | The error message. For an example, see Performing Validation. |
ignorablewhitespace | The content of the white space. |
processinginstruction | The entire content of the processing instruction, excluding the target. For example, if a processing instruction is: <?xml-stylesheet type="text/css" href="mystyles.css"?> then Value is: type="text/css" href="mystyles.css" |
startprefixmapping | The prefix, followed by a space, followed by the URI. For example, if a namespace declaration is: xmlns:s01="http://www.root.org" then Value is: s01 http://www.root.org |
warning | The warning message. For an example, see Performing Validation. |
all other types (including element) | null |
seq
The sequence number of this node within the document. The first node is numbered 1. Note that an attribute has the same sequence number as the element to which it belongs.
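As an illustration of how these properties change as the focus moves, the following sketch prints an indented outline of the elements in a file, using Depth for the indentation and AttributeCount for a short annotation. ShowOutline is a hypothetical method name, and the two-spaces-per-level layout is only a presentation choice.

```objectscript
ClassMethod ShowOutline(myfile As %String)
{
    set status=##class(%XML.TextReader).ParseFile(myfile,.textreader)
    //check status
    if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
    //iterate through document, node by node
    while textreader.Read() {
        if (textreader.NodeType="element") {
            //the root element is at depth 1, so it gets no indentation
            write !,$justify("",2*(textreader.Depth-1)),textreader.Name
            if textreader.HasAttributes {
                write " (",textreader.AttributeCount," attribute(s))"
            }
        }
    }
}
```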
Argument Lists for the Parse Methods
To specify a document source, you use the ParseFile(), ParseStream(), ParseString(), or ParseURL() method of your text reader. In any case, the source document must be a well-formed XML document; that is, it must obey the basic rules of XML syntax. For these methods, only the first two arguments are required. For reference, these methods have the following arguments, in order:
-
Filename, Stream, String, or URL — Document source.
Note that for ParseFile(), the Filename argument must contain only ASCII characters.
-
TextReader — Text reader object, returned as an output parameter if the method returns $$$OK.
-
Resolver — An entity resolver to use when parsing the source. See Performing Custom Entity Resolution in Customizing How the SAX Parser Is Used.
-
Flags — A flag or combination of flags to control the validation and processing performed by the SAX parser. See Setting the Parser Flags in Customizing How the SAX Parser Is Used.
-
Mask — A mask to specify which items are of interest in the XML source. See Specifying the Event Mask in Customizing How the SAX Parser Is Used.
Tip: For the parsing methods of %XML.TextReader, the default mask is $$$SAXCONTENTEVENTS. Note that this ignores comments. To parse all possible types of nodes, use $$$SAXALLEVENTS for this argument. Note that these macros are defined in the %occSAX.inc include file.
-
SchemaSpec — A schema specification, against which to validate the document source. This argument is a string that contains a comma-separated list of namespace/URL pairs:
"namespace URL,namespace URL"
Here namespace is the XML namespace used for the schema and URL is a URL that gives the location of the schema document. There is a single space character between the namespace and URL values.
-
KeepWhiteSpace — An option to keep white space or not.
-
pHttpRequest — (For the ParseURL() method only) A request for the web server, as an instance of %Net.HttpRequest. By default, the system creates a new instance of %Net.HttpRequest and uses that, but you can instead make a request with a different instance of %Net.HttpRequest. This is useful in the case where you have a pre-existing %Net.HttpRequest with proxy and other properties already set. This option applies only to URLs of type http (not file or ftp, for example).
For details on %Net.HttpRequest, see Using Internet Utilities. Or see the class documentation for %Net.HttpRequest.
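The optional arguments are passed positionally, so unused ones are left empty. The following sketch passes $$$SAXALLEVENTS as the Mask argument (the fifth argument in the list above) so that comment nodes are included, and then prints each comment. The class name Demo.TextReaderMask and the method name ShowComments are hypothetical; the Include directive is needed because the macro is defined in %occSAX.inc.

```objectscript
Include %occSAX

Class Demo.TextReaderMask Extends %RegisteredObject
{

ClassMethod ShowComments(myfile As %String)
{
    //arguments: source, text reader, resolver (omitted), flags (omitted), mask
    set status=##class(%XML.TextReader).ParseFile(myfile,.textreader,,,$$$SAXALLEVENTS)
    //check status
    if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
    //iterate through document, node by node
    while textreader.Read() {
        if (textreader.NodeType="comment") {
            write !,"Comment: ",textreader.Value
        }
    }
}

}
```

With the default mask, the same loop would find no comment nodes, because they are not reported to the text reader.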
Performing Validation
By default, the source document is validated against any DTD or schema document provided. If the document includes a DTD section, the document is validated against that DTD. To validate against a schema document instead, specify the schema within the argument list for ParseFile(), ParseStream(), ParseString(), or ParseURL(), as described in Argument Lists for the Parse Methods.
Most types of validation issues are nonfatal and cause either an error or a warning. Specifically, nodes of type "error" or "warning" are automatically added to the document tree, at the location where the error occurred. You can navigate to and inspect these nodes in the same way as any other type of node.
For example, consider the following XML document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Root [
<!ELEMENT Root (Person)>
<!ELEMENT Person (#PCDATA)>
]>
<Root>
<Person>Smith,Joe C.</Person>
</Root>
In this case, we do not expect any validation errors. Recall the example method WriteNodes() shown earlier in this topic. If we read this document with a variation of that method that writes only the type, name, and value of each node, the output would be as follows:
Node 1 is a(n) element named: Root
and has no value
Node 2 is a(n) ignorablewhitespace and has no name
with value:
Node 3 is a(n) element named: Person
and has no value
Node 4 is a(n) chars and has no name
with value: Smith,Joe C.
Node 5 is a(n) endelement named: Person
and has no value
Node 6 is a(n) ignorablewhitespace and has no name
with value:
Node 7 is a(n) endelement named: Root
and has no value
In contrast, suppose that the file looked like this instead:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Root [
<!ELEMENT Root (Person)>
<!ELEMENT Person (#PCDATA)>
]>
<Root>
<Employee>Smith,Joe C.</Employee>
</Root>
In this case, we expect errors because the <Employee> element is not declared in the DTD section. Here, if we read this document with a variation of WriteNodes() that writes only the type, name, and value of each node, the output would be as follows:
Node 1 is a(n) element named: Root
and has no value
Node 2 is a(n) ignorablewhitespace and has no name
with value:
Node 3 is a(n) error and has no name
with value: Unknown element 'Employee'
while processing c:/TextReader/docwdtd2.txt at line 7 offset 14
Node 4 is a(n) element named: Employee
and has no value
Node 5 is a(n) chars and has no name
with value: Smith,Joe C.
Node 6 is a(n) endelement named: Employee
and has no value
Node 7 is a(n) ignorablewhitespace and has no name
with value:
Node 8 is a(n) error and has no name
with value: Element 'Employee' is not valid for content model: '(Person)'
while processing c:/TextReader/docwdtd2.txt at line 8 offset 8
Node 9 is a(n) endelement named: Root
and has no value
Also see Setting the Parser Flags in Customizing How the SAX Parser Is Used.
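Because validation problems appear as ordinary nodes, checking for them is a matter of testing NodeType during navigation. The following is a minimal sketch of such a check; ReportProblems is a hypothetical method name, and returning -1 when the parse itself fails is only a convention chosen for this example.

```objectscript
ClassMethod ReportProblems(myfile As %String) As %Integer
{
    set status=##class(%XML.TextReader).ParseFile(myfile,.textreader)
    //a failure status means the document could not be parsed at all
    if $$$ISERR(status) {
        do $System.Status.DisplayError(status)
        return -1
    }
    set count=0
    //iterate through document, node by node
    while textreader.Read() {
        if ((textreader.NodeType="error") || (textreader.NodeType="warning")) {
            set count=count+1
            //Value holds the message text for error and warning nodes
            write !,textreader.NodeType," at depth ",textreader.Depth,": ",textreader.Value
        }
    }
    //number of error and warning nodes found
    return count
}
```

A caller could treat a return value of 0 as meaning that the document parsed without any reported validation problems.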
Examples: Namespace Reporting
The following example method reads an arbitrary XML file and indicates the namespaces to which each element and attribute belongs:
ClassMethod ShowNamespacesInFile(filename As %String)
{
    Set status = ##class(%XML.TextReader).ParseFile(filename,.textreader)
    //check status
    If $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
    //iterate through document, node by node
    While textreader.Read()
    {
        If (textreader.NodeType = "element")
        {
            Write !,"The element ",textreader.LocalName
            Write " is in the namespace ",textreader.NamespaceUri
        }
        If (textreader.NodeType = "attribute")
        {
            Write !,"The attribute ",textreader.LocalName
            Write " is in the namespace ",textreader.NamespaceUri
        }
    }
}
When used in the Terminal, this method produces output like the following:
The element Person is in the namespace www://www.person.com
The element Name is in the namespace www://www.person.com
The following variation accepts an XML-enabled object, writes it to a stream, and then uses that stream to generate the same type of report:
ClassMethod ShowNamespacesInObject(obj)
{
    set writer=##class(%XML.Writer).%New()
    set str=##class(%GlobalCharacterStream).%New()
    set status=writer.OutputToStream(str)
    if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
    //write to the stream
    set status=writer.RootObject(obj)
    if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
    Set status = ##class(%XML.TextReader).ParseStream(str,.textreader)
    //check status
    If $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
    //iterate through document, node by node
    While textreader.Read()
    {
        If (textreader.NodeType = "element")
        {
            Write !,"The element ",textreader.LocalName
            Write " is in the namespace ",textreader.NamespaceUri
        }
        If (textreader.NodeType = "attribute")
        {
            Write !,"The attribute ",textreader.LocalName
            Write " is in the namespace ",textreader.NamespaceUri
        }
    }
}