Using %XML.TextReader

The %XML.TextReader class offers a simple way to read arbitrary XML documents that may or may not map directly to InterSystems IRIS® data platform objects. Specifically, this class provides ways to navigate a well-formed XML document and view the information in it (elements, attributes, comments, namespace URIs, and so on). This class also provides complete document validation, based on either a DTD or an XML schema. Unlike %XML.Reader, however, %XML.TextReader does not provide a way to return a DOM. If you require a DOM, see Importing XML into Objects.

Note:

The XML declaration of any XML document that you use should indicate the character encoding of that document, and the document should be encoded as declared. If the character encoding is not declared, InterSystems IRIS uses the defaults described in Character Encoding of Input and Output. If these defaults are not correct, modify the XML declaration so that it specifies the character set actually used.

Reading Arbitrary XML

To read an arbitrary XML document that does not necessarily have any relationship to an InterSystems IRIS object class, you invoke methods of the %XML.TextReader class, which opens the document and loads it into temporary storage as a text reader object. The text reader object contains a navigable tree of nodes, each of which contains information about the source document. Your method can then navigate the document and find out information about it. Properties of the object give you information that depends on your current location within the document. If there are validation errors, those errors are also available as nodes in the tree.

Overall Structure

Your method should do some or all of the following:

  1. Specify a document source, via the first argument of one of the following methods:

    Method First Argument
    ParseFile() A file name, with complete path. Note that the filename and path must contain only ASCII characters.
    ParseStream() A stream
    ParseString() A string
    ParseURL() A URL

    In any case, the source document must be a well-formed XML document; that is, it must obey the basic rules of XML syntax. Each of these methods returns a status ($$$OK or a failure code) to indicate whether the result was successful. You can test the status with the usual mechanisms; in particular, you can use $System.Status.DisplayError(status) to see the text of the error message.

    For each of these methods, if the method returns $$$OK, it returns by reference (its second argument) the text reader object that contains the information in the XML document.

    Additional arguments let you control entity resolution, validation, which items are found, and so on. See Argument Lists for the Parse Methods.

  2. Check the status returned by the parse method and quit if appropriate.

    If the parse method returned $$$OK, you have a text reader object that corresponds to the source XML document. You can navigate this object.

    Your document is likely to contain nodes such as "element", "endelement", "startprefixmapping", and so on. The node types are listed in Node Types.

    Important:

    In the case of any validation errors, your document contains "error" or "warning" nodes. Your code should check for such nodes. See Performing Validation.

  3. Use one of the following instance methods to start reading the document.

    • Use Read() to navigate to the first node of the document.

    • Use ReadStartElement() to navigate to the first element of a specific type.

    • Use MoveToContent() to navigate to the first node of type "chars".

    See Navigating the Document.

  4. Get the values of the properties of interest for this node, if any. Available properties include Name, Value, Depth, and so on. See Node Properties.

  5. Continue to navigate through the document as needed and get property values.

    If the current node is an element, you can use the MoveToAttributeIndex() or MoveToAttributeName() methods to move the focus to attributes of that element. To return to the element, if applicable, use MoveToElement().

  6. If needed, use the Rewind() method to return to the start of the document (before the first node). This is the only method that can go backward in the source.

After your method runs, the text reader object is destroyed and all related temporary storage is cleaned up.

Example 1

Here is a simple method that reads any XML file and shows the sequence number, type, name, and value of every node:

ClassMethod WriteNodes(myfile As %String)
{
    set status=##class(%XML.TextReader).ParseFile(myfile,.textreader)
    //check status
    if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
    //iterate through document, node by node
    while textreader.Read()
    {
        Write !, "Node ", textreader.seq, " is a(n) "
        Write textreader.NodeType," "
        If textreader.Name'=""
        {
            Write "named: ", textreader.Name
        }
        Else
        {
            Write "and has no name"
        }
        Write !, "    path: ",textreader.Path
        If textreader.Value'=""
        {
            Write !, "    value: ", textreader.Value
        }
    }
}

This example does the following:

  1. It calls the ParseFile() class method. This reads the source file, creates a text reader object, and returns that in the variable textreader by reference.

  2. If ParseFile() is successful, the method then invokes the Read() method to find each successive node within the document.

  3. For each node, the method writes output lines that contain the sequence number of the node, the node type, the node name (if any), the node path, and the node value (if any). Output is written to the current device.

Consider the following example source document:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="mystyles.css"?>
<Root>
   <s01:Person xmlns:s01="http://www.root.org">
      <Name attr="xyz">Willeke,Clint B.</Name>
      <DOB>1925-10-01</DOB>
   </s01:Person>
</Root>

For this source document, the preceding method generates the following output:

Node 1 is a(n) processinginstruction named: xml-stylesheet
    path:
    value: type="text/css" href="mystyles.css"
Node 2 is a(n) element named: Root
    path: /Root
Node 3 is a(n) startprefixmapping named: s01
    path: /Root
    value: s01 http://www.root.org
Node 4 is a(n) element named: s01:Person
    path: /Root/s01:Person
Node 5 is a(n) element named: Name
    path: /Root/s01:Person/Name
Node 6 is a(n) chars and has no name
    path: /Root/s01:Person/Name
    value: Willeke,Clint B.
Node 7 is a(n) endelement named: Name
    path: /Root/s01:Person/Name
Node 8 is a(n) element named: DOB
    path: /Root/s01:Person/DOB
Node 9 is a(n) chars and has no name
    path: /Root/s01:Person/DOB
    value: 1925-10-01
Node 10 is a(n) endelement named: DOB
    path: /Root/s01:Person/DOB
Node 11 is a(n) endelement named: s01:Person
    path: /Root/s01:Person
Node 12 is a(n) endprefixmapping named: s01
    path: /Root
    value: s01
Node 13 is a(n) endelement named: Root
    path: /Root

Note that, by default, the %XML.TextReader class ignores comments, so any comments in a source document do not appear in output like this. For information on changing this, see Argument Lists for the Parse Methods.

Example 2

The following example reads an XML file and lists every element in it:

ClassMethod ShowElements(myfile As %String)
{
    set status = ##class(%XML.TextReader).ParseFile(myfile,.textreader)
    //check status
    if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
    //iterate through document, node by node
    while textreader.Read()
    {
        if (textreader.NodeType = "element")
        {
            write textreader.Name,!
        }
    }
}

This method checks the type of each node, by using the NodeType property. If the node is an element, the method prints its name to the current device. For the XML source document shown earlier, this method generates the following output:

Root
s01:Person
Name
DOB

Node Types

Each node of a document is one of the following types:

Node Types in a Text Reader Document

Type                     Description
"attribute"              An XML attribute.
"chars"                  A set of characters (such as content of an element). The %XML.TextReader class recognizes other node types ("CDATA", "EntityReference", and "EndEntity") but automatically converts them to "chars".
"comment"                An XML comment.
"element"                The start of an XML element.
"endelement"             The end of an XML element.
"endprefixmapping"       End of the context where a namespace is declared.
"entity"                 An XML entity.
"error"                  A validation error found by the parser. See Performing Validation.
"ignorablewhitespace"    The white space between markup in a mixed content model.
"processinginstruction"  An XML processing instruction.
"startprefixmapping"     An XML namespace declaration, which may or may not include a namespace.
"warning"                A validation warning found by the parser. See Performing Validation.

Notice that an XML element consists of multiple nodes. For example, consider the following XML fragment:

<Person>
   <Name>Willeke,Clint B.</Name>
   <DOB>1925-10-01</DOB>
</Person>

The SAX parser views this XML as the following set of nodes:

Example of Document Nodes

Node Number  Type of Node  Name of Node, If Any  Value of Node, If Any
1            element       Person
2            element       Name
3            chars                               Willeke,Clint B.
4            endelement    Name
5            element       DOB
6            chars                               1925-10-01
7            endelement    DOB
8            endelement    Person

For example, notice that the <DOB> element is considered to be three nodes: an element node, a chars node, and an endelement node. Also notice that the contents of this element are available only as the value of the chars node.
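As a rough illustration, the following sketch parses the same fragment from a string and prints one line per node, so that the breakdown in the table can be checked directly. The method name ShowFragmentNodes is hypothetical, and the fragment is written without white space between elements to keep the node list short.

```objectscript
ClassMethod ShowFragmentNodes()
{
    // hypothetical demo method; the string holds the fragment discussed above
    set xml = "<Person><Name>Willeke,Clint B.</Name><DOB>1925-10-01</DOB></Person>"
    set status = ##class(%XML.TextReader).ParseString(xml,.textreader)
    if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
    // one line per node: sequence number, type, name (if any), value (if any)
    while textreader.Read()
    {
        write !, textreader.seq, ?6, textreader.NodeType, ?20, textreader.Name, ?32, textreader.Value
    }
}
```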

Node Properties

The %XML.TextReader class parses an XML document and creates a text reader object that consists of a set of nodes that correspond to the components of the document; the node types are described in Node Types.

When you change focus to a different node, the properties of the text reader object are updated to contain information about the node that you are currently examining. This section describes all the properties of the %XML.TextReader class.

AttributeCount

If the current node is an element or an attribute, this property indicates the number of attributes of the element. Within a given element, the first attribute is numbered 1.

For any other type of node, this property is 0.

Depth

Indicates the depth of the current node within the document. The root element is at depth 1; items outside the root element are at depth 0. Note that an attribute is at the same depth as the element to which it belongs. Similarly, an error or warning is at the same depth as the item that caused the error or warning.
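One possible use of Depth is to print an indented outline of a document. The sketch below is only an illustration; the method name ShowOutline is hypothetical, and the two-space indent per level is an arbitrary choice.

```objectscript
ClassMethod ShowOutline(myfile As %String)
{
    set status = ##class(%XML.TextReader).ParseFile(myfile,.textreader)
    if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
    while textreader.Read()
    {
        if (textreader.NodeType = "element")
        {
            // the root element is at depth 1, so it gets no indent;
            // each deeper level is indented by two more spaces
            write !, $justify("", 2*(textreader.Depth-1)), textreader.Name
        }
    }
}
```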

EOF

True if the reader has reached the end of the source document; false otherwise.

HasAttributes

If the current node is an element, this property is true if that element has attributes (or false if it does not). If the current node is an attribute, this property is true.

For any other type of node, this property is false.

HasValue

True if the current node is a type of node that has a value (even if that value is null). Otherwise this property is false. Specifically, this property is true for the following types of nodes:

  • attribute

  • chars

  • comment

  • entity

  • ignorablewhitespace

  • processinginstruction

  • startprefixmapping

Note that HasValue is false for nodes of type error and warning, even though those node types have values.

IsEmptyElement

True if the current node is an element and is empty. Otherwise this property is false.

LocalName

For nodes of type attribute, element, or endelement, this is the name of the current element or attribute, without the namespace prefix. For all other types of nodes, this property is null.

Name

Fully qualified name of the current node, as appropriate for the type of node. The following table gives the details:

Names for Nodes, by Type
Node Type Name and Example
attribute The name of the attribute. For example, if an attribute is:

groupID="GX078"

then Name is:

groupID

element

or

endelement

The name of the element. For example, if an element is:

<s01:Person groupID="GX078">...</s01:Person>

then Name is:

s01:Person

entity The name of the entity.
startprefixmapping

or

endprefixmapping

The prefix, if any. For example, if a namespace declaration is as follows:

xmlns:s01="http://www.root.org"

then Name is:

s01

For another example, if a namespace declaration is as follows:

xmlns="http://www.root.org"

then Name is null.

processinginstruction The target of the processing instruction. For example, if a processing instruction is:

<?xml-stylesheet type="text/css" href="mystyles.css"?>

then Name is:

xml-stylesheet

all other types null

NamespaceUri

For nodes of type attribute, element, or endelement, this is the namespace to which the attribute or element belongs, if any. For all other types of nodes, this property is null.

NodeType

Type of the current node. See Node Types.

Path

Path to the element. For example, consider the following XML document:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="mystyles.css"?>
<s01:Root xmlns:s01="http://www.root.org" xmlns="www.default.org">
   <Person>
      <Name>Willeke,Clint B.</Name>
      <DOB>1925-10-01</DOB>
      <GroupID>U3577</GroupID>
      <Address xmlns="www.address.org">
         <City>Newton</City>
         <Zip>56762</Zip>
      </Address>
   </Person>
</s01:Root>

For the City element, the Path property is /s01:Root/Person/Address/City. Other elements are treated similarly.
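Because a chars node reports the path of the element that contains it (as the output of Example 1 shows), Path can be used to pick out the content of one specific element. The following sketch assumes the document shown above; the method name ShowCity is hypothetical.

```objectscript
ClassMethod ShowCity(myfile As %String)
{
    set status = ##class(%XML.TextReader).ParseFile(myfile,.textreader)
    if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
    while textreader.Read()
    {
        // print only text that sits directly inside the City element
        if ((textreader.NodeType = "chars") && (textreader.Path = "/s01:Root/Person/Address/City"))
        {
            write !, "City: ", textreader.Value
        }
    }
}
```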

ReadState

Indicates the overall state of the text reader object, one of the following:

  • "Initial" means that the Read() method has not yet been called.

  • "Interactive" means that the Read() method has been called at least once.

  • "EndOfFile" means that the end of the file has been reached.

Value

Value, if any, of the current node, as appropriate for the type of node. The following table gives the details:

Values for Nodes, by Type
Node Type Value and Example
attribute The value of the attribute. For example, if an attribute is:

groupID="GX078"

then Value is:

GX078

chars The content of the text node. For example, if an element is:

<DOB>1925-10-01</DOB>

then for the chars node, Value is:

1925-10-01

comment The content of the comment. For example, if a comment is:

<!--Comment here-->

then Value is:

Comment here

entity The definition of the entity.
error The error message. For an example, see Performing Validation.
ignorablewhitespace The content of the white space.
processinginstruction The entire content of the processing instruction, excluding the target. For example, if a processing instruction is:

<?xml-stylesheet type="text/css" href="mystyles.css"?>

then Value is:

type="text/css" href="mystyles.css"

startprefixmapping The prefix, followed by a space, followed by the URI. For example, if a namespace declaration is as follows:

xmlns:s01="http://www.root.org"

then Value is:

s01 http://www.root.org

warning The warning message. For an example, see Performing Validation.
all other types (including element) null

seq

The sequence number of this node within the document. The first node is numbered 1. Note that an attribute has the same sequence number as the element to which it belongs.

Argument Lists for the Parse Methods

To specify a document source, you use the ParseFile(), ParseStream(), ParseString(), or ParseURL() method of your text reader. In any case, the source document must be a well-formed XML document; that is, it must obey the basic rules of XML syntax. For these methods, only the first two arguments are required. For reference, these methods have the following arguments, in order:

  1. Filename, Stream, String, or URL — Document source.

    Note that for ParseFile(), the Filename argument must contain only ASCII characters.

  2. TextReader — Text reader object, returned as an output parameter if the method returns $$$OK.

  3. Resolver — An entity resolver to use when parsing the source. See Performing Custom Entity Resolution in Customizing How the SAX Parser Is Used.

  4. Flags — A flag or combination of flags to control the validation and processing performed by the SAX parser. See Setting the Parser Flags in Customizing How the SAX Parser Is Used.

  5. Mask — A mask to specify which items are of interest in the XML source. See Specifying the Event Mask in Customizing How the SAX Parser Is Used.

    Tip:

    For the parsing methods of %XML.TextReader, the default mask is $$$SAXCONTENTEVENTS. Note that this ignores comments. To parse all possible types of nodes, use $$$SAXALLEVENTS for this argument. Note that these macros are defined in the %occSAX.inc include file.

  6. SchemaSpec — A schema specification, against which to validate the document source. This argument is a string that contains a comma-separated list of namespace/URL pairs:

    "namespace URL,namespace URL"
    

    Here namespace is the XML namespace used for the schema and URL is a URL that gives the location of the schema document. There is a single space character between the namespace and URL values.

  7. KeepWhiteSpace — An option to keep white space or not.

  8. pHttpRequest — (For the ParseURL() method only) A request for the web server, as an instance of %Net.HttpRequest. By default, the system creates a new instance of %Net.HttpRequest and uses that, but you can instead make a request with a different instance of %Net.HttpRequest. This is useful in the case where you have a pre-existing %Net.HttpRequest with proxy and other properties already set. This option applies only to URLs of type http (not file or ftp, for example).

    For details on %Net.HttpRequest, see Using Internet Utilities. Or see the class documentation for %Net.HttpRequest.
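As an illustration of the optional arguments, the following sketch passes $$$SAXALLEVENTS as the Mask argument so that comment nodes are included, and leaves the Resolver and Flags arguments empty so that their defaults apply. The method name ShowComments is hypothetical, and the sketch assumes that the containing class includes %occSAX.inc.

```objectscript
/// Assumes the containing class definition starts with: Include %occSAX
ClassMethod ShowComments(myfile As %String)
{
    // arguments 3 (Resolver) and 4 (Flags) are omitted; argument 5 is the Mask
    set status = ##class(%XML.TextReader).ParseFile(myfile,.textreader,,,$$$SAXALLEVENTS)
    if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
    while textreader.Read()
    {
        if (textreader.NodeType = "comment")
        {
            write !, "Comment: ", textreader.Value
        }
    }
}
```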

Navigating the Document

To navigate through the document, you use the following methods of your text reader: Read(), ReadStartElement(), MoveToAttributeIndex(), MoveToAttributeName(), MoveToElement(), MoveToContent(), and Rewind().

Navigating to the Next Node

To move to the next node in a document, use the Read() method. The Read() method returns a true value until there are no more nodes to read (that is, until the end of the document is reached). The previous examples used this method in a loop like the following:

 While (textreader.Read()) {

...

 }

Navigating to the First Occurrence of a Specific Element

You can move to the first occurrence of a specific element within a document. To do so, use the ReadStartElement() method. This method returns true if the element is found. If the element is not found, the method returns false, and the reader has advanced to the end of the file.

The ReadStartElement() method takes two arguments: the name of the element and (optionally) the namespace URI. Note that the %XML.TextReader class does not do any processing of namespace prefixes. Therefore the ReadStartElement() method regards the following two elements as having different names:

<Person xmlns="http://www.person.org">Smith,Ellen W.</Person>

<s01:Person xmlns:s01="http://www.person.org">Smith,Ellen W.</s01:Person>
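A minimal sketch of ReadStartElement() follows. It assumes a source file that contains a DOB element, such as the example document shown earlier; the method name FindFirstDOB is hypothetical.

```objectscript
ClassMethod FindFirstDOB(myfile As %String)
{
    set status = ##class(%XML.TextReader).ParseFile(myfile,.textreader)
    if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
    // the name must match exactly, including any namespace prefix
    if textreader.ReadStartElement("DOB")
    {
        write !, "Found ", textreader.Name, " at ", textreader.Path
    }
    else
    {
        // the reader is now at the end of the document
        write !, "No DOB element found"
    }
}
```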

Navigating to an Attribute

When you navigate to an element, if that element has attributes, you can navigate to them, in either of two ways:

  • Use the MoveToAttributeIndex() method to move to a specific attribute by index (ordinal position of the attribute within the element). This method takes one argument: the index number of the attribute. You can use the AttributeCount property to learn how many attributes a given element has; see Node Properties for a list of all properties.

  • Use the MoveToAttributeName() method to move to a specific attribute by name. This method takes two arguments: the name of the attribute and (optionally) the namespace URI. Note that the %XML.TextReader class does not do any processing of namespace prefixes; if an attribute has a prefix, that prefix is considered part of the attribute name.

When you are finished with the attributes for the current element, you can move to the next element in the document by invoking one of the navigation methods such as Read(). Alternatively, you can invoke the MoveToElement() method to return to the element that contains the current attribute.

For example, the following code lists all the attributes for the current node by index number:

 If (textreader.NodeType = "element") {
     // list attributes for this node
     For a = 1:1:textreader.AttributeCount {
         Do textreader.MoveToAttributeIndex(a)
         Write textreader.LocalName," = ",textreader.Value,!
     }
 }

The following code finds the value of the color attribute for the current node:

 If (textreader.NodeType = "element") {
     // find color attribute for this node
     If (textreader.MoveToAttributeName("color")) {
         Write "color = ",textreader.Value,!
     }
 }
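The fragments above can be combined into a complete method. The following sketch lists every element that has attributes, together with those attributes, and calls MoveToElement() to return the focus to the element before reading on. The method name ShowAttributes is hypothetical.

```objectscript
ClassMethod ShowAttributes(myfile As %String)
{
    set status = ##class(%XML.TextReader).ParseFile(myfile,.textreader)
    if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
    while textreader.Read()
    {
        if ((textreader.NodeType = "element") && textreader.HasAttributes)
        {
            write !, textreader.Name
            // visit each attribute of this element by index, starting at 1
            for a = 1:1:textreader.AttributeCount
            {
                do textreader.MoveToAttributeIndex(a)
                write !, "    ", textreader.Name, " = ", textreader.Value
            }
            // return the focus to the element that owns the attributes
            do textreader.MoveToElement()
        }
    }
}
```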

Navigating to the Next Node with Content

The MoveToContent() method helps you find content. Specifically:

  • If the node is of any type other than "chars", this method advances to the next node of type "chars".

  • If the node is of type "chars", this method does not advance in the file.
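As a sketch of how this might be used, the following method finds the first DOB element and then moves to the next chars node to read its content. The method name ShowFirstDOBValue is hypothetical. Rather than relying on the return value of MoveToContent(), the sketch checks NodeType afterward. Note that if the DOB element were empty, MoveToContent() would advance to a chars node belonging to a later element.

```objectscript
ClassMethod ShowFirstDOBValue(myfile As %String)
{
    set status = ##class(%XML.TextReader).ParseFile(myfile,.textreader)
    if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
    if textreader.ReadStartElement("DOB")
    {
        // the current node is the DOB element, so advance to the next chars node
        do textreader.MoveToContent()
        if (textreader.NodeType = "chars")
        {
            write !, "DOB: ", textreader.Value
        }
    }
}
```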

Rewinding

All the methods described here go forward in a document, except for the Rewind() method, which navigates to the start of the document and resets all properties.
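For example, Rewind() makes it possible to pass over the same document twice without parsing it again. The following sketch counts the elements in a first pass and lists them in a second pass; the method name CountThenList is hypothetical.

```objectscript
ClassMethod CountThenList(myfile As %String)
{
    set status = ##class(%XML.TextReader).ParseFile(myfile,.textreader)
    if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}
    // first pass: count the element nodes
    set count = 0
    while textreader.Read()
    {
        if (textreader.NodeType = "element") {set count = count + 1}
    }
    write !, count, " element(s) found"
    // return to the start of the document (before the first node)
    do textreader.Rewind()
    // second pass: list the element names
    while textreader.Read()
    {
        if (textreader.NodeType = "element") {write !, textreader.Name}
    }
}
```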

Performing Validation

By default, the source document is validated against any DTD or schema document provided. If the document includes a DTD section, the document is validated against that DTD. To validate against a schema document instead, specify the schema within the argument list for ParseFile(), ParseStream(), ParseString(), or ParseURL(), as described in Argument Lists for the Parse Methods.

Most types of validation issues are nonfatal and cause either an error or a warning. Specifically, nodes of type "error" or "warning" are automatically added to the document tree, at the location where the error occurred. You can navigate to and inspect these nodes in the same way as any other type of node.

For example, consider the following XML document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Root [
  <!ELEMENT Root (Person)>
  <!ELEMENT Person (#PCDATA)>
]>
<Root>
   <Person>Smith,Joe C.</Person>
</Root>

In this case, we do not expect any validation errors. Recall the example method WriteNodes() shown earlier in this topic. If we used a variation of that method, one that reports each node's value instead of its path, to read this document, the output would be as follows:

Node 1 is a(n) element named: Root
    and has no value
Node 2 is a(n) ignorablewhitespace and has no name
    with value:
 
Node 3 is a(n) element named: Person
    and has no value
Node 4 is a(n) chars and has no name
    with value: Smith,Joe C.
Node 5 is a(n) endelement named: Person
    and has no value
Node 6 is a(n) ignorablewhitespace and has no name
    with value:
 
Node 7 is a(n) endelement named: Root
    and has no value

In contrast, suppose that the file looked like this instead:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Root [
  <!ELEMENT Root (Person)>
  <!ELEMENT Person (#PCDATA)>
]>
<Root>
   <Employee>Smith,Joe C.</Employee>
</Root>

In this case, we expect errors because the <Employee> element is not declared in the DTD section. Here, if we use a variation of the example method WriteNodes() that reports each node's value instead of its path, the output would be as follows:

Node 1 is a(n) element named: Root
    and has no value
Node 2 is a(n) ignorablewhitespace and has no name
    with value:
 
Node 3 is a(n) error and has no name
    with value: Unknown element 'Employee' 
while processing c:/TextReader/docwdtd2.txt at line 7 offset 14
Node 4 is a(n) element named: Employee
    and has no value
Node 5 is a(n) chars and has no name
    with value: Smith,Joe C.
Node 6 is a(n) endelement named: Employee
    and has no value
Node 7 is a(n) ignorablewhitespace and has no name
    with value:
 
Node 8 is a(n) error and has no name
    with value: Element 'Employee' is not valid for content model: '(Person)' 
while processing c:/TextReader/docwdtd2.txt at line 8 offset 8
Node 9 is a(n) endelement named: Root
    and has no value

Also see Setting the Parser Flags in Customizing How the SAX Parser Is Used.
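Because validation problems appear as nodes rather than as a failure status, code that cares about validity has to look for them explicitly. The following sketch reports every "error" and "warning" node and returns how many were found. The method name ReportValidationIssues and the convention of returning -1 when the document cannot be parsed are assumptions made for this illustration.

```objectscript
ClassMethod ReportValidationIssues(myfile As %String) As %Integer
{
    set issues = 0
    set status = ##class(%XML.TextReader).ParseFile(myfile,.textreader)
    // -1 is a hypothetical convention meaning that the parse itself failed
    if $$$ISERR(status) {do $System.Status.DisplayError(status) quit -1}
    while textreader.Read()
    {
        if ((textreader.NodeType = "error") || (textreader.NodeType = "warning"))
        {
            set issues = issues + 1
            // for these node types, Value holds the message text
            write !, textreader.NodeType, ": ", textreader.Value
        }
    }
    quit issues
}
```

A caller could treat a return value of 0 as meaning that no validation issues were reported.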

Examples: Namespace Reporting

The following example method reads an arbitrary XML file and indicates the namespaces to which each element and attribute belongs:

ClassMethod ShowNamespacesInFile(filename As %String)
{
  Set status = ##class(%XML.TextReader).ParseFile(filename,.textreader)

  //check status
  If $$$ISERR(status) {do $System.Status.DisplayError(status) quit}

  //iterate through document, node by node
  While textreader.Read()
  {
    If (textreader.NodeType = "element")
    {
       Write !,"The element ",textreader.LocalName
       Write " is in the namespace ",textreader.NamespaceUri
    }
    If (textreader.NodeType = "attribute")
    {
       Write !,"The attribute ",textreader.LocalName
       Write " is in the namespace ",textreader.NamespaceUri
    }
  }
}

When used in the Terminal, this method produces output like the following:

 
The element Person is in the namespace www://www.person.com
The element Name is in the namespace www://www.person.com

The following variation accepts an XML-enabled object, writes it to a stream, and then uses that stream to generate the same type of report:

ClassMethod ShowNamespacesInObject(obj)
{
  set writer=##class(%XML.Writer).%New()

  set str=##class(%GlobalCharacterStream).%New()
  set status=writer.OutputToStream(str)
  if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}

  //write to the stream
  set status=writer.RootObject(obj)
  if $$$ISERR(status) {do $System.Status.DisplayError(status) quit}

  Set status = ##class(%XML.TextReader).ParseStream(str,.textreader)

  //check status
  If $$$ISERR(status) {do $System.Status.DisplayError(status) quit}

  //iterate through document, node by node
  While textreader.Read()
  {
    If (textreader.NodeType = "element")
    {
       Write !,"The element ",textreader.LocalName
       Write " is in the namespace ",textreader.NamespaceUri
    }
    If (textreader.NodeType = "attribute")
    {
       Write !,"The attribute ",textreader.LocalName
       Write " is in the namespace ",textreader.NamespaceUri
    }
  }
}