Working With Large Documents

The examples so far have all started by creating a data structure called a Document Object Model (DOM) to represent the whole XML document. Using XPath expressions to navigate the DOM can be both powerful and convenient, but the cost in memory consumption can be quite high. For example, parsing a 50MB XML file into a DOM might need 500MB of memory.

If you routinely work with very large XML documents, you might find that XML::LibXML’s DOM parser wants to consume more memory than your system has installed. In such cases, you can instead use the ‘pull parser’ API, which is accessed via the XML::LibXML::Reader interface.

The Reader Loop

To gain a better understanding of how the reader API is used, let’s start by seeing what happens when we parse this very simple XML document:

<country code="IE">
  <name>Ireland</name>
  <population>4761657</population>
</country>

This script loads the reader API and parses the XML file:

#!/usr/bin/perl

use 5.010;
use strict;
use warnings;

use XML::LibXML::Reader;

my $filename = 'country.xml';

my $reader = XML::LibXML::Reader->new(location => $filename)
    or die "cannot read file '$filename': $!\n";

while($reader->read) {
    printf(
        "Node type: %2u  Depth: %2u  Name: %s\n",
        $reader->nodeType,
        $reader->depth,
        $reader->name
    );
}

and produces the following output:

Node type:  1  Depth:  0  Name: country
Node type: 14  Depth:  1  Name: #text
Node type:  1  Depth:  1  Name: name
Node type:  3  Depth:  2  Name: #text
Node type: 15  Depth:  1  Name: name
Node type: 14  Depth:  1  Name: #text
Node type:  1  Depth:  1  Name: population
Node type:  3  Depth:  2  Name: #text
Node type: 15  Depth:  1  Name: population
Node type: 14  Depth:  1  Name: #text
Node type: 15  Depth:  0  Name: country

We can see from the output that the while loop executes 11 times. As the XML document is parsed, the $reader object acts as a cursor advancing through the document. Each time a ‘node’ has been parsed, the read method returns to allow the state of the parse and the current node to be interrogated.

To make sense of it we really need to turn those ‘Node Type’ numbers into something a bit more readable. The XML::LibXML::Reader module exports a set of constants for this purpose. Here’s a modified version of the script:

#!/usr/bin/perl

use 5.010;
use strict;
use warnings;

use XML::LibXML::Reader;

my $filename = 'country.xml';

my $reader = XML::LibXML::Reader->new(location => $filename)
    or die "cannot read file '$filename': $!\n";

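# The '&' prefix forces each constant to be called as a function -
# without it the fat comma would quote the name as a bareword string
# instead of using the constant's numeric value.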
my %type_name = (
    &XML_READER_TYPE_ELEMENT                 => 'ELEMENT',
    &XML_READER_TYPE_ATTRIBUTE               => 'ATTRIBUTE',
    &XML_READER_TYPE_TEXT                    => 'TEXT',
    &XML_READER_TYPE_CDATA                   => 'CDATA',
    &XML_READER_TYPE_ENTITY_REFERENCE        => 'ENTITY_REFERENCE',
    &XML_READER_TYPE_ENTITY                  => 'ENTITY',
    &XML_READER_TYPE_PROCESSING_INSTRUCTION  => 'PROCESSING_INSTRUCTION',
    &XML_READER_TYPE_COMMENT                 => 'COMMENT',
    &XML_READER_TYPE_DOCUMENT                => 'DOCUMENT',
    &XML_READER_TYPE_DOCUMENT_TYPE           => 'DOCUMENT_TYPE',
    &XML_READER_TYPE_DOCUMENT_FRAGMENT       => 'DOCUMENT_FRAGMENT',
    &XML_READER_TYPE_NOTATION                => 'NOTATION',
    &XML_READER_TYPE_WHITESPACE              => 'WHITESPACE',
    &XML_READER_TYPE_SIGNIFICANT_WHITESPACE  => 'SIGNIFICANT_WHITESPACE',
    &XML_READER_TYPE_END_ELEMENT             => 'END_ELEMENT',
);

say " Step | Node Type               | Depth | Name";
say "------+-------------------------+-------+-------";

my $step = 1;
while($reader->read) {
    printf(
        " %3u  | %-22s  | %4u  | %s\n",
        $step++,
        $type_name{$reader->nodeType},
        $reader->depth,
        $reader->name
    );
}

that produces the following tidier output:

 Step | Node Type               | Depth | Name
------+-------------------------+-------+-------
   1  | ELEMENT                 |    0  | country
   2  | SIGNIFICANT_WHITESPACE  |    1  | #text
   3  | ELEMENT                 |    1  | name
   4  | TEXT                    |    2  | #text
   5  | END_ELEMENT             |    1  | name
   6  | SIGNIFICANT_WHITESPACE  |    1  | #text
   7  | ELEMENT                 |    1  | population
   8  | TEXT                    |    2  | #text
   9  | END_ELEMENT             |    1  | population
  10  | SIGNIFICANT_WHITESPACE  |    1  | #text
  11  | END_ELEMENT             |    0  | country

from the same XML as before:

<country code="IE">
  <name>Ireland</name>
  <population>4761657</population>
</country>

Some things to note:

  • At step 1, when the read method returns for the first time, the cursor has advanced to the closing ‘>’ of the <country> start tag. We could retrieve an attribute value by calling $reader->getAttribute('code') but we can’t examine child elements or text nodes because the parser has not seen them yet (see the sketch after this list).
  • At step 2, the parser has processed a chunk of text and found that it contains only whitespace (side note: all whitespace is considered to be ‘significant’ unless a DTD is loaded and defines which whitespace is insignificant). Although we can get access to the text, the $reader object can no longer tell us that it is a child of a <country> element - the parser has discarded that information already.
  • At step 3, the parser can tell us the current node is a <name> element, and the depth method can tell us that there is one ancestor element. However, there is no way to determine the name of the parent element.
  • At step 4 a text node has been identified and we can call $reader->value to get the text string "Ireland", but the parser can no longer tell us the name of the element it belongs to.
  • At step 5 we have reached the end of the <name> element, but we no longer have access to the text it contained.
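
To see those constraints in code, here’s a minimal sketch (using the same country.xml file) that grabs the code attribute as soon as the <country> start tag has been parsed, and each text value as its TEXT node arrives:

#!/usr/bin/perl

use 5.010;
use strict;
use warnings;

use XML::LibXML::Reader;

my $reader = XML::LibXML::Reader->new(location => 'country.xml')
    or die "cannot read file 'country.xml': $!\n";

while($reader->read) {
    # attributes are available as soon as the start tag has been parsed
    if($reader->nodeType == XML_READER_TYPE_ELEMENT
        and $reader->name eq 'country') {
        say 'code: ', $reader->getAttribute('code');
    }
    # text content only becomes available at the TEXT node itself
    elsif($reader->nodeType == XML_READER_TYPE_TEXT) {
        say 'text: ', $reader->value;
    }
}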

By now you surely get the idea: the XML::LibXML::Reader API is able to keep its memory requirements low by discarding data from one parse step before proceeding to the next. The much lower memory demands come at the cost of significantly reduced convenience for the programmer. However, as we’ll see in the next section, there is a middle ground that can provide the convenience of the DOM API combined with the reduced memory usage of the Reader API.

Bring Back the DOM

Huge XML documents usually contain a long list of similar elements. For example, Wikipedia makes XML ‘dumps’ available for download.

At the time of writing, the enwiki-latest-abstract1.xml.gz file was about 100MB in size - about 800MB uncompressed. However, it contained information summarising over half a million Wikipedia articles. So whilst the file is very large, the <doc> elements describing each article are, on average, less than 1.5KB. The following extract is reformatted for clarity to illustrate the file structure:

<feed>
  <doc>
    <title>Wikipedia: Anarchism</title>
    <url>https://en.wikipedia.org/wiki/Anarchism</url>
    <abstract>Anarchism is a political philosophy that advocates
    self-governed societies based on voluntary institutions.
    These are often described as stateless societies …</abstract>
    <links>
      <sublink linktype="nav">
        <anchor>History</anchor>
        <link>https://en.wikipedia.org/wiki/Anarchism#History</link>
      </sublink>
      <sublink linktype="nav">
        <anchor>Origins</anchor>
        <link>https://en.wikipedia.org/wiki/Anarchism#Origins</link>
      </sublink>
      <!-- more sublink elements -->
    </links>
  </doc>
  <doc>
    <title>Wikipedia: Autism</title>
    <url>https://en.wikipedia.org/wiki/Autism</url>
    <abstract></abstract>
    <links>
      <!-- sublink elements -->
    </links>
  </doc>
  <!-- (many) more doc elements -->
</feed>

To process this file, we can use the Reader API to locate each <doc> element and then parse that element and all its children into a DOM fragment. We can then use the familiar and convenient XPath tools and DOM methods to process each fragment.

Another useful technique when working with large files is to leave the files in their compressed form and use a Perl IO layer to decompress them on the fly. You can achieve this using the PerlIO::gzip module from CPAN.

To illustrate these techniques, the following script uses the Reader API to pick out each <doc> element and slurp it into a DOM fragment. XPath queries are then used to examine the child nodes and determine whether the <doc> is ‘interesting’ - does it have a sub-heading that contains a variant of the word “controversy”? Uninteresting elements are skipped; interesting elements are reported in summary form: article title, interesting subheading, URL.

#!/usr/bin/perl

use 5.010;
use strict;
use warnings;
use autodie;

use PerlIO::gzip;
use XML::LibXML::Reader;

binmode(STDOUT, ':utf8');

my $filename = 'enwiki-latest-abstract1-abridged.xml.gz';
open my $fh, '<:gzip', $filename;

my $reader = XML::LibXML::Reader->new(IO => $fh);

my $controversy_xpath = q{./links/sublink[contains(./anchor, 'Controvers')]};

while($reader->read) {
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    next unless $reader->name eq 'doc';
    my $doc = $reader->copyCurrentNode(1);
    if(my($target) = $doc->findnodes($controversy_xpath)) {
        say 'Title: ', $doc->findvalue('./title');
        say '  ', $target->findvalue('./anchor');
        say '  ', $target->findvalue('./link');
        say '';
    }
    $reader->next;
}

In the script above, $doc is a DOM fragment that can be queried and manipulated using the DOM methods described in earlier chapters.

At the start of the while loop, a couple of conditional next statements allow skipping quickly to the start of the next <doc> element. Depending on the document you’re dealing with, you might also need to use the depth method to avoid more deeply nested elements that happen to be named “doc”.
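
For example, in this file the <doc> elements are always direct children of the root <feed> element, so a sketch of such a check might look like this:

while($reader->read) {
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    next unless $reader->name eq 'doc';
    next unless $reader->depth == 1;   # only <doc> elements directly inside <feed>
    # ... process the <doc> element as before ...
}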

The call to $reader->copyCurrentNode(1) creates a DOM fragment from the current element. The 1 passed as an argument is a boolean flag that causes all child elements to be included.

In order to build the DOM fragment, the $reader has to process all content up to the matching XML_READER_TYPE_END_ELEMENT node. You may be surprised to learn that this does not advance the cursor. So the next call to $reader->read will advance to the first child node of the current <doc>. In our case, that would be a waste of time - there is no need to use the Reader API to re-process the child nodes that we already processed with the DOM API. Therefore after processing a <doc>, we call $reader->next to skip directly to the node following the matching </doc> end tag. When this script was used to process the full-sized file, adding this call to next reduced the run time by almost 50%.

When processing files with millions of elements, a small optimisation in the main loop can make a noticeable difference to the run time. For example, building the DOM fragment is a relatively expensive operation. The call to $reader->copyCurrentNode(1) is equivalent to:

my $xml = $reader->readOuterXml;
my $doc = XML::LibXML->load_xml(string => $xml);

As an optimisation, we can avoid the step of building the DOM fragment if a quick regex check of the source XML tells us that it doesn’t contain the word we’re going to look for with the XPath query. This rewritten main loop shaves about 20% off the run time:

my $controversy_xpath = q{/doc/links/sublink[contains(./anchor, 'Controvers')]};

while($reader->read) {
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    next unless $reader->name eq 'doc';
    my $xml = $reader->readOuterXml;
    if($xml =~ /Controvers/) {
        my $doc = XML::LibXML->load_xml(string => $xml);
        if(my($target) = $doc->findnodes($controversy_xpath)) {
            say 'Title: ', $doc->findvalue('/doc/title');
            say '  ', $target->findvalue('./anchor');
            say '  ', $target->findvalue('./link');
            say '';
        }
    }
    $reader->next;
}

Error Handling

Error handling is a little different with the Reader API than with the DOM API. The DOM API will parse the whole document and throw an exception immediately if it encounters an error in the XML - so if there’s an error, you won’t get a DOM at all.

The Reader API, on the other hand, will start returning nodes to your script via $reader->read as soon as parsing starts [1]. If there is an error in your document, you won’t know until the parser reaches it - at that point you’ll get the exception.

You need to bear this in mind when parsing with the Reader API. For example, if you were reading elements to populate records in a database, you might want to wrap all the database INSERT statements in a transaction so that you can roll them all back if you encounter a parse error.
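
As a sketch of that approach using DBI - here $dbh is assumed to be an already-connected database handle, and add_record() is a hypothetical routine that runs the INSERT statements for one <doc> element:

$dbh->begin_work;    # start a transaction

eval {
    while($reader->read) {
        next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
        next unless $reader->name eq 'doc';
        add_record($dbh, $reader->copyCurrentNode(1));   # hypothetical helper
        $reader->next;
    }
    $dbh->commit;    # commit only after the whole document parsed cleanly
};
if(my $error = $@) {
    $dbh->rollback;  # a parse error occurred - discard the partial inserts
    die "parse failed, all inserts rolled back: $error";
}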

Another useful technique is to parse the document twice, once to check the XML is well-formed and once to actually process it. The finish method provides a quick way to parse from the current position to the end of the document:

    my $reader = XML::LibXML::Reader->new(IO => $fh);
    $reader->finish;

You’ll then need to reopen the file and create a new Reader object for the second parse.
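
Putting the two passes together might look something like this sketch, which reuses $filename, the PerlIO::gzip layer and autodie from the earlier script:

# Pass 1: check the whole document is well-formed before doing any real
# work (finish will die, or return false, if it hits a parse error)
open my $fh, '<:gzip', $filename;
my $checker = XML::LibXML::Reader->new(IO => $fh);
eval { $checker->finish } or die "bad XML in '$filename': $@";
close $fh;

# Pass 2: reopen the file and process it for real
open $fh, '<:gzip', $filename;
my $reader = XML::LibXML::Reader->new(IO => $fh);
while($reader->read) {
    # ... process nodes as before ...
}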

In some applications you might scan through the file looking for a specific section. Once the target has been located and the required information extracted, you might not need to look at any more elements. However, as we’ve seen, you should still call finish to ensure there are no errors in the rest of the XML.

Working With Patterns

Our sample script identifies elements at the top of the main loop by examining the node type and the node name:

while($reader->read) {
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    next unless $reader->name eq 'doc';

Although these are simple checks, they do still involve two method calls and passing scalar values across the XS boundary between libxml2 and the Perl runtime. An alternative approach is to compile a ‘pattern’ (essentially a simplified subset of XPath) using XML::LibXML::Pattern and run a complex set of checks with a single method call:

my $doc_pattern = XML::LibXML::Pattern->new('/feed/doc');
while($reader->read) {
    next unless $reader->matchesPattern($doc_pattern);

In our example, the <doc> elements that we’re interested in are all adjacent, so when we finish processing one, the very next element is another <doc>. If your document is not structured this way, you might find it useful to skip over large sections of the document to find the next element that matches a pattern, like this:

$reader->nextPatternMatch($pattern);
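
A sketch of that approach (this assumes nextPatternMatch returns true each time the cursor lands on a matching node, and false at the end of the document):

my $doc_pattern = XML::LibXML::Pattern->new('/feed/doc');

# jump from one matching <doc> straight to the next,
# without inspecting any of the nodes in between
while($reader->nextPatternMatch($doc_pattern)) {
    my $doc = $reader->copyCurrentNode(1);
    # ... query the DOM fragment as before ...
}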

You can also use patterns with the preservePattern method to create a DOM subset of a larger document. For example:

my $filename = 'enwiki-latest-abstract1-structure.xml';

my $reader = XML::LibXML::Reader->new(location => $filename);
$reader->preservePattern('/feed/doc/title');
$reader->finish;

say $reader->document->toString(1);

This produces the following output:

<?xml version="1.0"?>
<feed>
  <doc>
    <title>Wikipedia: Anarchism</title>
  </doc>
  <doc>
    <title>Wikipedia: Autism</title>
  </doc>
</feed>

Note that this technique does construct the DOM in memory and then serialise it at the end, so if you have a huge document and many nodes match the pattern, you will consume a large amount of memory.

Footnotes

[1] In practice, the Reader API will read the XML in chunks and check each chunk is well-formed before it starts delivering node events. This means that a short document with an error may trigger an exception before any nodes have been delivered.