Working With Large Documents¶
The examples so far have all started by creating a data structure called a Document Object Model to represent the whole XML document. Using XPath expressions to navigate the DOM can be both powerful and convenient, but the cost in memory consumption can be quite high. For example, parsing a 50MB XML file into a DOM might need 500MB of memory.
If you routinely work with very large XML documents, you might find that XML::LibXML's DOM parser wants to consume more memory than your system has installed. In such cases, you can instead use the ‘pull parser’ API, which is accessed via the XML::LibXML::Reader interface.
The Reader Loop¶
To gain a better understanding of how the reader API is used, let’s start by seeing what happens when we parse this very simple XML document:
<country code="IE">
  <name>Ireland</name>
  <population>4761657</population>
</country>
This script loads the reader API and parses the XML file:
#!/usr/bin/perl

use 5.010;
use strict;
use warnings;

use XML::LibXML::Reader;

my $filename = 'country.xml';
my $reader = XML::LibXML::Reader->new(location => $filename)
    or die "cannot read file '$filename': $!\n";

while($reader->read) {
    printf(
        "Node type: %2u  Depth: %2u  Name: %s\n",
        $reader->nodeType,
        $reader->depth,
        $reader->name
    );
}
and produces the following output:
Node type: 1 Depth: 0 Name: country
Node type: 14 Depth: 1 Name: #text
Node type: 1 Depth: 1 Name: name
Node type: 3 Depth: 2 Name: #text
Node type: 15 Depth: 1 Name: name
Node type: 14 Depth: 1 Name: #text
Node type: 1 Depth: 1 Name: population
Node type: 3 Depth: 2 Name: #text
Node type: 15 Depth: 1 Name: population
Node type: 14 Depth: 1 Name: #text
Node type: 15 Depth: 0 Name: country
We can see from the output that the while loop executes 11 times. As the XML document is parsed, the $reader object acts as a cursor advancing through the document. Each time a ‘node’ has been parsed, the read method returns to allow the state of the parse and the current node to be interrogated.
To make sense of it we really need to turn those ‘Node Type’ numbers into something a bit more readable. The XML::LibXML::Reader module exports a set of constants for this purpose. Here's a modified version of the script:
#!/usr/bin/perl

use 5.010;
use strict;
use warnings;

use XML::LibXML::Reader;

my $filename = 'country.xml';
my $reader = XML::LibXML::Reader->new(location => $filename)
    or die "cannot read file '$filename': $!\n";

my %type_name = (
    &XML_READER_TYPE_ELEMENT                => 'ELEMENT',
    &XML_READER_TYPE_ATTRIBUTE              => 'ATTRIBUTE',
    &XML_READER_TYPE_TEXT                   => 'TEXT',
    &XML_READER_TYPE_CDATA                  => 'CDATA',
    &XML_READER_TYPE_ENTITY_REFERENCE       => 'ENTITY_REFERENCE',
    &XML_READER_TYPE_ENTITY                 => 'ENTITY',
    &XML_READER_TYPE_PROCESSING_INSTRUCTION => 'PROCESSING_INSTRUCTION',
    &XML_READER_TYPE_COMMENT                => 'COMMENT',
    &XML_READER_TYPE_DOCUMENT               => 'DOCUMENT',
    &XML_READER_TYPE_DOCUMENT_TYPE          => 'DOCUMENT_TYPE',
    &XML_READER_TYPE_DOCUMENT_FRAGMENT      => 'DOCUMENT_FRAGMENT',
    &XML_READER_TYPE_NOTATION               => 'NOTATION',
    &XML_READER_TYPE_WHITESPACE             => 'WHITESPACE',
    &XML_READER_TYPE_SIGNIFICANT_WHITESPACE => 'SIGNIFICANT_WHITESPACE',
    &XML_READER_TYPE_END_ELEMENT            => 'END_ELEMENT',
);

say " Step | Node Type              | Depth | Name";
say "------+------------------------+-------+-------";

my $step = 1;
while($reader->read) {
    printf(
        "  %3u | %-22s | %4u  | %s\n",
        $step++,
        $type_name{$reader->nodeType},
        $reader->depth,
        $reader->name
    );
}
that produces the following tidier output:
Step | Node Type | Depth | Name
------+-------------------------+-------+-------
1 | ELEMENT | 0 | country
2 | SIGNIFICANT_WHITESPACE | 1 | #text
3 | ELEMENT | 1 | name
4 | TEXT | 2 | #text
5 | END_ELEMENT | 1 | name
6 | SIGNIFICANT_WHITESPACE | 1 | #text
7 | ELEMENT | 1 | population
8 | TEXT | 2 | #text
9 | END_ELEMENT | 1 | population
10 | SIGNIFICANT_WHITESPACE | 1 | #text
11 | END_ELEMENT | 0 | country
from the same XML:

<country code="IE">
  <name>Ireland</name>
  <population>4761657</population>
</country>
Some things to note:

- At step 1, when the read method returns for the first time, the cursor has advanced to the closing ‘>’ of the <country> start tag. We could retrieve an attribute value by calling $reader->getAttribute('code'), but we can't examine child elements or text nodes because the parser has not seen them yet.
- At step 2, the parser has processed a chunk of text and found that it contains only whitespace (side note: all whitespace is considered to be ‘significant’ unless a DTD is loaded and defines which whitespace is insignificant). Although we can get access to the text, the $reader object can no longer tell us that it is a child of a <country> element - the parser has already discarded that information.
- At step 3, the parser can tell us that the current node is a <name> element, and the depth method can tell us that there is one ancestor element. However there is no way to determine the name of the parent element.
- At step 4, a text node has been identified and we can call $reader->value to get the text string "Ireland", but the parser can no longer tell us the name of the element it belongs to.
- At step 5, we have reached the end of the <name> element, but we no longer have access to the text it contained.
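To make the bookkeeping concrete, here's a minimal sketch of a read loop that tracks context itself - remembering the most recent element name so that each text value can be attributed to the right element. The sample document is inlined as a string so the sketch is self-contained:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Inline copy of the country.xml content shown above
my $xml = <<'XML';
<country code="IE">
  <name>Ireland</name>
  <population>4761657</population>
</country>
XML

my $reader = XML::LibXML::Reader->new(string => $xml);

my ($code, $element, %value);
while($reader->read) {
    if($reader->nodeType == XML_READER_TYPE_ELEMENT) {
        # Attributes are only reachable while the cursor is on the start tag
        $code = $reader->getAttribute('code') if $reader->name eq 'country';
        $element = $reader->name;
    }
    elsif($reader->nodeType == XML_READER_TYPE_TEXT) {
        # We must remember the element name ourselves - the reader has
        # already discarded it by the time the text node arrives
        $value{$element} = $reader->value;
    }
}

print "code=$code name=$value{name} population=$value{population}\n";
```

Running this prints `code=IE name=Ireland population=4761657` - the significant-whitespace nodes are skipped because their type is neither ELEMENT nor TEXT.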
By now you surely get the idea - the XML::LibXML::Reader API is able to keep its memory requirements low by discarding data from one parse step before proceeding to the next. The vastly lowered memory demands come at the cost of significantly lowered convenience for the programmer. However, as we'll see in the next section, there is a middle ground that can provide the convenience of the DOM API combined with the reduced memory usage of the Reader API.
Bring Back the DOM¶
Huge XML documents usually contain a long list of similar elements. For example, Wikipedia makes XML ‘dumps’ available for download.
At the time of writing, the enwiki-latest-abstract1.xml.gz file was about 100MB in size - about 800MB uncompressed. However it contained information summarising over half a million Wikipedia articles. So whilst the file is very large, the <doc> elements describing each article are, on average, less than 1.5KB. The following extract is reformatted for clarity to illustrate the file structure:
<feed>
  <doc>
    <title>Wikipedia: Anarchism</title>
    <url>https://en.wikipedia.org/wiki/Anarchism</url>
    <abstract>Anarchism is a political philosophy that advocates
      self-governed societies based on voluntary institutions.
      These are often described as stateless societies …</abstract>
    <links>
      <sublink linktype="nav">
        <anchor>History</anchor>
        <link>https://en.wikipedia.org/wiki/Anarchism#History</link>
      </sublink>
      <sublink linktype="nav">
        <anchor>Origins</anchor>
        <link>https://en.wikipedia.org/wiki/Anarchism#Origins</link>
      </sublink>
      <!-- more sublink elements -->
    </links>
  </doc>
  <doc>
    <title>Wikipedia: Autism</title>
    <url>https://en.wikipedia.org/wiki/Autism</url>
    <abstract>…</abstract>
    <links>
      <!-- sublink elements -->
    </links>
  </doc>
  <!-- (many) more doc elements -->
</feed>
To process this file, we can use the Reader API to locate each <doc> element and then parse that element and all its children into a DOM fragment. We can then use the familiar and convenient XPath tools and DOM methods to process each fragment.
Another useful technique when working with large files is to leave the files in their compressed form and use a Perl IO layer to decompress them on the fly. You can achieve this using the PerlIO::gzip module from CPAN.
To illustrate these techniques, the following script uses the Reader API to pick out each <doc> element and slurp it into a DOM fragment. Then XPath queries are used to examine the child nodes and determine whether the <doc> is ‘interesting’ - does it have a sub-heading that contains a variant of the word “controversy”? Uninteresting elements are skipped; interesting elements are reported in summary form: article title, interesting subheading, URL.
#!/usr/bin/perl

use 5.010;
use strict;
use warnings;
use autodie;

use PerlIO::gzip;
use XML::LibXML::Reader;

binmode(STDOUT, ':utf8');

my $filename = 'enwiki-latest-abstract1-abridged.xml.gz';
open my $fh, '<:gzip', $filename;
my $reader = XML::LibXML::Reader->new(IO => $fh);

my $controversy_xpath = q{./links/sublink[contains(./anchor, 'Controvers')]};

while($reader->read) {
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    next unless $reader->name eq 'doc';
    my $doc = $reader->copyCurrentNode(1);
    if(my($target) = $doc->findnodes($controversy_xpath)) {
        say 'Title: ', $doc->findvalue('./title');
        say '  ',     $target->findvalue('./anchor');
        say '  ',     $target->findvalue('./link');
        say '';
    }
    $reader->next;
}
In the script above, $doc is a DOM fragment that can be queried and manipulated using the DOM methods described in earlier chapters.

At the start of the while loop, a couple of conditional next statements allow skipping quickly to the start of the next <doc> element. Depending on the document you're dealing with, you might also need to use the depth method to avoid deeply nested elements that happen to be named “doc”.
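As a sketch of that depth guard, here's a self-contained example using a small, made-up document in which a nested element also happens to be called <doc>. Only elements that are direct children of the root (depth 1) are counted:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical document where a nested element shares the name 'doc'
my $xml = '<feed><doc><title>Real</title><extra><doc>nested</doc></extra></doc></feed>';

my $reader = XML::LibXML::Reader->new(string => $xml);

my $count = 0;
while($reader->read) {
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    next unless $reader->name eq 'doc';
    next unless $reader->depth == 1;   # only direct children of <feed>
    $count++;
    $reader->next;                     # skip the subtree, nested 'doc' included
}

print "matched $count doc element(s)\n";
```

The call to $reader->next also helps here: once a genuine <doc> has been handled, its whole subtree (including any nested “doc” impostors) is skipped in one step.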
The call to $reader->copyCurrentNode(1) creates a DOM fragment from the current element. The 1 passed as an argument is a boolean flag that causes all child elements to be included.
In order to build the DOM fragment, the $reader has to process all content up to the matching XML_READER_TYPE_END_ELEMENT node. You may be surprised to learn that this does not advance the cursor. So the next call to $reader->read will advance to the first child node of the current <doc>. In our case, that would be a waste of time - there is no need to use the Reader API to re-process the child nodes that we already processed with the DOM API. Therefore, after processing a <doc>, we call $reader->next to skip directly to the node following the matching </doc> end tag. When this script was used to process the full-sized file, adding this call to next reduced the run time by almost 50%.
When processing files with millions of elements, a small optimisation in the main loop can make a noticeable difference to the run time. For example, building the DOM fragment is a relatively expensive operation. The call to $reader->copyCurrentNode(1) is equivalent to:
my $xml = $reader->readOuterXml;
my $doc = XML::LibXML->load_xml(string => $xml);
As an optimisation, we can avoid the step of building the DOM fragment if a quick regex check of the source XML tells us that it doesn't contain the word we're going to look for with the XPath query. This rewritten main loop shaves about 20% off the run time:
my $controversy_xpath = q{/doc/links/sublink[contains(./anchor, 'Controvers')]};

while($reader->read) {
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    next unless $reader->name eq 'doc';
    my $xml = $reader->readOuterXml;
    if($xml =~ /Controvers/) {
        my $doc = XML::LibXML->load_xml(string => $xml);
        if(my($target) = $doc->findnodes($controversy_xpath)) {
            say 'Title: ', $doc->findvalue('/doc/title');
            say '  ',     $target->findvalue('./anchor');
            say '  ',     $target->findvalue('./link');
            say '';
        }
    }
    $reader->next;
}
Error Handling¶
Error handling is a little different with the Reader API than with the DOM API. The DOM API will parse the whole document and throw an exception immediately if it encounters an error in the XML. So if there's an error, you won't get a DOM at all.
The Reader API, on the other hand, will start returning nodes to your script via $reader->read as soon as the parsing starts [1]. If there is an error in your document, you won't know until your parser reaches the error - then you'll get the exception.
You need to bear this in mind when parsing with the Reader API. For example, if you were reading elements to populate records in a database, you might want to wrap all the database INSERT statements in a transaction so that you can roll them all back if you encounter a parse error.
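A sketch of that transactional approach - using a hypothetical article table in an in-memory SQLite database via DBI, and a simulated parse error standing in for the exception a real Reader loop would throw:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Hypothetical schema: one row per <doc> element
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });
$dbh->do('CREATE TABLE article (title TEXT, url TEXT)');

my $sth = $dbh->prepare('INSERT INTO article (title, url) VALUES (?, ?)');

$dbh->begin_work;
my $ok = eval {
    # In a real script this loop would be driven by $reader->read;
    # here we insert one static row and then simulate a parse error
    $sth->execute('Anarchism', 'https://en.wikipedia.org/wiki/Anarchism');
    die "simulated parse error\n";
    1;
};
if($ok) {
    $dbh->commit;
}
else {
    warn "rolling back: $@";
    $dbh->rollback;   # no partial data reaches the table
}

my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM article');
print "rows after rollback: $count\n";
```

Because the INSERT ran inside the transaction, the rollback leaves the table empty - the half-loaded data never becomes visible.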
Another useful technique is to parse the document twice: once to check that the XML is well-formed, and once to actually process it. The finish method provides a quick way to parse from the current position to the end of the document:
my $reader = XML::LibXML::Reader->new(IO => $fh);
$reader->finish;
You’ll then need to reopen the file and create a new Reader object for the second parse.
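Here's a sketch of the two-pass idea, using a deliberately malformed inline string so it is self-contained (for a real file, you would reopen the filehandle between passes as described above). Wrapping the check in an eval covers both the case where finish throws an exception and the case where it merely returns false:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Deliberately malformed: the </name> close tag is missing
my $bad_xml = '<country code="IE"><name>Ireland</country>';

# Pass 1: well-formedness check only
my $checker = XML::LibXML::Reader->new(string => $bad_xml);
my $well_formed = eval { $checker->finish } ? 1 : 0;

if($well_formed) {
    # Pass 2: create a fresh Reader and do the real work
    my $reader = XML::LibXML::Reader->new(string => $bad_xml);
    while($reader->read) {
        # ... process nodes ...
    }
}

print $well_formed ? "document ok\n" : "skipping malformed document\n";
```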
In some applications you might scan through the file looking for a specific section. Once the target has been located and the required information extracted, you might not need to look at any more elements. However, as we've seen, you should still call finish to ensure there are no errors in the rest of the XML.
Working With Patterns¶
Our sample script is identifying elements at the top of the main loop by examining the node type and the node name:
while($reader->read) {
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    next unless $reader->name eq 'doc';
Although these are simple checks, they do still involve two method calls and passing scalar values across the XS boundary between libxml and the Perl runtime. An alternative approach is to compile a ‘pattern’ (essentially a simplified subset of XPath) using XML::LibXML::Pattern and run a complex set of checks with a single method call:
my $doc_pattern = XML::LibXML::Pattern->new('/feed/doc');

while($reader->read) {
    next unless $reader->matchesPattern($doc_pattern);
In our example, the <doc> elements that we're interested in are all adjacent, so when we finish processing one, the very next element is another <doc>. If your document is not structured this way, you might find it useful to skip over large sections of the document to find the next element that matches a pattern, like this:
$reader->nextPatternMatch($pattern);
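For example, here's a sketch (using a small made-up document in which the <doc> elements are not adjacent) that uses nextPatternMatch to jump from match to match. Note the node-type guard: the pattern can match when the cursor is on an element's end tag as well as its start tag, so we only act on the element nodes:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML::Reader;
use XML::LibXML::Pattern;

my $xml = <<'XML';
<feed>
  <junk>ignore me</junk>
  <doc><title>First</title></doc>
  <other/>
  <doc><title>Second</title></doc>
</feed>
XML

my $pattern = XML::LibXML::Pattern->new('/feed/doc');
my $reader  = XML::LibXML::Reader->new(string => $xml);

my @titles;
while($reader->nextPatternMatch($pattern)) {
    # Skip the END_ELEMENT match for each <doc>
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    my $doc = $reader->copyCurrentNode(1);
    push @titles, $doc->findvalue('./title');
}

print join(', ', @titles), "\n";
```

The intervening <junk> and <other/> elements never match the pattern, so the loop body only ever sees the <doc> elements.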
You can also use patterns with the preservePattern method to create a DOM subset of a larger document. For example:
my $filename = 'enwiki-latest-abstract1-structure.xml';
my $reader = XML::LibXML::Reader->new(location => $filename);
$reader->preservePattern('/feed/doc/title');
$reader->finish;
say $reader->document->toString(1);
This will produce the following output:
<?xml version="1.0"?>
<feed>
<doc>
<title>Wikipedia: Anarchism</title>
</doc>
<doc>
<title>Wikipedia: Autism</title>
</doc>
</feed>
Note that this technique does construct the DOM in memory and then serialise it at the end, so if you have a huge document and many nodes match the pattern, you will consume a large amount of memory.
Footnotes

[1] In practice, the Reader API will read the XML in chunks and check each chunk is well-formed before it starts delivering node events. This means that a short document with an error may trigger an exception before any nodes have been delivered.