A Basic Example
The first thing you’ll need is an XML document. The example programs in this
section will use the playlist.xml
file shown below. This file contains details of five different movies:
<playlist>
<movie id="tt0112384">
<title>Apollo 13</title>
<director>Ron Howard</director>
<release-date>1995-06-30</release-date>
<mpaa-rating>PG</mpaa-rating>
<running-time>140</running-time>
<genre>adventure</genre>
<genre>drama</genre>
<cast>
<person name="Tom Hanks" role="Jim Lovell" />
<person name="Bill Paxton" role="Fred Haise" />
<person name="Kevin Bacon" role="Jack Swigert" />
<person name="Gary Sinise" role="Ken Mattingly" />
<person name="Ed Harris" role="Gene Kranz" />
</cast>
<imdb-info url="http://www.imdb.com/title/tt0112384/">
<synopsis>
NASA must devise a strategy to return Apollo 13 to Earth safely
after the spacecraft undergoes massive internal damage putting
the lives of the three astronauts on board in jeopardy.
</synopsis>
<score>7.6</score>
</imdb-info>
</movie>
<movie id="tt0307479">
<title>Solaris</title>
<director>Steven Soderbergh</director>
<release-date>2002-11-27</release-date>
<mpaa-rating>PG-13</mpaa-rating>
<running-time>99</running-time>
<genre>drama</genre>
<genre>mystery</genre>
<genre>romance</genre>
<cast>
<person name="George Clooney" role="Chris Kelvin" />
<person name="Natascha McElhone" role="Rheya" />
<person name="Ulrich Tukur" role="Gibarian" />
</cast>
<imdb-info url="http://www.imdb.com/title/tt0307479/">
<synopsis>
A troubled psychologist is sent to investigate the crew of an
isolated research station orbiting a bizarre planet.
</synopsis>
<score>6.2</score>
</imdb-info>
</movie>
<movie id="tt1731141">
<title>Ender's Game</title>
<director>Gavin Hood</director>
<release-date>2013-11-01</release-date>
<mpaa-rating>PG-13</mpaa-rating>
<running-time>114</running-time>
<genre>action</genre>
<genre>scifi</genre>
<cast>
<person name="Asa Butterfield" role="Ender Wiggin" />
<person name="Harrison Ford" role="Colonel Graff" />
<person name="Hailee Steinfeld" role="Petra Arkanian" />
</cast>
<imdb-info url="http://www.imdb.com/title/tt1731141/">
<synopsis>
Young Ender Wiggin is recruited by the International Military
to lead the fight against the Formics, a genocidal alien race
which nearly annihilated the human race in a previous invasion.
</synopsis>
<score>6.7</score>
</imdb-info>
</movie>
<movie id="tt0816692">
<title>Interstellar</title>
<director>Christopher Nolan</director>
<release-date>2014-11-07</release-date>
<mpaa-rating>PG-13</mpaa-rating>
<running-time>169</running-time>
<genre>adventure</genre>
<genre>drama</genre>
<genre>scifi</genre>
<cast>
<person name="Matthew McConaughey" role="Cooper" />
<person name="Anne Hathaway" role="Brand" />
<person name="Jessica Chastain" role="Murph" />
<person name="Michael Caine" role="Professor Brand" />
</cast>
<imdb-info url="http://www.imdb.com/title/tt0816692/">
<synopsis>
A team of explorers travel through a wormhole in space in an
attempt to ensure humanity's survival.
</synopsis>
<score>8.6</score>
</imdb-info>
</movie>
<movie id="tt3659388">
<title>The Martian</title>
<director>Ridley Scott</director>
<release-date>2015-10-02</release-date>
<mpaa-rating>PG-13</mpaa-rating>
<running-time>144</running-time>
<genre>adventure</genre>
<genre>drama</genre>
<genre>scifi</genre>
<cast>
<person name="Matt Damon" role="Mark Watney" />
<person name="Jessica Chastain" role="Melissa Lewis" />
<person name="Kristen Wiig" role="Annie Montrose" />
</cast>
<imdb-info url="http://www.imdb.com/title/tt3659388/">
<synopsis>
During a manned mission to Mars, Astronaut Mark Watney is
presumed dead after a fierce storm and left behind by his crew.
But Watney has survived and finds himself stranded and alone on
the hostile planet. With only meager supplies, he must draw upon
his ingenuity, wit and spirit to subsist and find a way to
signal to Earth that he is alive.
</synopsis>
<score>8.1</score>
</imdb-info>
</movie>
</playlist>
Note
Although this XML document contains details which came from the fabulous IMDb.com web site, the file structure was created specifically for this example and does not represent an actual API for querying movie details.
Once you have the sample XML document, you can use this script
to extract and print the title of each movie,
in the order they appear in the XML:
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
use XML::LibXML;
my $filename = 'playlist.xml';
my $dom = XML::LibXML->load_xml(location => $filename);
foreach my $title ($dom->findnodes('/playlist/movie/title')) {
say $title->to_literal();
}
When run, the script will produce the following output:
Apollo 13
Solaris
Ender's Game
Interstellar
The Martian
If we break the example down line-by-line we see that after a standard
boilerplate section, the script loads the XML::LibXML
module:
use XML::LibXML;
Next, the load_xml()
class method is called to parse the XML file and
return a document object:
my $dom = XML::LibXML->load_xml(location => $filename);
The $dom
variable now contains an object representing all the elements of
the XML document arranged in a tree structure known as a
Document Object Model or ‘DOM’.
Finally we get to the guts of the script where the findnodes()
method is
called to search the DOM for the elements we’re interested in and a foreach
loop is used to iterate through the matching elements:
foreach my $title ($dom->findnodes('/playlist/movie/title')) {
say $title->to_literal();
}
The findnodes()
method takes one argument - an XPath expression. This
is a string describing the location and characteristics of the elements we want
to find. XPath is a query language and the way we use it to select elements
from the DOM is similar to the way we use SQL to select records from a
relational database. The next section (XPath Expressions) will include examples of
more complex queries.
The findnodes()
method returns a list of objects from the DOM that match
the XPath expression. Each time through the loop, $title
will contain an
object representing the next matching element. This object provides a number
of properties and methods that you can use to access the element and its
attributes, as well as any text content and ‘child’ elements.
Inside the loop, this example simply calls the to_literal()
method to get
the text content of the element. The string returned by to_literal()
will
not include any of the attributes but will include the text content of any
child elements.
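To see what "text content of any child elements" means in practice, here is a minimal sketch. It assumes a cut-down fragment modelled on the <imdb-info> element from playlist.xml (the shortened synopsis text is made up for illustration) and calls to_literal() on an element that has both an attribute and child elements:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical fragment, modelled on the <imdb-info> element in playlist.xml
my $xml = '<imdb-info url="http://www.imdb.com/title/tt0307479/">'
        . '<synopsis>A troubled psychologist.</synopsis>'
        . '<score>6.2</score>'
        . '</imdb-info>';

my $dom = XML::LibXML->load_xml(string => $xml);
my ($info) = $dom->findnodes('/imdb-info');

# The text of both child elements is concatenated;
# the value of the url attribute is not included
my $text = $info->to_literal();
print "$text\n";    # prints: A troubled psychologist.6.2
```

Because the child text is simply run together, to_literal() is most useful on elements that contain only text, such as <title>.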
Other XML sources
The first example script called XML::LibXML->load_xml()
with the
location
argument set to the name of a file. The location
argument
also accepts a URL:
$dom = XML::LibXML->load_xml(location => 'http://techcrunch.com/feed/');
Note
Not all versions of libxml2
can retrieve documents over SSL/TLS. So if
the URL is an ‘https’ URL (or if it redirects to one), you may need to use
a module like LWP to retrieve
the document and pass the response body to the XML parser as a string as
shown below.
If you have the XML in a string, use string instead of location:
$dom = XML::LibXML->load_xml(string => $xml_string);
Or, you can provide a Perl file handle to parse from an open file or socket, using IO:
$dom = XML::LibXML->load_xml(IO => $fh);
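Returning to the note above about ‘https’ URLs, the following is one possible sketch of fetching a document with LWP and handing the response body to the parser as a string. The helper name load_xml_from_url and the 10 second timeout are assumptions made for illustration, and fetching ‘https’ URLs with LWP typically also requires the LWP::Protocol::https module to be installed:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use XML::LibXML;

# Hypothetical helper: fetch a URL with LWP and parse the response body
sub load_xml_from_url {
    my ($url) = @_;

    my $ua  = LWP::UserAgent->new(timeout => 10);
    my $res = $ua->get($url);
    die "Failed to fetch '$url': " . $res->status_line . "\n"
        unless $res->is_success;

    # charset => 'none' undoes any Content-Encoding (e.g. gzip) but leaves
    # the body as undecoded bytes, which is what the parser expects
    my $bytes = $res->decoded_content(charset => 'none');
    return XML::LibXML->load_xml(string => $bytes);
}
```

A call such as load_xml_from_url($url) would then return a document object, just like the earlier load_xml() examples.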
When providing a string or a file handle, it’s crucial that you do not
decode the bytes of the source data (for example by using ':utf8' when
opening a file). The underlying libxml2 library is written in C; it expects
to decode the bytes itself and does not understand Perl’s character strings.
If you have assembled your XML document by concatenating Perl character
strings, you will need to encode it to a byte string (for example using
Encode::encode_utf8()) and then pass the byte string to the parser.
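As a minimal sketch of that last case, the example below assembles a small document from Perl character strings, encodes it to UTF-8 bytes and then parses it. The document content and variable names are made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use XML::LibXML;

# Hypothetical document assembled from Perl character strings;
# \x{e9} is the character e-acute
my $title     = "Am\x{e9}lie";
my $xml_chars = qq{<?xml version="1.0" encoding="UTF-8"?>\n}
              . "<movie><title>$title</title></movie>";

# Encode to UTF-8 bytes before handing the document to the parser
my $xml_bytes = encode_utf8($xml_chars);
my $dom       = XML::LibXML->load_xml(string => $xml_bytes);

# Values read back from the DOM are Perl character strings again
my $parsed_title = $dom->findvalue('/movie/title');
print length($parsed_title), " characters\n";    # prints: 6 characters
```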
If you have enabled UTF-8 globally with something like this in your script:
use open ':encoding(utf8)';
then you’ll need to turn off the encoding IO layers for any file handle that you pass to XML::LibXML:
open my $fh, '<', $filename or die "Cannot open '$filename': $!";
binmode $fh, ':raw';
$dom = XML::LibXML->load_xml(IO => $fh);
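An alternative sketch is to name the :raw layer directly in the open() call, since layers given explicitly there are used in place of the defaults set by the open pragma. To keep the example self-contained it first writes a small, made-up document to a temporary file; the file content and variable names are assumptions for illustration:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use XML::LibXML;

# Write a small hypothetical document to a temporary file as raw bytes
# (\xC3\xA9 is the UTF-8 encoding of e-acute)
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':raw';
print {$out} qq{<?xml version="1.0" encoding="UTF-8"?>\n<title>Caf\xC3\xA9</title>\n};
close $out or die "Cannot close '$path': $!";

# Open with an explicit :raw layer so no decoding happens before parsing
open my $fh, '<:raw', $path or die "Cannot open '$path': $!";
my $dom = XML::LibXML->load_xml(IO => $fh);
close $fh;

# The parser has decoded the bytes itself, giving a 4-character string
my $title = $dom->findvalue('/title');
print length($title), " characters\n";    # prints: 4 characters
```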
A more complex example
Now let’s look at a slightly more complex example. This script
takes the same XML input and extracts more
details from each <movie>
element:
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
use XML::LibXML;
my $filename = 'playlist.xml';
my $dom = XML::LibXML->load_xml(location => $filename);
foreach my $movie ($dom->findnodes('//movie')) {
say 'Title: ', $movie->findvalue('./title');
say 'Director: ', $movie->findvalue('./director');
say 'Rating: ', $movie->findvalue('./mpaa-rating');
say 'Duration: ', $movie->findvalue('./running-time'), " minutes";
my $cast = join ', ', map {
$_->to_literal();
} $movie->findnodes('./cast/person/@name');
say 'Starring: ', $cast;
say "";
}
When run, the script will produce the following output:
Title: Apollo 13
Director: Ron Howard
Rating: PG
Duration: 140 minutes
Starring: Tom Hanks, Bill Paxton, Kevin Bacon, Gary Sinise, Ed Harris
Title: Solaris
Director: Steven Soderbergh
Rating: PG-13
Duration: 99 minutes
Starring: George Clooney, Natascha McElhone, Ulrich Tukur
Title: Ender's Game
Director: Gavin Hood
Rating: PG-13
Duration: 114 minutes
Starring: Asa Butterfield, Harrison Ford, Hailee Steinfeld
Title: Interstellar
Director: Christopher Nolan
Rating: PG-13
Duration: 169 minutes
Starring: Matthew McConaughey, Anne Hathaway, Jessica Chastain, Michael Caine
Title: The Martian
Director: Ridley Scott
Rating: PG-13
Duration: 144 minutes
Starring: Matt Damon, Jessica Chastain, Kristen Wiig
Let’s compare the main loop of the first script:
foreach my $title ($dom->findnodes('/playlist/movie/title')) {
say $title->to_literal();
}
with the main loop of the second script:
foreach my $movie ($dom->findnodes('//movie')) {
say 'Title: ', $movie->findvalue('./title');
say 'Director: ', $movie->findvalue('./director');
say 'Rating: ', $movie->findvalue('./mpaa-rating');
say 'Duration: ', $movie->findvalue('./running-time'), " minutes";
my $cast = join ', ', map {
$_->to_literal();
} $movie->findnodes('./cast/person/@name');
say 'Starring: ', $cast;
say "";
}
The structure of the main loop is very similar but the XPath expression
passed to findnodes()
is different in each case:
'/playlist/movie/title'
- Will match every <title> element which is the child of ...
  a <movie> element which is the child of ...
  a <playlist> element which is ...
  the top-level element in the document.
  Or, to phrase it a different way, the search will start at the top of the document and look for a <playlist> element; if one is found, the search will continue for child <movie> elements; and for each one that is found the search will continue for child <title> elements.
'//movie'
- Will match every <movie> element at any level of nesting.
In both cases, the XPath expression starts with a ‘/’ which means the search will start at the top of the document.
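The difference between the two expressions does not show up with playlist.xml, because every <movie> there is a direct child of <playlist>. The sketch below uses a made-up document (the <sequels> element is hypothetical) in which one <movie> is nested inside another:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical document: a <movie> nested inside another <movie>
my $xml = <<'XML';
<playlist>
  <movie>
    <title>First</title>
    <sequels>
      <movie><title>Second</title></movie>
    </sequels>
  </movie>
</playlist>
XML

my $dom = XML::LibXML->load_xml(string => $xml);

# Only <title> elements found exactly at /playlist/movie/title
my @titles = map { $_->to_literal() } $dom->findnodes('/playlist/movie/title');

# Every <movie> element, however deeply nested
my @movies = $dom->findnodes('//movie');

print scalar(@titles), " title(s), ", scalar(@movies), " movie(s)\n";
# prints: 1 title(s), 2 movie(s)
```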
Inside the second script’s loop are a number of calls to findvalue(). This
is a handy shortcut method that is typically used when you expect the XPath
expression to match exactly one node. It combines the functionality of
findnodes()
and to_literal()
into a single method. So this code:
$movie->findvalue('./title');
is equivalent to:
$movie->findnodes('./title')->to_literal();
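A short sketch can confirm the equivalence. It assumes a single made-up <movie> element parsed from a string, and compares the result of the shortcut with the longer form:

```perl
use strict;
use warnings;
use XML::LibXML;

my $dom = XML::LibXML->load_xml(
    string => '<movie><title>Solaris</title>'
            . '<director>Steven Soderbergh</director></movie>'
);
my ($movie) = $dom->findnodes('/movie');

# The shortcut form
my $short = $movie->findvalue('./title');

# The longer form: calling a method on the result puts findnodes() in
# scalar context, so it returns a node list object rather than a list
my $long = $movie->findnodes('./title')->to_literal();

print "short: $short\n";    # prints: short: Solaris
print "long:  $long\n";     # prints: long:  Solaris
```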
There are a couple of other interesting differences with the XPath searches in
the loop compared to previous examples. Firstly, the findvalue()
method is
being called on $movie
(which represents one <movie>
element) rather
than on $dom
(which represents the whole document). This means that the
$movie
element is the context element. Secondly, the XPath expression
starts with a ‘.’ which means: start the search at the context element rather
than at the top of the document.
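The effect of the leading ‘.’ can be seen by running a relative and an absolute expression from the same context element. This is a sketch using a made-up two-movie document in the same shape as playlist.xml:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical two-movie document in the same shape as playlist.xml
my $xml = <<'XML';
<playlist>
  <movie><title>Apollo 13</title></movie>
  <movie><title>Solaris</title></movie>
</playlist>
XML

my $dom = XML::LibXML->load_xml(string => $xml);

# List assignment keeps only the first matching <movie>
my ($first) = $dom->findnodes('/playlist/movie');

# Relative path: searched from the context element only
my @relative = map { $_->to_literal() } $first->findnodes('./title');

# Absolute path: searched from the top of the document, even though
# the method is called on a single <movie> element
my @absolute = map { $_->to_literal() }
               $first->findnodes('/playlist/movie/title');

print scalar(@relative), " relative, ", scalar(@absolute), " absolute\n";
# prints: 1 relative, 2 absolute
```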
This second script illustrates a common pattern when working with XML::LibXML:
- find ‘interesting’ elements using an XPath query starting with ‘/’ or ‘//’
- iterate through those elements in a foreach loop
- get additional data from child elements using XPath queries starting with ‘.’
Accessing attributes
When listing cast members in the main loop of the script above, this code ...
my $cast = join ', ', map {
$_->to_literal();
} $movie->findnodes('./cast/person/@name');
say 'Starring: ', $cast;
is used to transform this XML ...
<cast>
<person name="Matt Damon" role="Mark Watney" />
<person name="Jessica Chastain" role="Melissa Lewis" />
<person name="Kristen Wiig" role="Annie Montrose" />
</cast>
into this output:
Starring: Matt Damon, Jessica Chastain, Kristen Wiig
In an XPath expression, a name that starts with @
will match an attribute
rather than an element, so 'person/@name'
refers to an attribute called
name
on a <person>
element. In this case, the call to
findnodes('./cast/person/@name')
will return three DOM nodes representing
attribute values. The map block then transforms each one into a plain string
by calling to_literal(), just as we’ve seen for element nodes.
Another approach is to select the element with XPath and then call a DOM method on the element node to get the attribute value:
my $cast = join ', ', map {
$_->getAttribute('name');
} $movie->findnodes('./cast/person');
say 'Starring: ', $cast;
Attributes via tied hash
There’s a shortcut syntax that makes this even easier. Simply treat the element node as a hashref:
my $cast = join ', ', map {
$_->{name};
} $movie->findnodes('./cast/person');
say 'Starring: ', $cast;
You might be a bit wary of poking around directly inside the element object,
rather than using accessor methods. But don’t worry, that’s not what this
shortcut syntax is doing. Instead, every XML::LibXML::Element object returned from the
XPath query has been ‘tied’ using
XML::LibXML::AttributeHash so that hash lookups
‘inside’ the object actually get proxied to getAttribute()
method calls.
This technique is less efficient than calling getAttribute()
directly but
it is very convenient when you want to access more than one attribute of an
element or when you want to interpolate an attribute value into a string:
my $cast = join "\n", map {
" * $_->{name} (as $_->{role})";
} $movie->findnodes('./cast/person');
say "\nStarring:\n", $cast;
Which will produce this output:
Starring:
* Matt Damon (as Mark Watney)
* Jessica Chastain (as Melissa Lewis)
* Kristen Wiig (as Annie Montrose)
Note
Overloading ‘Element’ nodes to support tied hash access to attribute values was added in version 1.91 of XML::LibXML. If the examples above don’t work for you then it may be because you have a very old version installed.
Parsing Errors
One of the advantages of XML is that it has a few strict rules that every document must comply with to be considered “well-formed”. If a document is not well-formed, it should be rejected in its entirety and no part of the XML document content should be used. Examples of things that would cause a document to be not well-formed include:
- missing or mismatched closing tag
- missing or mismatched quotes around attribute values
- whitespace before the initial XML declaration section
- byte sequences that do not match the document’s declared character encoding
- any non-whitespace characters after the closing tag for the first top-level element
Like pretty much all XML parser modules, XML::LibXML
will throw an exception
if it encounters any violations of these rules. Since the whole of the XML
document is processed when load_xml()
is called, an error at any point in
the document will cause an exception to be raised.
If you wish to handle exceptions gracefully, you must use an eval
block or
one of the “try/catch” syntax extension modules to catch the error. For
example, this document contains an error:
<?xml version='1.0' encoding='UTF-8' standalone="yes" ?>
<book edition="2">
<title>Training Your Pet Ferret</title>
<authors>
<author>Gerry Bucsis</author>
<author>Barbara Somerville</author>
</authors>
<isbn>9780764142239</isnb>
<dimensions width="162.56mm" height="195.58mm" depth="10.16mm" pages="96" />
</book>
This script will attempt to parse the bad input:
#my $filename = 'book.xml';
my $filename = 'book-borkened.xml';
my $dom = eval {
XML::LibXML->load_xml(location => $filename);
};
if($@) {
# Log failure and exit
print "Error parsing '$filename':\n$@";
exit 0;
}
foreach my $author ($dom->findnodes('//author')) {
say $author->to_literal();
}
Instead of listing the authors, the script will produce this output:
Error parsing 'book-borkened.xml':
book-borkened.xml:8: parser error : Opening and ending tag mismatch: isbn line 8 and isnb
<isbn>9780764142239</isnb>
^
Note that although the script is only looking for <author>
elements and the
error in the <isbn>
element comes after all the <author>
elements, an
exception is still raised by the load_xml
call inside the eval block,
before the DOM has been fully constructed.
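If you need more than the error text, the exception raised by recent versions of XML::LibXML is normally an XML::LibXML::Error object with accessor methods such as line() and message(). The following is a sketch rather than a definitive recipe: the malformed input is made up, and the fallback branch assumes that some older versions may raise a plain string instead:

```perl
use 5.010;
use strict;
use warnings;
use Scalar::Util qw(blessed);
use XML::LibXML;

# Hypothetical malformed input: the closing tag does not match
my $bad_xml = "<book>\n<isbn>9780764142239</isnb>\n</book>\n";

my $dom = eval { XML::LibXML->load_xml(string => $bad_xml) };
my $err = $@;

my $summary;
if (blessed($err) && $err->isa('XML::LibXML::Error')) {
    # Structured access to the parser's complaint
    $summary = sprintf "line %d: %s", $err->line, $err->message;
}
elsif ($err) {
    # Assumed fallback for versions that raise a plain string
    $summary = "parse failed: $err";
}

say $summary // 'parsed OK';
```

As in the script above, $dom is left undefined when parsing fails, so no partial document is available.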
That’s it for the basic examples. The next topic will look more closely at XPath expressions.