A Basic Example

The first thing you’ll need is an XML document. The example programs in this section will use the playlist.xml file shown below. This file contains details of five different movies:

<playlist>
  <movie id="tt0112384">
    <title>Apollo 13</title>
    <director>Ron Howard</director>
    <release-date>1995-06-30</release-date>
    <mpaa-rating>PG</mpaa-rating>
    <running-time>140</running-time>
    <genre>adventure</genre>
    <genre>drama</genre>
    <cast>
      <person name="Tom Hanks" role="Jim Lovell" />
      <person name="Bill Paxton" role="Fred Haise" />
      <person name="Kevin Bacon" role="Jack Swigert" />
      <person name="Gary Sinise" role="Ken Mattingly" />
      <person name="Ed Harris" role="Gene Kranz" />
    </cast>
    <imdb-info url="http://www.imdb.com/title/tt0112384/">
      <synopsis>
        NASA must devise a strategy to return Apollo 13 to Earth safely
        after the spacecraft undergoes massive internal damage putting
        the lives of the three astronauts on board in jeopardy.
      </synopsis>
      <score>7.6</score>
    </imdb-info>
  </movie>
  <movie id="tt0307479">
    <title>Solaris</title>
    <director>Steven Soderbergh</director>
    <release-date>2002-11-27</release-date>
    <mpaa-rating>PG-13</mpaa-rating>
    <running-time>99</running-time>
    <genre>drama</genre>
    <genre>mystery</genre>
    <genre>romance</genre>
    <cast>
      <person name="George Clooney" role="Chris Kelvin" />
      <person name="Natascha McElhone" role="Rheya" />
      <person name="Ulrich Tukur" role="Gibarian" />
    </cast>
    <imdb-info url="http://www.imdb.com/title/tt0307479/">
      <synopsis>
        A troubled psychologist is sent to investigate the crew of an
        isolated research station orbiting a bizarre planet.
      </synopsis>
      <score>6.2</score>
    </imdb-info>
  </movie>
  <movie id="tt1731141">
    <title>Ender's Game</title>
    <director>Gavin Hood</director>
    <release-date>2013-11-01</release-date>
    <mpaa-rating>PG-13</mpaa-rating>
    <running-time>114</running-time>
    <genre>action</genre>
    <genre>scifi</genre>
    <cast>
      <person name="Asa Butterfield" role="Ender Wiggin" />
      <person name="Harrison Ford" role="Colonel Graff" />
      <person name="Hailee Steinfeld" role="Petra Arkanian" />
    </cast>
    <imdb-info url="http://www.imdb.com/title/tt1731141/">
      <synopsis>
        Young Ender Wiggin is recruited by the International Military
        to lead the fight against the Formics, a genocidal alien race
        which nearly annihilated the human race in a previous invasion.
      </synopsis>
      <score>6.7</score>
    </imdb-info>
  </movie>
  <movie id="tt0816692">
    <title>Interstellar</title>
    <director>Christopher Nolan</director>
    <release-date>2014-11-07</release-date>
    <mpaa-rating>PG-13</mpaa-rating>
    <running-time>169</running-time>
    <genre>adventure</genre>
    <genre>drama</genre>
    <genre>scifi</genre>
    <cast>
      <person name="Matthew McConaughey" role="Cooper" />
      <person name="Anne Hathaway" role="Brand" />
      <person name="Jessica Chastain" role="Murph" />
      <person name="Michael Caine" role="Professor Brand" />
    </cast>
    <imdb-info url="http://www.imdb.com/title/tt0816692/">
      <synopsis>
        A team of explorers travel through a wormhole in space in an
        attempt to ensure humanity's survival.
      </synopsis>
      <score>8.6</score>
    </imdb-info>
  </movie>
  <movie id="tt3659388">
    <title>The Martian</title>
    <director>Ridley Scott</director>
    <release-date>2015-10-02</release-date>
    <mpaa-rating>PG-13</mpaa-rating>
    <running-time>144</running-time>
    <genre>adventure</genre>
    <genre>drama</genre>
    <genre>scifi</genre>
    <cast>
      <person name="Matt Damon" role="Mark Watney" />
      <person name="Jessica Chastain" role="Melissa Lewis" />
      <person name="Kristen Wiig" role="Annie Montrose" />
    </cast>
    <imdb-info url="http://www.imdb.com/title/tt3659388/">
      <synopsis>
        During a manned mission to Mars, Astronaut Mark Watney is
        presumed dead after a fierce storm and left behind by his crew.
        But Watney has survived and finds himself stranded and alone on
        the hostile planet. With only meager supplies, he must draw upon
        his ingenuity, wit and spirit to subsist and find a way to
        signal to Earth that he is alive.
      </synopsis>
      <score>8.1</score>
    </imdb-info>
  </movie>
</playlist>

Note

Although this XML document contains details which came from the fabulous IMDb.com web site, the file structure was created specifically for this example and does not represent an actual API for querying movie details.

Once you have the sample XML document, you can use this script to extract and print the title of each movie, in the order they appear in the XML:

#!/usr/bin/perl

use 5.010;
use strict;
use warnings;

use XML::LibXML;

my $filename = 'playlist.xml';

my $dom = XML::LibXML->load_xml(location => $filename);

foreach my $title ($dom->findnodes('/playlist/movie/title')) {
    say $title->to_literal();
}

This script will produce the following output:

Apollo 13
Solaris
Ender's Game
Interstellar
The Martian

If we break the example down line-by-line we see that after a standard boilerplate section, the script loads the XML::LibXML module:

use XML::LibXML;

Next, the load_xml() class method is called to parse the XML file and return a document object:

my $dom = XML::LibXML->load_xml(location => $filename);

The $dom variable now contains an object representing all the elements of the XML document arranged in a tree structure known as a Document Object Model or ‘DOM’.
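To make the tree structure more concrete, here is a minimal sketch that walks the top of a DOM. It assumes a cut-down, inline copy of the playlist (so that it runs without playlist.xml) and uses the documentElement(), nodeName() and nonBlankChildNodes() methods:

```perl
#!/usr/bin/perl

use 5.010;
use strict;
use warnings;

use XML::LibXML;

# Inline stand-in for playlist.xml (an assumption made for this sketch,
# so that it runs without any external file).
my $xml = <<'END_XML';
<playlist>
  <movie id="tt0112384">
    <title>Apollo 13</title>
    <director>Ron Howard</director>
  </movie>
</playlist>
END_XML

my $dom  = XML::LibXML->load_xml(string => $xml);
my $root = $dom->documentElement();

say 'Document class: ', ref($dom);          # XML::LibXML::Document
say 'Root element:   ', $root->nodeName();  # playlist

# nonBlankChildNodes() skips the whitespace-only text nodes that sit
# between elements in indented XML.
foreach my $movie ($root->nonBlankChildNodes()) {
    say '  ', $movie->nodeName();
    say '    ', $_->nodeName() foreach $movie->nonBlankChildNodes();
}
```

The indented output mirrors the nesting of the document: one <movie> node under the root, with <title> and <director> nodes beneath it.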

Finally we get to the guts of the script where the findnodes() method is called to search the DOM for the elements we’re interested in and a foreach loop is used to iterate through the matching elements:

foreach my $title ($dom->findnodes('/playlist/movie/title')) {
    say $title->to_literal();
}

The findnodes() method takes one argument: an XPath expression. This is a string describing the location and characteristics of the elements we want to find. XPath is a query language and the way we use it to select elements from the DOM is similar to the way we use SQL to select records from a relational database. The next section (XPath Expressions) will include examples of more complex queries.

The findnodes() method returns a list of objects from the DOM that match the XPath expression. Each time through the loop, $title will contain an object representing the next matching element. This object provides a number of properties and methods that you can use to access the element and its attributes, as well as any text content and ‘child’ elements.

Inside the loop, this example simply calls the to_literal() method to get the text content of the element. The string returned by to_literal() will not include any of the attributes but will include the text content of any child elements.
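The following sketch illustrates that last point. It assumes an inline, shortened version of one <movie> element and calls to_literal() on the <imdb-info> element, which has both an attribute and child elements:

```perl
#!/usr/bin/perl

use 5.010;
use strict;
use warnings;

use XML::LibXML;

# Shortened inline copy of one movie (an assumption for this sketch).
my $xml = <<'END_XML';
<movie id="tt0307479">
  <title>Solaris</title>
  <imdb-info url="http://www.imdb.com/title/tt0307479/">
    <synopsis>A troubled psychologist is sent to a research station.</synopsis>
    <score>6.2</score>
  </imdb-info>
</movie>
END_XML

my $dom    = XML::LibXML->load_xml(string => $xml);
my ($info) = $dom->findnodes('/movie/imdb-info');

# The text of <synopsis> and <score> is included (along with the
# whitespace used for indentation); the url attribute is not.
my $text = $info->to_literal();
say "[$text]";
```

Because the whitespace between child elements is part of the text content, it is usually tidier to call to_literal() on the specific child element you want.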

Other XML sources

The first example script called XML::LibXML->load_xml() with the location argument set to the name of a file. The location argument also accepts a URL:

$dom = XML::LibXML->load_xml(location => 'http://techcrunch.com/feed/');

Note

Not all versions of libxml2 can retrieve documents over SSL/TLS. So if the URL is an ‘https’ URL (or if it redirects to one), you may need to use a module like LWP to retrieve the document and pass the response body to the XML parser as a string as shown below.

If you have the XML in a string, instead of location, use string:

$dom = XML::LibXML->load_xml(string => $xml_string);

Or, you can provide a Perl file handle to parse from an open file or socket, using IO:

$dom = XML::LibXML->load_xml(IO => $fh);

When providing a string or a file handle, it’s crucial that you do not decode the bytes of the source data (for example by using ':utf8' when opening a file). The underlying libxml2 library is written in C and expects to decode the bytes itself; it does not understand Perl’s character strings. If you have assembled your XML document by concatenating Perl character strings, you will need to encode it to a byte string (for example using Encode::encode_utf8()) and then pass the byte string to the parser.
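As a rough sketch of that last case, the script below assembles a document from a character string and encodes it before parsing. The movie title is sample data chosen only because it contains a non-ASCII character; it is not part of playlist.xml:

```perl
#!/usr/bin/perl

use 5.010;
use strict;
use warnings;

use Encode qw(encode_utf8);
use XML::LibXML;

# A Perl character string containing a non-ASCII character (e-acute).
# Sample data for illustration only.
my $title = "Am\x{e9}lie";

# Assemble the document as a character string ...
my $xml_chars = qq{<?xml version="1.0" encoding="UTF-8"?>\n}
              . qq{<movie><title>$title</title></movie>\n};

# ... then encode it to UTF-8 bytes to match the encoding declaration.
my $xml_bytes = encode_utf8($xml_chars);

my $dom          = XML::LibXML->load_xml(string => $xml_bytes);
my $parsed_title = $dom->findvalue('/movie/title');

# Values read back from the DOM are character strings again, so they
# need an encoding layer on output.
binmode STDOUT, ':encoding(UTF-8)';
say $parsed_title;
say 'Round trip OK' if $parsed_title eq $title;
```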

If you have enabled UTF-8 globally with something like this in your script:

use open ':encoding(utf8)';

Then you’ll need to turn off the encoding IO layers for any file handle that you pass to XML::LibXML:

open my $fh, '<', $filename or die "Cannot open '$filename': $!";
binmode $fh, ':raw';
$dom = XML::LibXML->load_xml(IO => $fh);

A more complex example

Now let’s look at a slightly more complex example. This script takes the same XML input and extracts more details from each <movie> element:

#!/usr/bin/perl

use 5.010;
use strict;
use warnings;

use XML::LibXML;

my $filename = 'playlist.xml';

my $dom = XML::LibXML->load_xml(location => $filename);

foreach my $movie ($dom->findnodes('//movie')) {
    say 'Title:    ', $movie->findvalue('./title');
    say 'Director: ', $movie->findvalue('./director');
    say 'Rating:   ', $movie->findvalue('./mpaa-rating');
    say 'Duration: ', $movie->findvalue('./running-time'), " minutes";
    my $cast = join ', ', map {
        $_->to_literal();
    } $movie->findnodes('./cast/person/@name');
    say 'Starring: ', $cast;
    say "";
}

This script will produce the following output:

Title:    Apollo 13
Director: Ron Howard
Rating:   PG
Duration: 140 minutes
Starring: Tom Hanks, Bill Paxton, Kevin Bacon, Gary Sinise, Ed Harris

Title:    Solaris
Director: Steven Soderbergh
Rating:   PG-13
Duration: 99 minutes
Starring: George Clooney, Natascha McElhone, Ulrich Tukur

Title:    Ender's Game
Director: Gavin Hood
Rating:   PG-13
Duration: 114 minutes
Starring: Asa Butterfield, Harrison Ford, Hailee Steinfeld

Title:    Interstellar
Director: Christopher Nolan
Rating:   PG-13
Duration: 169 minutes
Starring: Matthew McConaughey, Anne Hathaway, Jessica Chastain, Michael Caine

Title:    The Martian
Director: Ridley Scott
Rating:   PG-13
Duration: 144 minutes
Starring: Matt Damon, Jessica Chastain, Kristen Wiig

Let’s compare the main loop of the first script:

foreach my $title ($dom->findnodes('/playlist/movie/title')) {
    say $title->to_literal();
}

with the main loop of the second script:

foreach my $movie ($dom->findnodes('//movie')) {
    say 'Title:    ', $movie->findvalue('./title');
    say 'Director: ', $movie->findvalue('./director');
    say 'Rating:   ', $movie->findvalue('./mpaa-rating');
    say 'Duration: ', $movie->findvalue('./running-time'), " minutes";
    my $cast = join ', ', map {
        $_->to_literal();
    } $movie->findnodes('./cast/person/@name');
    say 'Starring: ', $cast;
    say "";
}

The structure of the main loop is very similar but the XPath expression passed to findnodes() is different in each case:

'/playlist/movie/title'
Will match every <title> element which is the child of ...
a <movie> element which is the child of ...
a <playlist> element which is ...
the top-level element in the document.

Or, to phrase it a different way, the search will start at the top of the document and look for a <playlist> element; if one is found, the search will continue for child <movie> elements; and for each one that is found the search will continue for child <title> elements.

'//movie'
Will match every <movie> element at any level of nesting.

In both cases, the XPath expression starts with a ‘/’ which means the search will start at the top of the document.
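In playlist.xml every <movie> is a direct child of <playlist>, so '/playlist/movie' and '//movie' would select the same five elements. The difference only shows up when elements are nested more deeply. The sketch below assumes a hypothetical <favourites> wrapper element (not present in the sample file) to demonstrate this:

```perl
#!/usr/bin/perl

use 5.010;
use strict;
use warnings;

use XML::LibXML;

# The <favourites> wrapper is hypothetical; it is here only to place one
# <movie> at a deeper level of nesting.
my $xml = <<'END_XML';
<playlist>
  <movie><title>Apollo 13</title></movie>
  <favourites>
    <movie><title>The Martian</title></movie>
  </favourites>
</playlist>
END_XML

my $dom = XML::LibXML->load_xml(string => $xml);

# Only <movie> elements that are direct children of <playlist>
my @children = $dom->findnodes('/playlist/movie');

# <movie> elements at any depth
my @anywhere = $dom->findnodes('//movie');

say '/playlist/movie matched: ', scalar(@children);   # 1
say '//movie matched:         ', scalar(@anywhere);   # 2
```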

Inside the second script’s loop are a number of calls to findvalue(). This is a handy shortcut method that is typically used when you expect the XPath expression to match exactly one node. It combines the functionality of findnodes() and to_literal() into a single method. So this code:

$movie->findvalue('./title');

is equivalent to:

$movie->findnodes('./title')->to_literal();
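One consequence of this equivalence is worth seeing in action: when the expression matches more than one node, the text of all the matches is run together. A small sketch, assuming an inline copy of part of one <movie> element:

```perl
#!/usr/bin/perl

use 5.010;
use strict;
use warnings;

use XML::LibXML;

# Inline excerpt of one movie (an assumption for this sketch).
my $xml = <<'END_XML';
<movie id="tt0112384">
  <title>Apollo 13</title>
  <genre>adventure</genre>
  <genre>drama</genre>
</movie>
END_XML

my $dom     = XML::LibXML->load_xml(string => $xml);
my ($movie) = $dom->findnodes('/movie');

# One match: the text of the single <title> element
my $title = $movie->findvalue('./title');
say $title;

# Two matches: the text of both <genre> elements, concatenated
my $genres = $movie->findvalue('./genre');
say $genres;                                  # adventuredrama

# For multiple matches, findnodes() keeps the values separate
my $genre_list = join ', ', map { $_->to_literal() } $movie->findnodes('./genre');
say $genre_list;                              # adventure, drama
```

This is why findvalue() is best kept for expressions that you expect to match a single node.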

There are a couple of other interesting differences with the XPath searches in the loop compared to previous examples. Firstly, the findvalue() method is being called on $movie (which represents one <movie> element) rather than on $dom (which represents the whole document). This means that the $movie element is the context element. Secondly, the XPath expression starts with a ‘.’ which means: start the search at the context element rather than at the top of the document.

This second script illustrates a common pattern when working with XML::LibXML:

  1. find ‘interesting’ elements using an XPath query starting with ‘/’ or ‘//’
  2. iterate through those elements in a foreach loop
  3. get additional data from child elements using XPath queries starting with ‘.’

Accessing attributes

When listing cast members in the main loop of the script above, this code ...

    my $cast = join ', ', map {
        $_->to_literal();
    } $movie->findnodes('./cast/person/@name');
    say 'Starring: ', $cast;

is used to transform this XML ...

<cast>
  <person name="Matt Damon" role="Mark Watney" />
  <person name="Jessica Chastain" role="Melissa Lewis" />
  <person name="Kristen Wiig" role="Annie Montrose" />
</cast>

into this output:

Starring: Matt Damon, Jessica Chastain, Kristen Wiig

In an XPath expression, a name that starts with @ will match an attribute rather than an element, so 'person/@name' refers to an attribute called name on a <person> element. In this case, the call to findnodes('./cast/person/@name') will return three DOM nodes representing attribute values which are then transformed into plain strings using to_literal(), as we’ve seen for element nodes, inside a map block.

Another approach is to select the element with XPath and then call a DOM method on the element node to get the attribute value:

    my $cast = join ', ', map {
        $_->getAttribute('name');
    } $movie->findnodes('./cast/person');
    say 'Starring: ', $cast;
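Working with the element node also makes it straightforward to cope with optional attributes. The following sketch assumes a hypothetical second <person> element with no role attribute (every <person> in playlist.xml has one) and uses hasAttribute() to check before reading the value:

```perl
#!/usr/bin/perl

use 5.010;
use strict;
use warnings;

use XML::LibXML;

# The second <person> is hypothetical sample data with no role attribute.
my $xml = <<'END_XML';
<cast>
  <person name="Tom Hanks" role="Jim Lovell" />
  <person name="An Extra" />
</cast>
END_XML

my $dom = XML::LibXML->load_xml(string => $xml);

my @lines;
foreach my $person ($dom->findnodes('/cast/person')) {
    my $name = $person->getAttribute('name');

    # Fall back to a placeholder when the attribute is absent
    my $role = $person->hasAttribute('role')
             ? $person->getAttribute('role')
             : '(uncredited)';

    push @lines, "$name as $role";
}
say foreach @lines;
```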

Attributes via tied hash

There’s a shortcut syntax you can use to make this even easier: simply treat the element node as a hashref:

    my $cast = join ', ', map {
        $_->{name};
    } $movie->findnodes('./cast/person');
    say 'Starring: ', $cast;

You might be a bit wary of poking around directly inside the element object, rather than using accessor methods. But don’t worry, that’s not what this shortcut syntax is doing. Instead, every XML::LibXML::Element object returned from the XPath query has been ‘tied’ using XML::LibXML::AttributeHash so that hash lookups ‘inside’ the object actually get proxied to getAttribute() method calls.

This technique is less efficient than calling getAttribute() directly but it is very convenient when you want to access more than one attribute of an element or when you want to interpolate an attribute value into a string:

    my $cast = join "\n", map {
        " * $_->{name} (as $_->{role})";
    } $movie->findnodes('./cast/person');
    say "\nStarring:\n", $cast;

Which will produce this output:

Starring:
 * Matt Damon (as Mark Watney)
 * Jessica Chastain (as Melissa Lewis)
 * Kristen Wiig (as Annie Montrose)

Note

Overloading ‘Element’ nodes to support tied hash access to attribute values was added in version 1.91 of XML::LibXML. If the examples above don’t work for you then it may be because you have a very old version installed.
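If you are unsure which version you have, a short script along these lines should report it. This is a sketch rather than an official diagnostic; it relies on Perl’s standard VERSION method and on the LIBXML_DOTTED_VERSION function provided by XML::LibXML:

```perl
#!/usr/bin/perl

use 5.010;
use strict;
use warnings;

use XML::LibXML;

# Version of the Perl module
my $module_version = XML::LibXML->VERSION;
say 'XML::LibXML version: ', $module_version;

# Version of the underlying libxml2 C library
say 'libxml2 version:     ', XML::LibXML::LIBXML_DOTTED_VERSION();

# VERSION() with an argument dies if the installed module is older
my $has_tied_hash = eval { XML::LibXML->VERSION(1.91); 1 } ? 1 : 0;
say 'Tied hash attribute access: ',
    $has_tied_hash ? 'should be available' : 'not available';
```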

Parsing Errors

One of the advantages of XML is that it has a few strict rules that every document must comply with to be considered “well-formed”. If a document is not well-formed, it should be rejected in its entirety and no part of the XML document content should be used. Examples of things that would cause a document to be not well-formed include:

  • missing or mismatched closing tag
  • missing or mismatched quotes around attribute values
  • whitespace before the initial XML declaration section
  • byte sequences that do not match the document’s declared character encoding
  • any non-whitespace characters after the closing tag for the first top-level element

Like pretty much all XML parser modules, XML::LibXML will throw an exception if it encounters any violations of these rules. Since the whole of the XML document is processed when load_xml() is called, an error at any point in the document will cause an exception to be raised.

If you wish to handle exceptions gracefully, you must use an eval block or one of the “try/catch” syntax extension modules to catch the error. For example, this document contains an error:

<?xml version='1.0' encoding='UTF-8' standalone="yes" ?>
<book edition="2">
  <title>Training Your Pet Ferret</title>
  <authors>
    <author>Gerry Bucsis</author>
    <author>Barbara Somerville</author>
  </authors>
  <isbn>9780764142239</isnb>
  <dimensions width="162.56mm" height="195.58mm" depth="10.16mm" pages="96" />
</book>

This script will attempt to parse the bad input:

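#!/usr/bin/perl

use 5.010;
use strict;
use warnings;

use XML::LibXML;

#my $filename = 'book.xml';
my $filename = 'book-borkened.xml';

# load_xml() dies on malformed input, so trap the exception with eval
my $dom = eval {
    XML::LibXML->load_xml(location => $filename);
};
if($@) {
    # Log failure and exit
    print "Error parsing '$filename':\n$@";
    exit 0;
}

foreach my $author ($dom->findnodes('//author')) {
    say $author->to_literal();
}

Instead of printing the author names, it will produce this output: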

Error parsing 'book-borkened.xml':
book-borkened.xml:8: parser error : Opening and ending tag mismatch: isbn line 8 and isnb
  <isbn>9780764142239</isnb>
                            ^

Note that although the script is only looking for <author> elements and the error in the <isbn> element comes after all the <author> elements, an exception is still raised by the load_xml call inside the eval block, before the DOM has been fully constructed.
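The exception is more than a plain string. When the failure comes from the parser, $@ normally holds an XML::LibXML::Error object, which stringifies to the message shown above and also offers accessor methods such as line() and message(). The sketch below assumes a shortened, inline version of the broken document so that it can run without book-borkened.xml:

```perl
#!/usr/bin/perl

use 5.010;
use strict;
use warnings;

use XML::LibXML;

# Shortened inline copy of the broken document (an assumption for this
# sketch). The closing </isnb> tag is deliberately misspelled.
my $bad_xml = <<'END_XML';
<book edition="2">
  <title>Training Your Pet Ferret</title>
  <isbn>9780764142239</isnb>
</book>
END_XML

my $dom = eval { XML::LibXML->load_xml(string => $bad_xml) };
my $err = $@;    # copy immediately, before anything can overwrite $@

if ($err) {
    if (ref($err) and $err->isa('XML::LibXML::Error')) {
        # Structured details from the error object
        printf "Parse failed at line %d: %s\n", $err->line(), $err->message();
    }
    else {
        # Some other kind of exception: fall back to the plain string
        print "Error: $err\n";
    }
}
```

Copying $@ into a lexical variable straight after the eval is a defensive habit: later code (including object destructors) can reset $@ before you get a chance to inspect it.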

That’s it for the basic examples. The next topic will look more closely at XPath expressions.