Working with HTML

If you ever need to extract text and data from HTML documents, the libxml parser and DOM provide very useful tools. You might imagine that libxml would only work with XHTML and even then only strictly well-formed documents. In fact, the parser has an HTML mode that handles unclosed tags like <img> and <br> and is even able to recover from parse errors caused by poorly formed HTML.

Let’s start with this mess of HTML tag soup:

<html><head><title>Example (Untidy) HTML Doc</title></head>
<body><p>Here's a paragraph with <i>poorly <b>nested</i></b>
tags.  Followed by a list of items &mdash; with unclosed tags</p>
<ul><li>red</li><li>orange<li>yellow</ul></body></html>

To read the file in, you’d use the load_html() method rather than load_xml(). You’ll almost certainly want to use the recover => 1 option to tell the parser to try to recover from parse errors and carry on to produce a DOM.

#!/usr/bin/perl

use 5.010;
use strict;
use warnings;

use XML::LibXML;

my $filename = 'untidy.html';

my $dom = XML::LibXML->load_html(
    location  => $filename,
    recover   => 1,
);

say $dom->toStringHTML();

When the DOM is serialised with toStringHTML(), some rudimentary formatting is applied automatically. Unfortunately there is no option to add indenting to the HTML output, so the result looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><title>Example (Untidy) HTML Doc</title></head>
<body>
<p>Here's a paragraph with <i>poorly <b>nested</b></i>
tags.  Followed by a list of items &mdash; with unclosed tags</p>
<ul>
<li>red</li>
<li>orange</li>
<li>yellow</li>
</ul>
</body>
</html>

While the document is being parsed, you’ll see messages like this on STDERR:

untidy.html:2: HTML parser error : Opening and ending tag mismatch: i and b
<body><p>Here's a paragraph with <i>poorly <b>nested</i></b>
                                                        ^
untidy.html:2: HTML parser error : Unexpected end tag : b
<body><p>Here's a paragraph with <i>poorly <b>nested</i></b>
                                                            ^

You can turn off the error output with the suppress_errors option:

my $dom = XML::LibXML->load_html(
    location        => $filename,
    recover         => 1,
    suppress_errors => 1,
);

That option doesn’t seem to work with all versions of XML::LibXML, so you may want to use a routine like this, which sends STDERR to /dev/null during parsing but still allows other output to STDERR once the parse function returns:

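use File::Spec;

sub parse_html_file {
    my($filename) = @_;

    # Redirect STDERR to the null device for the duration of this call only;
    # local() restores the original handle when the sub returns.
    local(*STDERR);
    open STDERR, '>>', File::Spec->devnull()
        or die "Unable to open null device: $!";
    return XML::LibXML->load_html(
        location        => $filename,
        recover         => 1,
        suppress_errors => 1,
    );
}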
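If the HTML is already in memory rather than in a file, load_html() also accepts a string option in place of location. The following is a minimal sketch of the same idea applied to a string; the sample markup and the parse_html_string() name are made up for illustration.

```perl
use 5.010;
use strict;
use warnings;

use XML::LibXML;
use File::Spec;

# Hypothetical sample markup: an unclosed <li> and badly nested <i>/<b> tags
my $html = '<ul><li>red<li><i>orange <b>yellow</i></b></ul>';

# Hypothetical helper: like parse_html_file(), but reads from a string
sub parse_html_string {
    my($source) = @_;

    # Silence parser messages for the duration of this call only
    local(*STDERR);
    open STDERR, '>>', File::Spec->devnull()
        or die "Unable to open null device: $!";
    return XML::LibXML->load_html(
        string          => $source,
        recover         => 1,
        suppress_errors => 1,
    );
}

my $dom   = parse_html_string($html);
my @items = $dom->findnodes('//li')->to_literal_list;
say scalar(@items), " list items found";
```

Because the parser runs in recover mode, the unclosed <li> is closed automatically and each item can be selected separately.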

Querying HTML with XPath

The main tool you’ll use for extracting data from HTML is the findnodes() method that was introduced in A Basic Example and XPath Expressions. For these examples, the source HTML comes from the CSS Zen Garden Project and is in the file css-zen-garden.html.

This script locates every <h3> element inside the <div> with an id attribute value of "zen-supporting":

my $filename = 'css-zen-garden.html';

my $dom = XML::LibXML->load_html(
    location        => $filename,
    recover         => 1,
    suppress_errors => 1,
);

my $xpath = '//div[@id="zen-supporting"]//h3';
say "$_" foreach $dom->findnodes($xpath)->to_literal_list;

Output:

So What is This About?
Participation
Benefits
Requirements

For a more complex example, the next script iterates through each <li> in the “Select a Design” section and extracts three items of information for each: the name of the design, the name of the designer, and a link to view the design. Once the information has been collected, it is dumped out in JSON format:

use XML::LibXML;
use URI::URL;
use JSON qw(to_json);

my $base_url = 'http://csszengarden.com/';
my $filename = 'css-zen-garden.html';

my $dom = XML::LibXML->load_html(
    location        => $filename,
    recover         => 1,
    suppress_errors => 1,
);

my @designs;
my $xpath = '//div[@id="design-selection"]//li';
foreach my $design ($dom->findnodes($xpath)) {
    my($name, $designer) = $design->findnodes('./a')->to_literal_list;
    my($url) = $design->findnodes('./a/@href')->to_literal_list;
    $url = URI::URL->new($url, $base_url)->abs;
    push @designs, {
        name      => $name,
        designer  => $designer,
        url       => "$url",
    };
}

say to_json(\@designs, {pretty => 1});

Output:

[
   {
      "designer" : "Andrew Lohman",
      "url" : "http://csszengarden.com/221/",
      "name" : "Mid Century Modern"
   },
   {
      "name" : "Garments",
      "url" : "http://csszengarden.com/220/",
      "designer" : "Dan Mall"
   },
   {
      "name" : "Steel",
      "designer" : "Steffen Knoeller",
      "url" : "http://csszengarden.com/219/"
   },
   {
      "designer" : "Trent Walton",
      "url" : "http://csszengarden.com/218/",
      "name" : "Apothecary"
   },
   {
      "name" : "Screen Filler",
      "designer" : "Elliot Jay Stocks",
      "url" : "http://csszengarden.com/217/"
   },
   {
      "name" : "Fountain Kiss",
      "designer" : "Jeremy Carlson",
      "url" : "http://csszengarden.com/216/"
   },
   {
      "name" : "A Robot Named Jimmy",
      "designer" : "meltmedia",
      "url" : "http://csszengarden.com/215/"
   },
   {
      "name" : "Verde Moderna",
      "designer" : "Dave Shea",
      "url" : "http://csszengarden.com/214/"
   }
]

In both these examples we were fortunate to be dealing with ‘semantic markup’ – where sections of the document could be readily identified using id attributes. If there were no id attributes, we could change the XPath expression to select using element text content instead:

my $xpath = '//h3[contains(.,"Select a Design")]/..//li';

This XPath expression first looks for an <h3> element that contains the text 'Select a Design'. It then uses /.. to find that element’s parent (a <div> in the example document) and then uses //li to find all <li> elements contained within the parent.
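To see the text-based expression in action without the full source file, here is a small sketch run against a cut-down document. The markup is a hypothetical stand-in for the real page; only the heading text and two design names are borrowed from the example above.

```perl
use 5.010;
use strict;
use warnings;

use XML::LibXML;

# Hypothetical markup: two sections with headings but no id attributes
my $html = <<'END_HTML';
<html><body>
<div><h3>Select a Design</h3><ul><li>Steel</li><li>Garments</li></ul></div>
<div><h3>Archives</h3><ul><li>Next Designs</li></ul></div>
</body></html>
END_HTML

my $dom = XML::LibXML->load_html(string => $html, recover => 1);

# Find the heading by its text, step up to its parent, then collect the <li>s
my $xpath   = '//h3[contains(.,"Select a Design")]/..//li';
my @designs = $dom->findnodes($xpath)->to_literal_list;
say $_ foreach @designs;
```

Only the items under the “Select a Design” heading are returned; the list in the other <div> is not a descendant of the matched heading’s parent.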

Another common problem is finding that although your XPath expressions do match the content you want, they also match content you don’t want, for example from a block of navigation links. In these cases you might identify a block of uninteresting content using findnodes() and then call removeChild() on its parent node to remove that whole section from the DOM before running your main XPath query. Because you’re only removing the nodes from the in-memory copy of the document, the original source remains unchanged. This technique is used in the spell-check script that finds typos in this document.
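The removal technique might look something like the sketch below. The document and its id values ("nav" and "content") are invented for illustration, and this is not the spell-check script itself.

```perl
use 5.010;
use strict;
use warnings;

use XML::LibXML;

# Hypothetical page: a navigation block plus the content we actually want
my $html = <<'END_HTML';
<html><body>
<div id="nav"><ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul></div>
<div id="content"><ul><li><a href="/one">First article</a></li></ul></div>
</body></html>
END_HTML

my $dom = XML::LibXML->load_html(string => $html, recover => 1);

# Detach each navigation block from the in-memory DOM via its parent node
foreach my $nav ($dom->findnodes('//div[@id="nav"]')) {
    $nav->parentNode->removeChild($nav);
}

# A broad query now only sees links outside the removed block
my @links = $dom->findnodes('//li/a')->to_literal_list;
say $_ foreach @links;
```

Note that removeChild() is called on the parent of the node being removed, which is why the loop goes through parentNode.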

Matching class names

An HTML element can have multiple classes applied to it by using a space-separated list in the class attribute. Some care is needed to ensure your XPath expressions always match one whole class name from the list. For example, if you were trying to match <li> elements with the class member, you might try something like:

$xpath = '//li[contains(@class, "member")]';

which will match an element like this:

    <li class="member">Catherine Trenton</li>

but it will also match an element like this:

    <li class="non-member">Daniel Ifflehirst</li>

The most common way to solve the problem is to add an extra space to the beginning and the end of the class attribute value, like this: concat(" ", @class, " "), and then add spaces around the class name we’re looking for: " member ". This gives an expression like this:

$xpath = '//li[contains(concat(" ", @class, " "), " member ")]';
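To compare the two expressions side by side, here is a short sketch that runs both against the same list. The third list item, with two classes, is a hypothetical addition to show that multi-class values still match.

```perl
use 5.010;
use strict;
use warnings;

use XML::LibXML;

# The third <li> is a made-up entry with more than one class
my $html = <<'END_HTML';
<ul>
<li class="member">Catherine Trenton</li>
<li class="non-member">Daniel Ifflehirst</li>
<li class="senior member">Example Person</li>
</ul>
END_HTML

my $dom = XML::LibXML->load_html(string => $html, recover => 1);

# Substring match: also picks up "non-member"
my $naive_xpath  = '//li[contains(@class, "member")]';
# Space-padded match: only whole class names count
my $padded_xpath = '//li[contains(concat(" ", @class, " "), " member ")]';

my @naive  = $dom->findnodes($naive_xpath)->to_literal_list;
my @padded = $dom->findnodes($padded_xpath)->to_literal_list;

say "naive:  ", join(', ', @naive);
say "padded: ", join(', ', @padded);
```

The naive expression matches all three items, while the padded version skips the non-member entry but still matches the item that has member as one of two classes.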

Using CSS-style selectors

The XPath expression in the last example is an effective way to select elements by class name, but the syntax is very unwieldy compared to CSS selectors. For example, the CSS selector to match elements with the class name member would simply be: .member

Wouldn’t it be great if there was a way to provide a CSS selector and have it converted into an XPath expression that you could pass to findnodes()? Well it turns out that’s exactly what the HTML::Selector::XPath module does:

use HTML::Selector::XPath qw(selector_to_xpath);

sub find_by_css {
    my($dom, $selector) = @_;
    my $xpath = selector_to_xpath($selector);
    return $dom->findnodes($xpath);
}

Some example inputs (“Selector”) and outputs (“XPath”):

Selector: #zen-supporting h3
XPath:    //*[@id='zen-supporting']//h3

Selector: .designer-name
XPath:    //*[contains(concat(' ', normalize-space(@class), ' '), ' designer-name ')]

Selector: .preamble abbr
XPath:    //*[contains(concat(' ', normalize-space(@class), ' '), ' preamble ')]//abbr

Selector: .preamble h3, .requirements h3
XPath:    //*[contains(concat(' ', normalize-space(@class), ' '), ' preamble ')]//h3 | //*[contains(concat(' ', normalize-space(@class), ' '), ' requirements ')]//h3
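As a usage sketch, the conversion can be tried on the member list from the previous section. This assumes HTML::Selector::XPath is installed from CPAN; the li.member selector and the sample markup are chosen purely for illustration.

```perl
use 5.010;
use strict;
use warnings;

use XML::LibXML;
use HTML::Selector::XPath qw(selector_to_xpath);

my $html = <<'END_HTML';
<ul>
<li class="member">Catherine Trenton</li>
<li class="non-member">Daniel Ifflehirst</li>
</ul>
END_HTML

my $dom = XML::LibXML->load_html(string => $html, recover => 1);

# Convert the CSS selector, show the generated XPath, then run the query
my $xpath = selector_to_xpath('li.member');
say "XPath: $xpath";

my @names = $dom->findnodes($xpath)->to_literal_list;
say $_ foreach @names;
```

Printing the generated expression before using it is a handy way to check what a selector will actually match.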