Working with HTML
If you ever need to extract text and data from HTML documents, the libxml
parser and DOM provide very useful tools. You might imagine that libxml
would only work with XHTML and even then only strictly well-formed documents.
In fact, the parser has an HTML mode that handles unclosed tags like <img> and <br>
and is even able to recover from parse errors caused by poorly formed HTML.
Let’s start with this mess of HTML tag soup:
<html><head><title>Example (Untidy) HTML Doc</title></head>
<body><p>Here's a paragraph with <i>poorly <b>nested</i></b>
tags. Followed by a list of items — with unclosed tags</p>
<ul><li>red</li><li>orange<li>yellow</ul></body></html>
To read the file in, you'd use the load_html() method rather than load_xml().
You'll almost certainly want to use the recover => 1 option to tell the parser
to try to recover from parse errors and carry on to produce a DOM.
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
use XML::LibXML;
my $filename = 'untidy.html';
my $dom = XML::LibXML->load_html(
    location => $filename,
    recover  => 1,
);
say $dom->toStringHTML();
When the DOM is serialised with toStringHTML(), some rudimentary formatting
is applied automatically. Unfortunately there is no option to add indenting
to the HTML output:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><title>Example (Untidy) HTML Doc</title></head>
<body>
<p>Here's a paragraph with <i>poorly <b>nested</b></i>
tags. Followed by a list of items — with unclosed tags</p>
<ul>
<li>red</li>
<li>orange</li>
<li>yellow</li>
</ul>
</body>
</html>
While the document is being parsed, you’ll see messages like this on STDERR:
untidy.html:2: HTML parser error : Opening and ending tag mismatch: i and b
<body><p>Here's a paragraph with <i>poorly <b>nested</i></b>
^
untidy.html:2: HTML parser error : Unexpected end tag : b
<body><p>Here's a paragraph with <i>poorly <b>nested</i></b>
^
You can turn off the error output with the suppress_errors option:
my $dom = XML::LibXML->load_html(
    location        => $filename,
    recover         => 1,
    suppress_errors => 1,
);
That option doesn't seem to work with all versions of XML::LibXML, so you
may want to use a routine like this that sends STDERR to /dev/null during
parsing, but still allows other output to STDERR when the parse function
returns:
use File::Spec;
sub parse_html_file {
    my($filename) = @_;
    # Redirect STDERR only until this sub returns; local() restores it afterwards
    local(*STDERR);
    open STDERR, '>>', File::Spec->devnull();
    return XML::LibXML->load_html(
        location        => $filename,
        recover         => 1,
        suppress_errors => 1,
    );
}
Querying HTML with XPath
The main tool you'll use for extracting data from HTML is the findnodes()
method that was introduced in A Basic Example and XPath Expressions. For these
examples, the source HTML comes from the CSS Zen Garden Project and is in the
file css-zen-garden.html.
This script locates every <h3> element inside the <div> with an id attribute
value of "zen-supporting":
my $filename = 'css-zen-garden.html';
my $dom = XML::LibXML->load_html(
    location        => $filename,
    recover         => 1,
    suppress_errors => 1,
);
my $xpath = '//div[@id="zen-supporting"]//h3';
say "$_" foreach $dom->findnodes($xpath)->to_literal_list;
Output:
So What is This About?
Participation
Benefits
Requirements
For a more complex example, the next script iterates through each <li> in
the "Select a Design" section and extracts three items of information for each:
the name of the design, the name of the designer, and a link to view the
design. Once the information has been collected, it is dumped out in JSON
format:
use XML::LibXML;
use URI::URL;
use JSON qw(to_json);
my $base_url = 'http://csszengarden.com/';
my $filename = 'css-zen-garden.html';
my $dom = XML::LibXML->load_html(
    location        => $filename,
    recover         => 1,
    suppress_errors => 1,
);
my @designs;
my $xpath = '//div[@id="design-selection"]//li';
foreach my $design ($dom->findnodes($xpath)) {
    my($name, $designer) = $design->findnodes('./a')->to_literal_list;
    my($url) = $design->findnodes('./a/@href')->to_literal_list;
    $url = URI::URL->new($url, $base_url)->abs;
    push @designs, {
        name     => $name,
        designer => $designer,
        url      => "$url",
    };
}
say to_json(\@designs, {pretty => 1});
Output:
[
   {
      "designer" : "Andrew Lohman",
      "url" : "http://csszengarden.com/221/",
      "name" : "Mid Century Modern"
   },
   {
      "name" : "Garments",
      "url" : "http://csszengarden.com/220/",
      "designer" : "Dan Mall"
   },
   {
      "name" : "Steel",
      "designer" : "Steffen Knoeller",
      "url" : "http://csszengarden.com/219/"
   },
   {
      "designer" : "Trent Walton",
      "url" : "http://csszengarden.com/218/",
      "name" : "Apothecary"
   },
   {
      "name" : "Screen Filler",
      "designer" : "Elliot Jay Stocks",
      "url" : "http://csszengarden.com/217/"
   },
   {
      "name" : "Fountain Kiss",
      "designer" : "Jeremy Carlson",
      "url" : "http://csszengarden.com/216/"
   },
   {
      "name" : "A Robot Named Jimmy",
      "designer" : "meltmedia",
      "url" : "http://csszengarden.com/215/"
   },
   {
      "name" : "Verde Moderna",
      "designer" : "Dave Shea",
      "url" : "http://csszengarden.com/214/"
   }
]
In both these examples we were fortunate to be dealing with 'semantic markup',
where sections of the document could be readily identified using id
attributes. If there were no id attributes, we could change the XPath
expression to select using element text content instead:
my $xpath = '//h3[contains(.,"Select a Design")]/..//li';
This XPath expression first looks for an <h3> element that contains the
text 'Select a Design'. It then uses /.. to find that element's
parent (a <div> in the example document) and then uses //li to find
all <li> elements contained within the parent.
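As a minimal sketch of how that expression behaves, the following script runs it against a small made-up HTML fragment. The markup and the list items are illustrative assumptions and are not taken from the CSS Zen Garden page:

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical markup for illustration only: two sections with no id attributes
my $html = <<'HTML';
<html><body>
<div><h3>Select a Design:</h3>
<ul><li>First design</li><li>Second design</li></ul></div>
<div><h3>Archives:</h3>
<ul><li>Older designs</li></ul></div>
</body></html>
HTML

my $dom = XML::LibXML->load_html(
    string          => $html,
    recover         => 1,
    suppress_errors => 1,
);

# Find the heading by its text, step up to its parent,
# then collect every <li> beneath that parent
my $xpath = '//h3[contains(.,"Select a Design")]/..//li';
my @items = $dom->findnodes($xpath)->to_literal_list;

print "$_\n" foreach @items;
```

Only the two items under the matching heading should be printed; the <li> in the second <div> has a different parent and is not selected.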
Another common problem is finding that although your XPath expressions do match
the content you want, they also match content you don't want, for example
from a block of navigation links. In these cases you might identify a block of
uninteresting content using findnodes() and then use removeChild() to
remove that whole section from the DOM before running your main
XPath query. Because you're only removing the nodes from the in-memory copy
of the document, the original source remains unchanged. The spell-check script
that looks for typos in this document uses this technique.
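The pruning approach might look something like the sketch below. The HTML fragment, the "nav" id and the link text are hypothetical and exist only to show the sequence of calls: find the unwanted block, detach it from its parent, then run the main query.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical page: one navigation block and one content block
my $html = <<'HTML';
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
<div id="content"><p>Read the <a href="/article">Article</a></p></div>
</body></html>
HTML

my $dom = XML::LibXML->load_html(
    string          => $html,
    recover         => 1,
    suppress_errors => 1,
);

# Detach every navigation block from the in-memory DOM
foreach my $node ($dom->findnodes('//div[@id="nav"]')) {
    $node->parentNode->removeChild($node);
}

# The main query now only sees links outside the removed block
my @links = $dom->findnodes('//a')->to_literal_list;

print "$_\n" foreach @links;
```

Note that removeChild() is called on the parent of the node being removed, which is why the sketch goes through parentNode.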
Matching class names
An HTML element can have multiple classes applied to it by using a
space-separated list in the class attribute. Some care is needed to ensure
your XPath expressions always match one whole class name from the list. For
example, if you were trying to match <li> elements with the class member,
you might try something like:
$xpath = '//li[contains(@class, "member")]';
which will match an element like this:
<li class="member">Catherine Trenton</li>
but it will also match an element like this:
<li class="non-member">Daniel Ifflehirst</li>
The most common way to solve the problem is to add an extra space to the
beginning and the end of the class attribute value like this:
concat(" ", @class, " ")
and then add spaces around the class name we're looking for: ' member '.
This gives an expression like this:
$xpath = '//li[contains(concat(" ", @class, " "), " member ")]';
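To see the difference between the two expressions, here is a small sketch that runs both against the same list. The third <li>, with two classes and the name "Example Person", is a made-up addition to the two elements shown above:

```perl
use strict;
use warnings;
use XML::LibXML;

# The third item is hypothetical, added to show a multi-class attribute
my $html = <<'HTML';
<html><body><ul>
<li class="member">Catherine Trenton</li>
<li class="non-member">Daniel Ifflehirst</li>
<li class="staff member">Example Person</li>
</ul></body></html>
HTML

my $dom = XML::LibXML->load_html(
    string          => $html,
    recover         => 1,
    suppress_errors => 1,
);

# Substring match: also catches "non-member"
my $naive  = '//li[contains(@class, "member")]';
# Space-padded match: only whole class names
my $padded = '//li[contains(concat(" ", @class, " "), " member ")]';

my @naive_matches  = $dom->findnodes($naive)->to_literal_list;
my @padded_matches = $dom->findnodes($padded)->to_literal_list;

printf "naive:  %s\n", join(', ', @naive_matches);
printf "padded: %s\n", join(', ', @padded_matches);
```

The naive expression should list all three names, while the padded expression should skip the "non-member" item but still match the element that carries member alongside another class.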
Using CSS-style selectors
The XPath expression in the last example is an effective way to select elements
by class name, but the syntax is very unwieldy compared to CSS selectors. For
example, the CSS selector to match elements with the class name member
would simply be: .member
Wouldn't it be great if there was a way to provide a CSS selector and have it
converted into an XPath expression that you could pass to findnodes()?
Well, it turns out that's exactly what the HTML::Selector::XPath module does:
use HTML::Selector::XPath qw(selector_to_xpath);
sub find_by_css {
    my($dom, $selector) = @_;
    my $xpath = selector_to_xpath($selector);
    return $dom->findnodes($xpath);
}
Some example inputs (“Selector”) and outputs (“XPath”):
Selector: #zen-supporting h3
XPath: //*[@id='zen-supporting']//h3
Selector: .designer-name
XPath: //*[contains(concat(' ', normalize-space(@class), ' '), ' designer-name ')]
Selector: .preamble abbr
XPath: //*[contains(concat(' ', normalize-space(@class), ' '), ' preamble ')]//abbr
Selector: .preamble h3, .requirements h3
XPath: //*[contains(concat(' ', normalize-space(@class), ' '), ' preamble ')]//h3 | //*[contains(concat(' ', normalize-space(@class), ' '), ' requirements ')]//h3
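As a usage sketch, the find_by_css() helper from above can be applied to the class-matching problem from the previous section. The HTML fragment is a made-up example (the third list item is hypothetical), and the script assumes HTML::Selector::XPath is installed:

```perl
use strict;
use warnings;
use XML::LibXML;
use HTML::Selector::XPath qw(selector_to_xpath);

# Same helper as shown earlier: translate the selector, then query the DOM
sub find_by_css {
    my($dom, $selector) = @_;
    my $xpath = selector_to_xpath($selector);
    return $dom->findnodes($xpath);
}

# Hypothetical markup for illustration only
my $html = <<'HTML';
<html><body><ul>
<li class="member">Catherine Trenton</li>
<li class="non-member">Daniel Ifflehirst</li>
<li class="staff member">Example Person</li>
</ul></body></html>
HTML

my $dom = XML::LibXML->load_html(
    string          => $html,
    recover         => 1,
    suppress_errors => 1,
);

# Scalar context gives a NodeList object, so to_literal_list() is available
my $nodes = find_by_css($dom, 'li.member');
my @names = $nodes->to_literal_list;

print "$_\n" foreach @names;
```

Because the generated XPath pads the class attribute with spaces, the selector li.member should pick up the two elements that have member as a whole class name and leave out the non-member one.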