.. highlight:: none
   :linenothreshold: 1

Working with HTML
=================
If you ever need to extract text and data from HTML documents, the ``libxml``
parser and DOM provide very useful tools. You might imagine that ``libxml``
would only work with XHTML and even then only strictly well-formed documents.
In fact, the parser has an HTML mode that handles unclosed tags like ``<img>``
and ``<br>`` and is even able to recover from parse errors caused by poorly
formed HTML.

Let's start with this mess of HTML tag soup:

.. literalinclude:: /code/untidy.html
   :language: none

To read the file in, you'd use the ``load_html()`` method rather than
``load_xml()``. You'll almost certainly want to use the ``recover => 1``
option to tell the parser to try to recover from parse errors and carry on to
produce a DOM.

.. literalinclude:: /code/500-html-tidy.pl
   :language: perl

When the DOM is serialised with ``toStringHTML()``, some rudimentary formatting
is applied automatically. Unfortunately there is no option to add indenting
to the HTML output:

.. literalinclude:: /_output/500-html-tidy.pl-out
   :language: none

While the document is being parsed, you'll see messages like this on STDERR:

.. literalinclude:: /_output/500-html-tidy.pl-err
   :language: none

You can turn off the error output with the ``suppress_errors`` option:

.. literalinclude:: /code/501-html-tidy-no-err.pl
   :language: perl
   :lines: 11-15

That option doesn't seem to work with all versions of ``XML::LibXML``, so you
may want to use a routine like this one, which sends STDERR to ``/dev/null``
during parsing but still allows other output to STDERR once the parse function
returns:

.. literalinclude:: /code/510-html-no-stderr.pl
   :language: perl
   :lines: 7,17-28

Querying HTML with XPath
------------------------
The main tool you'll use for extracting data from HTML is the ``findnodes()``
method that was introduced in :doc:`basics` and :doc:`xpath`. For these
examples, the source HTML comes from the CSS Zen Garden Project and is in the
file :download:`css-zen-garden.html`.

This script locates every ``<h3>`` element inside the ``<div>`` with an ``id``
attribute value of ``"zen-supporting"``:

.. literalinclude:: /code/520-html-xpath-simple.pl
   :language: perl
   :lines: 9-18

Output:

.. literalinclude:: /_output/520-html-xpath-simple.pl-out
   :language: none

For a more complex example, the next script iterates through each ``<li>`` in
the "Select a Design" section and extracts three items of information for each:
the name of the design, the name of the designer, and a link to view the
design. Once the information has been collected, it is dumped out in JSON
format:

.. literalinclude:: /code/530-html-xpath-complex.pl
   :language: perl
   :lines: 7-33

Output:

.. literalinclude:: /_output/530-html-xpath-complex.pl-out
   :language: json

In both these examples we were fortunate to be dealing with 'semantic markup'
-- where sections of the document could be readily identified using ``id``
attributes. If there were no ``id`` attributes, we could change the XPath
expression to select using element text content instead:

.. literalinclude:: /code/531-html-xpath-no-semantic.pl
   :language: perl
   :lines: 21

This XPath expression first looks for an ``<h3>`` element that contains the
text ``'Select a Design'``. It then uses ``/..`` to find that element's
parent (a ``<div>`` in the example document) and then uses ``//li`` to find
all ``<li>`` elements contained within the parent.

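
As a self-contained sketch of this text-based approach, the snippet below runs
an expression of the same shape against a small inline document instead of the
CSS Zen Garden file. The markup and the variable names are made up for
illustration:

.. code-block:: perl

   use strict;
   use warnings;
   use XML::LibXML;

   # Hypothetical markup: no id attributes, only a heading to anchor on
   my $html = q{
   <html><body>
     <div>
       <h3>Select a Design:</h3>
       <ul><li>Design One</li><li>Design Two</li></ul>
     </div>
     <div>
       <h3>Archives</h3>
       <ul><li>Older designs</li></ul>
     </div>
   </body></html>
   };

   my $dom = XML::LibXML->load_html(string => $html, recover => 1);

   # Find the heading by its text, step up to its parent, then collect
   # every <li> anywhere below that parent
   my $xpath = '//h3[contains(., "Select a Design")]/..//li';

   my @designs = map { $_->to_literal() } $dom->findnodes($xpath);
   print "$_\n" for @designs;

Only the list items that share a parent with the matching heading are
returned; the items under the other heading are left out.
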
Another common problem is finding that although your XPath expressions do match
the content you want, they also match content you don't want -- for example
from a block of navigation links. In these cases you might identify a block of
uninteresting content using ``findnodes()`` and then use ``removeChild()`` to
remove that whole section from the DOM before running your main
XPath query. Because you're only removing the nodes from the in-memory copy
of the document, the original source remains unchanged. This technique is
used in the spell-check script that finds typos in this document.

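
A minimal sketch of the pruning technique follows. It assumes the unwanted
navigation links live in a ``<ul>`` with the class ``nav``; that markup and the
variable names are hypothetical and are not taken from the spell-check script:

.. code-block:: perl

   use strict;
   use warnings;
   use XML::LibXML;

   # Hypothetical page: a navigation list plus the content we care about
   my $html = q{
   <html><body>
     <ul class="nav">
       <li><a href="/">Home</a></li>
       <li><a href="/about">About</a></li>
     </ul>
     <div id="content">
       <p>See the <a href="/docs">documentation</a>.</p>
     </div>
   </body></html>
   };

   my $dom = XML::LibXML->load_html(string => $html, recover => 1);

   # Detach each navigation block from its parent node.  Only the
   # in-memory DOM changes; the source string is untouched.
   foreach my $nav ($dom->findnodes('//ul[@class="nav"]')) {
       $nav->parentNode->removeChild($nav);
   }

   # The main query now sees only the links that remain
   my @links = map { $_->getAttribute('href') } $dom->findnodes('//a');
   print "$_\n" for @links;

Calling ``removeChild()`` on the parent of each matched node keeps the loop
independent of where the unwanted block sits in the document.
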
Matching class names
--------------------
An HTML element can have multiple classes applied to it by using a
space-separated list in the ``class`` attribute. Some care is needed to ensure
your XPath expressions always match one whole class name from the list. For
example, if you were trying to match elements with the class
``member``, you might try something like:

.. literalinclude:: /code/540-html-xpath-classes.pl
   :language: perl
   :lines: 18

which will match an element like this:

.. literalinclude:: /code/people.html
   :language: html
   :lines: 9

but it will also match an element like this:

.. literalinclude:: /code/people.html
   :language: html
   :lines: 10

The most common way to solve the problem is to add an extra space to the
beginning and the end of the ``class`` attribute value, like this:
``concat(" ", @class, " ")``, and then add spaces around the class name we're
looking for: ``' member '``. That gives an expression like this:

.. literalinclude:: /code/540-html-xpath-classes.pl
   :language: perl
   :lines: 25

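
To see the difference between the two approaches side by side, here is a small
sketch that runs both kinds of expression against the same inline markup. The
``<li>`` elements, the names, and the class values are invented for this
example and are not the contents of ``people.html``:

.. code-block:: perl

   use strict;
   use warnings;
   use XML::LibXML;

   # Hypothetical list: two real members and one near miss
   my $html = q{
   <html><body><ul>
     <li class="member">Alice</li>
     <li class="senior member">Bob</li>
     <li class="non-member">Carol</li>
   </ul></body></html>
   };

   my $dom = XML::LibXML->load_html(string => $html, recover => 1);

   # Substring test: also matches "non-member"
   my $naive = '//li[contains(@class, "member")]';

   # Space-padded test: matches only the whole class name "member"
   my $exact = '//li[contains(concat(" ", @class, " "), " member ")]';

   my @naive_hits = map { $_->to_literal() } $dom->findnodes($naive);
   my @exact_hits = map { $_->to_literal() } $dom->findnodes($exact);

   print "naive: @naive_hits\n";
   print "exact: @exact_hits\n";

The substring test picks up all three items, while the padded version returns
only the two whose class list contains ``member`` as a complete name.
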
Using CSS-style selectors
-------------------------
The XPath expression in the last example is an effective way to select elements
by class name, but the syntax is very unwieldy compared to CSS selectors. For
example, the CSS selector to match elements with the class name ``member``
would simply be: ``.member``
Wouldn't it be great if there was a way to provide a CSS selector and have it
converted into an XPath expression that you could pass to ``findnodes()``?
Well, it turns out that's exactly what the ``HTML::Selector::XPath``
module does:

.. literalinclude:: /code/580-html-css-selectors.pl
   :language: perl
   :lines: 8-9,37-41

Some example inputs ("Selector") and outputs ("XPath"):

.. literalinclude:: /_output/580-html-css-selectors.pl-out
   :language: none

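
The generated expression can be passed straight to ``findnodes()``. The
following sketch assumes ``HTML::Selector::XPath`` is installed from CPAN and
uses its exported ``selector_to_xpath()`` function; the sample markup and
variable names are made up for illustration:

.. code-block:: perl

   use strict;
   use warnings;
   use XML::LibXML;
   use HTML::Selector::XPath qw(selector_to_xpath);

   # Hypothetical list: two real members and one near miss
   my $html = q{
   <html><body><ul>
     <li class="member">Alice</li>
     <li class="senior member">Bob</li>
     <li class="non-member">Carol</li>
   </ul></body></html>
   };

   my $dom = XML::LibXML->load_html(string => $html, recover => 1);

   # Translate the CSS selector, then hand the result to findnodes()
   my $xpath   = selector_to_xpath('li.member');
   my @members = map { $_->to_literal() } $dom->findnodes($xpath);

   print "XPath: $xpath\n";
   print "$_\n" for @members;

Printing the generated XPath alongside the matches is a handy way to check
what a selector was translated into before relying on it.
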