.. highlight:: none :linenothreshold: 1 The Document Object Model ========================= The :doc:`basic examples ` section introduced the ``findnodes()`` method and XPath expressions for extracting parts of an XML document. For most applications, that's pretty much all you need, but sometimes it's necessary to use lower-level methods and to understand the relationships between different parts of the document. The XML::LibXML module implements Perl bindings for the `W3C Document Object Model `_. The W3C DOM defines object classes, properties and methods for querying and manipulating the different parts of an XML (or HTML) document. In the Perl implementation, object properties are exposed via accessor methods. Let's start our exploration of the DOM with a simple XML document which describes `a book `_ - :download:`book.xml ` .. literalinclude:: /code/book.xml :language: xml :linenos: When you ask XML::LibXML to parse the document, it creates an object to represent each part of the document and assembles those objects into a hierarchy as shown here: .. figure:: ../images/dom.png :class: illustration :alt: An XML document represented as a Document Object Model A simplified representation of the Document Object Model. The source XML document has a ```` element which contains four other elements: ````, ``<authors>``, ``<isbn>`` and ``<dimensions>``. The ``<authors>`` element in turn contains two ``<author>`` elements. The hierarchy in the picture shows us that ``<book>`` has four "child" elements. Similarly, ``<authors>`` has two child elements and one "parent" element (``<book>``). Five of the elements have no child elements but four of them do contain text content and one has some attributes. The 'Document' object --------------------- When you parse a document with XML::LibXML the parser returns a 'Document' object - represented in yellow in the picture above. The reference documentation for the `XML::LibXML::Document <https://metacpan.org/pod/XML::LibXML::Document>`_ class lists methods you can use to interact with the document. The 'Document' class inherits from the 'Node' class so you'll also need to refer to the docs for `XML::LibXML::Node <https://metacpan.org/pod/XML::LibXML::Node>`_ as well. .. literalinclude:: /code/200-dom-document.pl :language: perl :lines: 9-12 Output: .. literalinclude:: /_output/200-dom-document.pl-out :language: none :lines: 1-2 The document object also provides methods you can use to extract information from the XML declaration section - the very first line of the source XML, which precedes the ``<book>`` element: .. literalinclude:: /code/200-dom-document.pl :language: perl :lines: 13-16 Output: .. literalinclude:: /_output/200-dom-document.pl-out :language: none :lines: 3-5 You can serialise a whole DOM back out to XML by calling the ``toString()`` method on the document object: .. literalinclude:: /code/200-dom-document.pl :language: perl :lines: 18 The document class also overrides the stringification operator, so if you simply treat the object as a string and print it out you'll also get the serialised XML: .. literalinclude:: /code/200-dom-document.pl :language: perl :lines: 20 'Element' objects ----------------- The blue boxes in the picture represent 'Element' nodes. The reference documentation for the `XML::LibXML::Element <https://metacpan.org/pod/XML::LibXML::Element>`_ class lists a number of methods, but like the 'Document' class, many more methods are inherited from `XML::LibXML::Node <https://metacpan.org/pod/XML::LibXML::Node>`_. Every XML document has one single top-level element known as the "document element" that encloses all the other elements - in our example it's the ``<book>`` element. You can retrieve this element by calling the ``documentElement()`` method on the document object and you can determine what type of element it is by calling ``nodeName()``: .. literalinclude:: /code/210-dom-elements.pl :language: perl :lines: 11-14 Output: .. literalinclude:: /_output/210-dom-elements.pl-out :language: none :lines: 1-2 The ``<book>`` element has four child elements. You can use ``getChildrenByTagName()`` to get a list of all the child elements with a specific element name (this is not a recursive search, it only looks through elements which are direct children): .. literalinclude:: /code/210-dom-elements.pl :language: perl :lines: 15-19 Output: .. literalinclude:: /_output/210-dom-elements.pl-out :language: none :lines: 3-6 If you're not looking for one specific type of element, you can get all the children with ``childNodes()``: .. literalinclude:: /code/210-dom-elements.pl :language: perl :lines: 21-27 We already know that ``<book>`` contains four child elements, so you may be a little surprised to see ``childNodes()`` returns a list of nine nodes: .. literalinclude:: /_output/210-dom-elements.pl-out :language: none :lines: 7-16 If you refer back to the source XML document, you can see that after the ``<book>`` tag and before the ``<title>`` tag there is some whitespace: a line-feed character followed by two spaces at the start of the next line: .. literalinclude:: /code/book.xml :language: xml :linenos: :lines: 2-3 These strings of whitespace are represented in the DOM by 'Text' nodes, which are children of the parent element. So a more accurate DOM diagram would look like this: .. figure:: ../images/dom-full.png :class: illustration :alt: Document Object Model including whitespace-only text nodes Document Object Model including whitespace-only text nodes If you want to filter child nodes by type, XML::LibXML provides a number of constants which you can import when you load the module: .. literalinclude:: /code/210-dom-elements.pl :language: perl :lines: 7 And then you can compare ``$node->nodeType`` to these constants: .. literalinclude:: /code/210-dom-elements.pl :language: perl :lines: 29-35 Output: .. literalinclude:: /_output/210-dom-elements.pl-out :language: none :lines: 17-21 That technique is useful for the general case of filtering child nodes by type, but if you simply want to exclude text nodes that contain only whitespace, you can do that by specifying the ``no_blanks`` option when parsing the source document. This causes ``libxml`` to discard those 'blank' text nodes rather than adding them into the DOM: .. literalinclude:: /code/211-dom-elements-no-blanks.pl :language: perl :lines: 9 Output: .. literalinclude:: /_output/211-dom-elements-no-blanks.pl-out :language: none Blank text nodes are really only a problem if you use the low-level DOM methods for walking through child nodes. You'll generally find that it's much easier to just use ``findnodes()`` and :doc:`xpath` to select exactly the elements or other nodes you want. If the blank nodes don't match your selector then they won't be returned in the result set. 'Text' objects -------------- The green boxes in the picture represent 'Text' nodes. The reference documentation for the `XML::LibXML::Text <https://metacpan.org/pod/XML::LibXML::Text>`_ class lists a small number of methods and many more are inherited from the Node class. There are numerous ways to get the text string out of a Text object but it's important to be clear on whether you want the text as it appears in the XML (including any entity escaping) or whether you want the plain text that the source represents. Consider this tiny source document: .. literalinclude:: /code/fish-and-chips.xml :language: xml :linenos: :lines: 1 And these different methods for accessing the text: .. literalinclude:: /code/220-dom-text-nodes.pl :language: perl :lines: 11-19 Producing this output: .. literalinclude:: /_output/220-dom-text-nodes.pl-out :language: none :lines: 1-6 The ``data()`` and ``nodeValue()`` methods are essentially aliases. The ``to_literal()`` method produces the same output via a more complex route, but has the advantage that you can call it on any object in the DOM. The ``toString()`` method is really only useful for serialising a whole DOM or a DOM fragment out to XML. Stringification is particularly handy when you just want to print an object out for debugging purposes. 'Attr' objects -------------- The red boxes in the picture represent attributes. You're unlikely to ever need to deal with attribute **objects** since it's easier to get and set attribute values by calling methods on an Element object and passing in plain string values. An even easier approach is to use the :ref:`tied hash interface <tied-attribute-hash>` that allows you to treat each element as if it were a hashref and access attribute values via hash keys: .. literalinclude:: /code/230-dom-attributes.pl :language: perl :lines: 11-15 Output: .. literalinclude:: /_output/230-dom-attributes.pl-out :language: none :lines: 1-2 The class name for the attribute objects is 'Attr' - the unfortunate truncation of the class name derives from the `W3C DOM spec <https://www.w3.org/TR/DOM-Level-3-Core/core.html#ID-637646024>`_. The reference documentation is at: `XML::LibXML::Attr <https://metacpan.org/pod/XML::LibXML::Attr>`_. Some additional methods are inherited from the Node class but not all the Node methods work with Attr objects (once again due to behaviour specified by the W3C DOM). .. literalinclude:: /code/240-dom-attr.pl :language: perl :lines: 11-22 Output: .. literalinclude:: /_output/240-dom-attr.pl-out :language: none :lines: 1-4 'NodeList' objects ------------------ The 'NodeList' object is a part of the DOM that makes sense in DOM implementations for other languages (e.g.: Java) but doesn't make much sense in Perl. Methods such as ``childNodes()`` or ``findnodes()`` that may need to return multiple nodes, return a 'NodeList' object which contains the matching nodes and allows the caller to iterate through the result set: .. literalinclude:: /code/250-dom-nodelist.pl :language: perl :lines: 13-19 Output: .. literalinclude:: /_output/250-dom-nodelist.pl-out :language: none :lines: 1-5 But things don't need to be that complicated in Perl - if a method needs to return a list of values then it can just return a list of values. So the Perl bindings for DOM methods that would return a NodeList check the calling context. If called in a scalar context, they return a NodeList object (as above) but in a list context they just return the list of values - much simpler: .. literalinclude:: /code/250-dom-nodelist.pl :language: perl :lines: 23-25 When you execute a search that you expect should match exactly one node, take care to still use list context: .. literalinclude:: /code/250-dom-nodelist.pl :language: perl :lines: 29-31 Output: .. literalinclude:: /_output/250-dom-nodelist.pl-out :language: none :lines: 12-13 In this example, the assignment ``my($dim) = ...`` uses parentheses to force list context, so ``findnodes()`` will return a list of Element nodes and the first will be assigned to ``$dim``. Without the parentheses, a NodeList would be assigned to ``$dim``. If for some reason you find yourself with a NodeList object you can extract the contents as a simple list with ``$result->get_nodelist``. The NodeList object does implement the ``to_literal()`` method, which returns the text content of all the nodes, concatenated together as a single string. If you need a list of individual string values, you can use ``$result->to_literal_list()``: .. literalinclude:: /code/250-dom-nodelist.pl :language: perl :lines: 35 Output: .. literalinclude:: /_output/250-dom-nodelist.pl-out :language: none :lines: 15 Modifying the DOM ----------------- If you wish to modify the DOM, you can create new nodes and add them into the node hierarchy in the appropriate place. You can also modify, move and delete existing nodes. Let's start with a simple XML document: .. literalinclude:: /code/260-dom-modification.pl :language: perl :lines: 9-14 Navigate to the ``<event>`` element; change its text content; add an attribute and print out the resulting XML: .. literalinclude:: /code/260-dom-modification.pl :language: perl :lines: 16-21 Output: .. literalinclude:: /_output/260-dom-modification.pl-out :language: none :lines: 1-4 You can use ``$dom->createElement`` to create a new element and then add it to an existing node's list of child nodes. You can append it to the end of the list of children or insert it before/after a specific existing child: .. literalinclude:: /code/260-dom-modification.pl :language: perl :lines: 25-33 Output: .. literalinclude:: /_output/260-dom-modification.pl-out :language: none :lines: 7-10 Unfortunately that output is probably messier than you were expecting. To get nicely indented XML output, you'd need to create text nodes containing a newline and the appropriate number of spaces for indentation; and then add those text nodes in before each new element. Or, an easier way would be to pass the numeric value ``1`` to the ``toString()`` method as a flag indicating that you'd like the output auto-indented: .. literalinclude:: /code/260-dom-modification.pl :language: perl :lines: 35 Output: .. literalinclude:: /_output/260-dom-modification.pl-out :language: none :lines: 12-15 But sadly that didn't seem to work. The ``libxml`` library won't add indentation to *'mixed content'* - an element whose list of child nodes contains a mixture of both Element nodes and Text nodes. In this case the ``<record>`` element contains mixed content (there's a whitespace text node before the ``<event>`` and another after it) so ``libxml`` does not try to indent its contents. If we strip out those extra text nodes then libxml will add indenting: .. literalinclude:: /code/260-dom-modification.pl :language: perl :lines: 39-42 Output: .. literalinclude:: /_output/260-dom-modification.pl-out :language: none :lines: 17-22 While that did work, it required some rather specific knowledge of the document structure. We were relying on knowing that all the text children of the ``<record>`` element were whitespace-only and could be discarded. Here's a more generic approach which searches recursively through the document and deletes every text node that contains only whitespace: .. literalinclude:: /code/260-dom-modification.pl :language: perl :lines: 48-50 That code is a little tricky so some explanation is probably in order: * The loop does not declare a loop variable, so ``$_`` is used implicitly. * The trailing ``unless`` clause runs a regex comparison against ``$_`` which implicitly calls ``toString()`` on the Text node. * ``unless /\S/`` is a double negative which means *"unless the text contains a non-whitespace character"*. * the ``removeChild()`` method needs to be called on the *parent* of the node we're removing, so if the Text node is whitespace-only then we need to use ``parentNode()``. Of course an even simpler solution in this case would have been to turn on the ``no_blanks`` option (described earlier) when parsing the initial XML document. Another handy method for adding to the DOM is ``appendWellBalancedChunk()``. This method takes a string containing a fragment of XML. It must be well balanced - each opening tag must have a matching closing tag and elements must be properly nested. The XML fragment is parsed to create a `XML::LibXML::DocumentFragment <https://metacpan.org/pod/XML::LibXML::DocumentFragment>`_ which is then appended to the target element: .. literalinclude:: /code/260-dom-modification.pl :language: perl :lines: 54-57 Output: .. literalinclude:: /_output/260-dom-modification.pl-out :language: none :lines: 31-39 One 'gotcha' with the ``appendWellBalancedChunk()`` method is that the XML parsing phase expects a string of bytes. So if you have a Perl string that might contain non-ASCII characters, you first need to encode the character string to a byte string in UTF-8 and then pass the byte string to ``appendWellBalancedChunk()``: .. literalinclude:: /code/260-dom-modification.pl :language: perl :lines: 62-63 Creating a new Document ----------------------- You can create a document from scratch by calling ``XML::LibXML::Document->new()`` rather than parsing from an existing document. Then use the methods discussed above to add elements and text content: .. literalinclude:: /code/270-dom-from-scratch.pl :language: perl :lines: 1-16 In this example, the document encoding was declared as UTF-8 when the Document object was created. Text content was added by calling ``appendText()`` and passing it a normal Perl character string - which happened to contain some non-ASCII characters. When opening the file for output it is not necessary to use an encoding layer since the output from ``libxml`` will already be encoded as utf-8 bytes. The file contents look like this: .. literalinclude:: /_output/270-dom-from-scratch.pl-out :language: none :lines: 1-2 If we hex-dump the file we can see the `e-acute character <http://www.mclean.net.nz/ucf/?c=U+00E9>`_ was written out as the 2-byte UTF-8 sequence ``C3 A9`` and the `euro symbol <http://www.mclean.net.nz/ucf/?c=U+20AC>`_ was written as a 3-byte UTF-8 sequence: ``E2 82 AC``: .. literalinclude:: /_output/270-dom-from-scratch.pl-out :language: none :lines: 4-8 To output the document in a different encoding all you need to do is change the second parameter passed to ``new()`` when creating the Document object. No other code changes are required: .. literalinclude:: /code/271-dom-from-scratch-latin1.pl :language: perl :lines: 10 This time when hex-dumping the file we can see the e-acute character was written out as the single byte ``E9`` and the euro symbol which cannot be represented directly in Latin-1 was written in numeric character entity form ``€``: .. literalinclude:: /_output/271-dom-from-scratch-latin1.pl-out :language: none :lines: 1-6 If you're generating XML from scratch then creating and assembling DOM nodes is very fiddly and ``XML::LibXML`` might not be the best tool for the job. `XML::Generator <https://metacpan.org/pod/XML::Generator>`_ is an excellent module for generating XML - especially if you need to deal with namespaces.