I have already provided many small examples throughout this chapter, but it is a good idea for you to see a full example of working with this extension. While thinking about the best way to demonstrate the use of SAX, I remembered that many DOM parsers are built upon SAX. This example will create a DOM parser using this extension yet leverage the DOM API for the tree creation. I realize this may be pointless since DOM already builds a tree from data, but you could also modify the example with custom objects or containers to create a DOM parser without the use of the DOM extension. This example also utilizes much of the functionality within the xml extension, making it an interesting example all around.
Here’s the code:
class cXML extends DOMDocument {
    private $currentNode = NULL;
    public $separator = ":";

    public function __construct() {
        parent::__construct();
        $this->currentNode = $this;
    }

    function startElement($parser, $data, $attrs) {
        try {
            $nsElement = explode($this->separator, $data);
            if (count($nsElement) > 1) {
                $uri = array_shift($nsElement);
                $name = implode($this->separator, $nsElement);
                $node = $this->createElementNS($uri, $name);
            } else {
                $node = $this->createElement($data);
            }
            $this->currentNode = $this->currentNode->appendChild($node);
            foreach ($attrs AS $name=>$value) {
                $nsAttribute = explode($this->separator, $name);
                if (count($nsAttribute) > 1) {
                    $uri = array_shift($nsAttribute);
                    $name = implode($this->separator, $nsAttribute);
                    $node = $this->currentNode->setAttributeNS($uri, $name, $value);
                } else {
                    $this->currentNode->setAttribute($name, $value);
                }
            }
        } catch (DOMException $e) {
            throw $e;
        }
    }

    function endElement($parser, $data) {
        $this->currentNode = $this->currentNode->parentNode;
    }

    function characterData($parser, $data) {
        try {
            $this->currentNode->appendChild(new DOMText($data));
        } catch (DOMException $e) {
            throw $e;
        }
    }

    function PIHandler($parser, $target, $data) {
        $node = $this->createProcessingInstruction($target, $data);
        $this->currentNode->appendChild($node);
    }
}
The first step is to define the class that will be used to handle the events. In this case, the class simply extends the DOMDocument class. Not only is it kind of neat to be able to use an extended DOM object within the xml parser, but because the DOM API is being used to create the tree, extending it also gives the event handlers direct access to the DOMDocument object.
Two properties are defined first. The private $currentNode property is used within the methods to keep a handle on the current element in scope. The public $separator property is used for namespaced documents, so the separator used by the xml parser is known and can be used to extract information. The use of these will become clearer as the methods are broken down.
The constructor sets up the initial environment here. When the object is instantiated, the currentNode property needs to be set to point to the instantiated object. At this point, anything that happens as a result of parsing the XML data will be performed within the scope of the DOMDocument. Before looking at the startElement() and endElement() methods, let’s jump down to the PIHandler() method.
PIs are valid prior to the document element, so this is a nice and simple method to start with. The xml parser passes the target and data for a PI to the handler. All this method does is create a new PI node and append it to the node specified by the currentNode property. As I said, this is a simple starting point.
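To see the PI event in isolation, here is a minimal sketch that uses a plain DOMDocument and a closure rather than the cXML class. The xml-stylesheet PI and its style.css value are made-up sample data, and the variable names are mine, chosen only for illustration.

```php
<?php
// Sketch only: a bare DOMDocument stands in for the cXML class.
$doc = new DOMDocument();
$parser = xml_parser_create();

// The handler receives the PI target and its data as separate strings.
xml_set_processing_instruction_handler(
    $parser,
    function ($parser, $target, $data) use ($doc) {
        $doc->appendChild($doc->createProcessingInstruction($target, $data));
    }
);

// Assumed sample document: one PI ahead of the document element.
$xml = '<?xml-stylesheet href="style.css" type="text/css"?><root/>';
xml_parse($parser, $xml, true);

// No element handler is registered, so the PI is the only node in the tree.
$pi = $doc->firstChild;
print $pi->target . "\n";
print $pi->data . "\n";
```

Because the PI arrives before any start tag, it is appended directly to the document node, which is the same thing that happens in the class while currentNode still points at the document.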
The startElement() method is a bit more complex. This example code was created to be able to process and create namespaced documents. This is where the separator property comes into play. The extension prefixes local names with the full namespace URI, separated by a user-definable character. Many namespaces, such as URLs, contain the colon character, so a different character will be used here. The property just allows the character to be set rather than hard-coded into the class definition, allowing for a bit more flexibility.
The first thing the startElement() method does is explode the tag name being passed in from the xml parser. As long as the separator in use is not contained in any namespaces, the resulting array will either contain a single value, indicating that the element is not in a namespace, or contain at least two values, indicating that the element is in a namespace. In that case the namespace is at index 0 of the array, and the local name follows, starting at index 1.
If no namespace exists, a new element node is created normally. If a namespace does exist, the namespace is extracted from the array, and the tag name is built by imploding the remaining array. The implode() function is called in case the separator character is also part of the tag name, because the local name would then have been split and needs to be put back together. Once the namespace and local name are sorted out, a namespaced element node is created.
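The explode and implode steps can be tried on their own. The following is a small sketch, not part of the cXML class; the splitQualifiedName() helper and the sample names are hypothetical and exist only to show the array handling.

```php
<?php
// Hypothetical helper: split a name reported by a namespace-aware parser
// into array(namespace URI or null, local name).
function splitQualifiedName($data, $separator)
{
    $parts = explode($separator, $data);
    if (count($parts) > 1) {
        // The first piece is the namespace URI.
        $uri = array_shift($parts);
        // Re-join the rest in case the separator also occurs in the local name.
        return [$uri, implode($separator, $parts)];
    }
    // No separator found: the name is not in a namespace.
    return [null, $data];
}

// Assumed sample inputs, using "@" as the separator.
$namespaced = splitQualifiedName('http://www.example.com/ns@title', '@');
$plain      = splitQualifiedName('title', '@');
$rejoined   = splitQualifiedName('urn:example@first@second', '@');

var_dump($namespaced, $plain, $rejoined);
```

The last call shows why implode() is needed: the local name first@second is split by explode() and has to be glued back together.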
The new node is then appended as a child of the node referenced by the currentNode property, and currentNode is set to this new element. Once a start tag is encountered, the scope moves down a level into the subtree.
Attributes are handled next. The xml parser passes an array of name/value pairs to the startElement() method holding all the attributes for the element. Namespaces are handled in the same fashion as elements, and the rest of the code should be easy to dissect. The only difference is how attribute nodes are created and appended, which is out of the scope of this chapter; you can find more information about this in Chapter 6.
Just like the start tag moving the scope down a level into the subtree, an ending tag will move the scope up one level. The endElement() method just changes the scope to the parent of the node referenced by the currentNode property. Any processing that occurs after an ending tag occurs on the parent of the node that just ended. Any time the startElement() method is called, a corresponding endElement() method will be called. This is even true for empty-element tags. A tag like <element1 /> will issue both the start element event and the end element event.
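A quick way to convince yourself of this pairing is to log the events. This is a minimal sketch using closures and an $events array of my own naming; it does not use the cXML class.

```php
<?php
// Record every start and end element event in the order it fires.
$events = [];
$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$events) {
        $events[] = "start:$name";
    },
    function ($parser, $name) use (&$events) {
        $events[] = "end:$name";
    }
);

// The empty-element tag still produces a start/end pair.
xml_parse($parser, '<root><element1 /></root>', true);

print implode("\n", $events) . "\n";
// Prints:
// start:root
// start:element1
// end:element1
// end:root
```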
The last method for this class is the characterData() method. This method will handle the character data events. Anything being handled by this is created as a text node within the tree. It is currently not possible to determine what type of data it is because this method handles character data, CDATA, and entity references. The text node is just added as a child of the node referenced by the currentNode property.
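The following sketch shows all three kinds of content arriving through the same handler. The sample document is made up, and the handler simply concatenates whatever it receives, because the number of calls the parser makes for a given run of text is not something to rely on.

```php
<?php
// Collect everything delivered to the character data handler.
$text = '';
$parser = xml_parser_create();
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$text) {
        // Plain text, the expanded &amp; entity, and the CDATA content
        // all arrive here with nothing to tell them apart.
        $text .= $data;
    }
);

// Assumed sample: plain text, a predefined entity, and a CDATA section.
xml_parse($parser, '<r>a &amp; b <![CDATA[<c>]]></r>', true);

print $text . "\n"; // prints: a & b <c>
```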
That defines everything currently within the class. It is not complete and will not work with all documents. For example, prefixes for namespaces are lost, which results in problems because namespaces are being created as default namespaces in every instance. Namespaced attributes will not work correctly either. As mentioned, because of the current state of the character data handler, everything sent there is created as a text node. If you have the desire to do so, expanding upon this example will provide you with some great experience of working with XML and the xml extension. You have the option to continue using the DOM API, which will also give you more exposure to the DOM extension, or create your own custom tree handling routines, which will allow you to work with a DOM-like API without the need for the DOM extension. The latter has been done before, such as with the XML_Tree class; you can find this class in the PEAR repository, and it is referenced later in Chapter 13. The following code puts the cXML class to use:
$xml_parser = xml_parser_create_ns(NULL, "@");
$objXMLDoc = new cXML();
$objXMLDoc->separator = "@";
xml_set_object($xml_parser, $objXMLDoc);
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");
xml_set_processing_instruction_handler($xml_parser, "PIHandler");
/* The following can be changed to any XML document */
$xmldata = "<root><element1>text</element1><e2>text<e3>more</e3>text</e2></root>";
try {
    if (! xml_parse($xml_parser, $xmldata, true)) {
        $xmlError = libxml_get_last_error();
        var_dump($xmlError);
    }
} catch (DOMException $e) {
    var_dump($e);
}
xml_parser_free($xml_parser);
print $objXMLDoc->saveXML();
The remainder of this example is straightforward. A namespace-aware parser is created using the default encoding and using @ for the namespace separator. The object for event handling is created, and its separator property is set to the separator used for the parser. The object is then registered with the parser, and its methods are registered to handle events. The case folding option is disabled, leaving the tag names in their native case rather than forcing them all to uppercase.
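The effect of the case folding option is easy to check in isolation. This sketch uses a hypothetical firstTagName() helper of my own that reports the name passed to the start element handler with the option on and off.

```php
<?php
// Hypothetical helper: return the name reported for the first start tag.
function firstTagName($xml, $caseFolding)
{
    $name = null;
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, $caseFolding);
    xml_set_element_handler(
        $parser,
        function ($parser, $tag, $attrs) use (&$name) {
            if ($name === null) {
                $name = $tag;
            }
        },
        function ($parser, $tag) {
            // End tags are not needed for this check.
        }
    );
    xml_parse($parser, $xml, true);
    return $name;
}

$folded = firstTagName('<Root/>', 1); // case folding on (the default)
$native = firstTagName('<Root/>', 0); // case folding off

print "$folded $native\n"; // prints: ROOT Root
```

Since XML is case sensitive, leaving folding enabled would produce a tree whose element names no longer match the source document.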
This example hard-codes an XML document in the $xmldata variable. Parsing is then performed all at once. Feel free to try different documents and even stream the document in chunks; the results should be the same. This example uses try/catch blocks because it uses the DOM extension. The DOM calls made by the handlers will throw exceptions in certain cases, so this just ensures they are caught and handled properly.
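If you want to try the chunked approach, the following sketch shows one way to do it. The collectStartTags() helper and the chunk size of 16 bytes are arbitrary choices of mine; the helper records element names rather than building a tree, so the two runs are easy to compare.

```php
<?php
// Hypothetical helper: feed a document to the parser in fixed-size pieces
// and return the element names in the order their start tags were seen.
function collectStartTags($xml, $chunkSize)
{
    $names = [];
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attrs) use (&$names) {
            $names[] = $name;
        },
        function ($parser, $name) {
            // Nothing to do for end tags here.
        }
    );
    foreach (str_split($xml, $chunkSize) as $chunk) {
        // false: more data is still to come.
        if (!xml_parse($parser, $chunk, false)) {
            $message = xml_error_string(xml_get_error_code($parser));
            throw new RuntimeException((string) $message);
        }
    }
    // An empty final call tells the parser the document is complete.
    if (!xml_parse($parser, '', true)) {
        $message = xml_error_string(xml_get_error_code($parser));
        throw new RuntimeException((string) $message);
    }
    return $names;
}

$xmldata = "<root><element1>text</element1><e2>text<e3>more</e3>text</e2></root>";
$whole   = collectStartTags($xmldata, strlen($xmldata));
$chunked = collectStartTags($xmldata, 16);

var_dump($whole === $chunked);
```

One thing to keep in mind when streaming into the cXML class is that a run of text split across chunks may reach characterData() in more pieces, giving several adjacent text nodes; the serialized output is unaffected.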
To recap, this demonstrates how to use the xml extension in a semi-real-world case; I say this because using the DOM API here is pointless unless only certain pieces of the document are actually to be built. It may also help with many of the concepts of XML. If you have little to no experience using XML, some of these concepts may be new to you. Being able to see the construction of XML from both a stream parsing view and a tree-based view makes it a bit easier to understand how everything fits together. This example is far from complete and is prone to error when used with namespaced documents. Fixing the issues and possibly creating a tree structure without using the DOM API is an exercise I will leave up to you.
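As one possible starting point for the lost-prefix problem, the xml extension can report namespace declarations through xml_set_start_namespace_decl_handler(). The sketch below only records the prefix for each URI; the $prefixes array and the sample document are my own, and wiring the result into createElementNS() (for example, by building a prefix:localname qualified name) is left as part of the exercise.

```php
<?php
// Map each namespace URI to the prefix it was declared with.
$prefixes = [];
$parser = xml_parser_create_ns(null, '@');
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

// Called once for each namespace declaration found on a start tag.
xml_set_start_namespace_decl_handler(
    $parser,
    function ($parser, $prefix, $uri) use (&$prefixes) {
        $prefixes[$uri] = $prefix;
    }
);

// Assumed sample document with a single prefixed namespace.
$xml = '<ex:root xmlns:ex="http://www.example.com/ns"><ex:child/></ex:root>';
xml_parse($parser, $xml, true);

var_dump($prefixes);
```

This sketch deliberately sticks to a prefixed namespace; how a default namespace declaration is reported is something you should check against your own PHP build before relying on it.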