Bob McWhirter



Related Topics: Apache Web Server Journal, XML Magazine


Introduction to SAX Path and Jaxen


The W3C (http://w3c.org/) defined XML as a data model. Soon thereafter, work was started to define XPath, the language for addressing parts of an XML document. XPath isn't a technology for performing queries upon an XML document - that's the realm of the XQuery specification - but rather a simpler method for addressing or matching parts of a document.

In the world of XML document usage there's a clear separation between the parser, which can recognize XML tokens and validate to some extent the semantics of the document (using a DTD or a schema), and the application, which uses the data. Interfaces such as the W3C DOM provide an object tree-based representation of the document to the application programmer, whereas SAX (the Simple API for XML) provides a callback-based interface that delivers the document as a top-to-bottom stream of events.

With SAX, events are fired at the start of each <tag>, for any character data or nested tags contained therein, and at the end of each <tag>. Additionally, events are signaled for other constructs, such as processing instructions and comments.
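The event order described above can be seen with a minimal sketch that uses the SAX parser bundled with the JDK. The class name SaxEventDemo and the event labels are illustrative assumptions, not part of SAX itself. Whitespace-only text is skipped to keep the trace short.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventDemo {

    /** Parses the XML text and returns the SAX events in the order they fired. */
    static List<String> collectEvents(String xml) {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                events.add("start:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Ignore whitespace-only chunks so the trace stays readable.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    events.add("text:" + text);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }

            @Override
            public void processingInstruction(String target, String data) {
                events.add("pi:" + target);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return events;
    }

    public static void main(String[] args) {
        for (String event : collectEvents("<a><?note hi?><b>text</b></a>")) {
            System.out.println(event);
        }
    }
}
```

The handler never sees the document as a whole. It only reacts to each callback as the parser moves through the input, which is the model SAXPath borrows for XPath expressions.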

While XML parsing technology has come a long way, providing multiple methods for interacting with a document, the same can't be said for related technologies like XPath. The parsing of an XPath expression is usually tightly coupled to the application using it, typically an XSLT processor. But as XPath becomes embedded in more and more XML specifications, such as C14N (XML Canonicalization) and XPointer, a clean separation between parsing XPath and using it becomes more valuable. This is the aim of the SAXPath project.

Additionally, with the proliferation of specialty object models for representing XML documents, each model currently requires its own implementation of an XPath and XSLT engine. The aim of the Jaxen project is to act as a buffer between XPath expressions and object models, allowing a single XPath implementation to operate on many models. Jaxen is one of the first projects to be built on the SAXPath API.

SAXPath
SAXPath is modeled closely on the structure used by David Megginson with SAX. The two most commonly used interfaces are org.saxpath.XPathReader and org.saxpath.XPathHandler.

Applications that wish to handle the parse events must implement the XPathHandler interface, which receives events from a parser that implements the XPathReader interface. More generally, the XPathReader interface extends SAXPathEventSource, which allows SAXPath events to come from basically any source, not only from a string parser. Any parser that correctly implements the XPathReader interface should plug into your application without changes to your handler code.

The com.werken.saxpath.XPathReader parser is included with the SAXPath distribution and serves as the default, though any other available implementation may be used.

Getting Started
To use SAXPath, you must first be able to instantiate a parser. This is done with the aid of a helper class, which can instantiate a parser from an explicit class name parameter, from a class name given in a Java system property, or from the default implementation.

First you must import the necessary helper class, the XPathReader interface, and the exception class:

import org.saxpath.XPathReader;
import org.saxpath.SAXPathException;
import org.saxpath.helpers.XPathReaderFactory;

Next, use one of the createReader() methods. The simplest form takes no arguments and instantiates either the class named by the org.saxpath.driver system property or the default implementation (see Listing 1).

This method examines the org.saxpath.driver property for a fully qualified class name (such as com.werken.saxpath.XPathReader). If that property isn't set, the default parser class name is used.
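The lookup just described (system property first, default class name otherwise, then reflective instantiation) can be sketched in plain Java. This is not the SAXPath source. The Reader interface and the two implementations below are hypothetical stand-ins, and only the property name org.saxpath.driver is taken from the text.

```java
class ReaderFactorySketch {

    /** Hypothetical stand-in for a parser interface such as XPathReader. */
    public interface Reader {
        String name();
    }

    /** Hypothetical default implementation, used when the property is not set. */
    public static class DefaultReader implements Reader {
        public String name() { return "default"; }
    }

    /** Hypothetical alternative implementation, selected by class name. */
    public static class AlternateReader implements Reader {
        public String name() { return "alternate"; }
    }

    static final String DRIVER_PROPERTY = "org.saxpath.driver";

    /** Uses the class named by the system property, or the default if it is unset. */
    static Reader createReader() {
        String className = System.getProperty(DRIVER_PROPERTY, DefaultReader.class.getName());
        return createReader(className);
    }

    /** Loads the named class and instantiates it through its no-argument constructor. */
    static Reader createReader(String className) {
        try {
            Class<?> type = Class.forName(className);
            return (Reader) type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot create reader: " + className, e);
        }
    }

    public static void main(String[] args) {
        // No property set: falls back to the default implementation.
        System.out.println(createReader().name());

        // Property set: the named implementation is used instead.
        System.setProperty(DRIVER_PROPERTY, AlternateReader.class.getName());
        System.out.println(createReader().name());
    }
}
```

Because the factory knows its implementations only by name, a different parser can be swapped in with a -D flag on the command line and no recompilation.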

If you wish to have direct control from your application, simply use the flavor of createReader() that takes a String class name parameter:

XPathReader reader = XPathReaderFactory.createReader("com.werken.saxpath.XPathReader");

Before you can do anything useful with the new XPathReader, you must register an XPathHandler implementation with it (this also requires importing org.saxpath.XPathHandler).

XPathHandler handler = new MyXPathHandler();
reader.setXPathHandler( handler );

All that's required now to receive parse events from the XPathReader is to pass in an XPath expression for parsing, potentially catching required exceptions (see Listing 2).

Your XPathHandler will now receive events matching a recursive-descent parse of the XPath expression.
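To make the register-then-parse flow concrete, here is a deliberately tiny sketch of the same callback idea. It does not use the SAXPath API: PathHandler and SimplePathReader are hypothetical, and the toy reader understands only slash-separated name steps rather than full XPath.

```java
import java.util.ArrayList;
import java.util.List;

class PathEventSketch {

    /** Hypothetical, much-reduced analogue of a parse-event handler. */
    interface PathHandler {
        void startPath(boolean absolute);
        void nameStep(String name);
        void endPath();
    }

    /** Toy reader: understands only slash-separated child name steps. */
    static class SimplePathReader {
        private PathHandler handler;

        void setHandler(PathHandler handler) {
            this.handler = handler;
        }

        void parse(String path) {
            if (handler == null) {
                throw new IllegalStateException("No handler registered");
            }
            boolean absolute = path.startsWith("/");
            handler.startPath(absolute);
            String body = absolute ? path.substring(1) : path;
            // A limit of -1 keeps trailing empty strings, so "a/" is rejected too.
            for (String step : body.split("/", -1)) {
                if (step.isEmpty()) {
                    throw new IllegalArgumentException("Empty step in: " + path);
                }
                handler.nameStep(step);
            }
            handler.endPath();
        }
    }

    /** Registers a recording handler, parses the path, and returns the events. */
    static List<String> trace(String path) {
        List<String> events = new ArrayList<>();
        SimplePathReader reader = new SimplePathReader();
        reader.setHandler(new PathHandler() {
            public void startPath(boolean absolute) {
                events.add("startPath(absolute=" + absolute + ")");
            }
            public void nameStep(String name) {
                events.add("nameStep(" + name + ")");
            }
            public void endPath() {
                events.add("endPath");
            }
        });
        reader.parse(path);
        return events;
    }

    public static void main(String[] args) {
        for (String event : trace("/foo/bar")) {
            System.out.println(event);
        }
    }
}
```

The reader knows nothing about what the handler does with the events, so one handler could build an expression tree while another merely validates syntax.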

For more information about working directly with the SAXPath parse events, go to http://saxpath.org/.

Jaxen
Motivation

I initially created werken.xpath to be an XPath engine to support Jason Hunter and Brett McLaughlin's JDOM project. When James Strachan created dom4j, he also ported the werken.xpath library to his new object model. This was effectively a fork of the original werken.xpath code, and created maintenance headaches as it became difficult to migrate bug fixes between the dom4j and JDOM versions.

Upon further inspection, James and I realized that XPath is specified only in terms of the XML InfoSet for retrieving and navigating an XML document. With this realization, Jaxen was created as an XPath engine that works with object model adapters that provide a uniform InfoSet-centric view of any object model. Through this, a single core XPath engine could be maintained for many models, with only a thin adapter required for each.

Getting Started
While Jaxen uses SAXPath under the covers to parse XPath expressions, this fact remains mostly hidden from the user. Each model supported by Jaxen has its own package (such as org.jaxen.dom4j.* or org.jaxen.jdom.*). Once you determine the correct package to use for your model, all other code is identical. To use a different model, simply adjust the import statements as needed. To use Jaxen with dom4j documents, the required import statement is:

import org.jaxen.dom4j.XPath;

For JDOM, you'd use:

import org.jaxen.jdom.XPath;

In either case, all other code is the same (see Listing 3).

XPath objects are fully reentrant and thread-safe. They contain no internal state for evaluation and thus can be cached easily and shared within an application. Once you have an XPath object, you can apply it against various initial contexts and retrieve results in several different ways:

  • You can select a single node (which selects only the first matching node for the given expression):

    xpath.selectSingleNode ( initialContextObject );

  • You can select all matching nodes:

    xpath.selectNodes ( initialContextObject );

  • You can select a Number interpretation of the expression:

    xpath.numberValueOf ( initialContextObject );

  • You can select a simple String value interpretation of the expression:

    xpath.valueOf( initialContextObject );
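The four kinds of result listed above (first node, all nodes, number, string) can be tried out without any extra libraries. The sketch below uses the JDK's javax.xml.xpath API on a DOM tree as a stand-in; it is not Jaxen's API, and the class name XPathResultKinds and the sample document are illustrative assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XPathResultKinds {

    /** Evaluates one expression four ways and summarizes the results. */
    static String describe(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPathExpression expr = XPathFactory.newInstance().newXPath().compile(expression);

            // Single node: only the first match in document order.
            Node first = (Node) expr.evaluate(doc, XPathConstants.NODE);
            // Node set: every match.
            NodeList all = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
            // Number: the string value of the first match, converted to a number.
            Double number = (Double) expr.evaluate(doc, XPathConstants.NUMBER);
            // String: the string value of the first match.
            String string = (String) expr.evaluate(doc, XPathConstants.STRING);

            return "first=" + (first == null ? "none" : first.getTextContent())
                    + " count=" + all.getLength()
                    + " number=" + number
                    + " string=" + string;
        } catch (ParserConfigurationException | SAXException | IOException
                | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("<cart><item>3</item><item>4</item></cart>", "/cart/item"));
    }
}
```

The number and string forms follow the XPath 1.0 conversion rules, which look only at the first node of a node-set in document order.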

Abstracting Away the Object Model
Jaxen was deliberately designed to be flexible, open, and useful for many purposes. Through the use of the Adapter pattern, implemented through the Navigator interface, virtually any object model can be accommodated. The interface that Jaxen needs from an object model is basically that of the W3C InfoSet specification. Thus Navigator has methods corresponding to many aspects of the InfoSet. Access to the various XPath axes is provided through Java's own Iterator pattern, so Jaxen has little impact on the performance of each model. It would be awkward and inefficient to require collections to be expressed as a specific type, such as java.util.List, because some models, such as DOM, have their own collection objects, such as NodeList. By requiring only the much simpler contract provided by a read-only Iterator, Jaxen doesn't introduce additional inefficiencies.
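The adapter idea can be illustrated with a reduced sketch. The Navigator interface below is a hypothetical two-method cut-down written for this example, not Jaxen's actual interface, and TreeNode is a made-up object model. The point is that one model-independent select() routine works over both a List-based model and a DOM NodeList, because each adapter exposes its children only as an Iterator.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NavigatorSketch {

    /** Hypothetical, reduced adapter: the engine sees every model only through this. */
    interface Navigator {
        String getElementName(Object node);
        Iterator<Object> getChildAxisIterator(Object node);
    }

    /** Made-up object model that stores children in a java.util.List. */
    static class TreeNode {
        final String name;
        final List<TreeNode> children;
        TreeNode(String name, TreeNode... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    static class TreeNodeNavigator implements Navigator {
        public String getElementName(Object node) { return ((TreeNode) node).name; }
        public Iterator<Object> getChildAxisIterator(Object node) {
            return Collections.<Object>unmodifiableList(((TreeNode) node).children).iterator();
        }
    }

    /** DOM keeps children in a NodeList, so this adapter walks it by index. */
    static class DomNavigator implements Navigator {
        public String getElementName(Object node) { return ((Node) node).getNodeName(); }
        public Iterator<Object> getChildAxisIterator(Object node) {
            final NodeList kids = ((Node) node).getChildNodes();
            return new Iterator<Object>() {
                private int index = 0;
                public boolean hasNext() { return index < kids.getLength(); }
                public Object next() {
                    if (!hasNext()) throw new NoSuchElementException();
                    return kids.item(index++);
                }
            };
        }
    }

    /** Model-independent "engine": follows child steps by element name. */
    static List<Object> select(Navigator nav, Object context, String... steps) {
        List<Object> current = Collections.singletonList(context);
        for (String step : steps) {
            List<Object> next = new ArrayList<>();
            for (Object node : current) {
                Iterator<Object> children = nav.getChildAxisIterator(node);
                while (children.hasNext()) {
                    Object child = children.next();
                    if (step.equals(nav.getElementName(child))) next.add(child);
                }
            }
            current = next;
        }
        return current;
    }

    static Document parseDom(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        TreeNode root = new TreeNode("library",
                new TreeNode("book"), new TreeNode("dvd"), new TreeNode("book"));
        System.out.println(select(new TreeNodeNavigator(), root, "book").size());

        Document doc = parseDom("<library><book/><dvd/><book/></library>");
        System.out.println(select(new DomNavigator(), doc, "library", "book").size());
    }
}
```

Neither adapter copies its children into a new collection. Each hands back a read-only iterator over the model's native storage, which is the efficiency argument made above.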

The Navigator mechanism, along with the already developed implementations, could also be useful in other applications. Apache Xalan, an XSLT processor, for example, currently supports DOM trees handily. Other models can be accommodated through implementations of javax.xml.transform.sax.SAXSource, but not natively. It should be possible to rework Xalan to use the Navigator interface and support many models natively.

As already mentioned, using the SAXPath event API maintains loose coupling between the parsing and evaluation components, which increases reusability.

Future Directions
James Strachan is currently working on a new project, betwixt, which will be able to provide an XML representation of arbitrary JavaBeans. Additionally, he'll be providing an implementation of Navigator that will use betwixt to allow XPath expression evaluation on JavaBeans. Interest has also been expressed in a Jaxen-based XQL engine.

Jaxen can certainly be integrated into many existing applications that need only lightweight XPath evaluation and not an entire XSLT engine. For example, David Megginson's NewsML Toolkit uses Jaxen on DOM trees.

Since both SAXPath and Jaxen use a flavor of the BSD license, developers are free to use them in both open-source and commercial projects without limitation.

Getting Involved
Both SAXPath and Jaxen are open to contributors, and are hosted at SourceForge (http://sourceforge.net/). Jaxen, in particular, needs users of currently unsupported object models to create the DocumentNavigator adapters that would give those models full XPath support. The SAXPath API could easily be retargeted to other languages, such as C++ or Python.

Resources

  • SAXPath Web site: http://saxpath.org/
  • SAXPath SourceForge project: http://sourceforge.net/projects/saxpath
  • Jaxen Web site: http://jaxen.org/
  • Jaxen SourceForge project: http://sourceforge.net/projects/jaxen/
  • dom4j Web site: http://dom4j.org/
  • JDOM Web site: http://jdom.org/
  • EXML Web site: http://themindelectric.com/
  • W3C XPath Specification: http://www.w3.org/TR/xpath

Bob McWhirter is an open-source developer who has created and contributed to several open-source projects, including ANTLR, jakarta-velocity, werken.opt, werken.xpath, SAXPath, and Jaxen. He's a member of the JDOM JSR Expert Group.
