Examples of nux.xom.xquery.StreamingPathFilter

esearch.ibm.com/xj/pubs/icde.pdf [Background Paper, More Papers]. In other words, XQuery and XPath are hard to stream over very large or infinitely long XML inputs without violating some aspects of the W3C specifications. However, subsets of these languages (or simplified cousins) can easily support streaming.

In fact, most use cases dealing with very large XML input documents do not require the full forward and backward navigational capabilities of XQuery and XPath across independent element subtrees. Rather those use cases are record oriented, treating element subtrees (i.e. records) independently, individually selecting/projecting/transforming record after record, one record at a time. For example, consider an XML document with one million records, each describing a published book, music album or web server log entry. A query to find the titles of books that have more than three authors looks at each record individually, hence can easily be streamed. Another use case is splitting a document into several sub-documents based on the content of each record.

More interestingly, consider a P2P XML content messaging router, network transducer, transcoder, proxy or message queue that continuously filters, transforms, routes and dispatches messages from infinitely long streams, with the behaviour defined by deeply inspecting rules (i.e. queries) based on content, network parameters or other metadata.

This class provides a convenient solution for such common use cases operating on very large or infinitely long XML input. The solution uses a strongly simplified location path language (which is modelled after XPath but is not XPath compliant), in combination with a {@link nu.xom.NodeFactory} and an optional {@link XQuery}. The solution is not necessarily faster than building the full document tree, but it consumes much less main memory.

Here is how it works

You specify a simple "location path" such as /books/book or /weblogs/_2004/_05/entry. The path may contain wildcards and indicates which elements should be retained. All elements not matching the path are thrown away during parsing. Each retained element is fully built (including its ancestors and descendants) and then made available to the application via a callback to an application-provided {@link StreamingTransform} object.
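To make the filter-then-callback idea concrete, here is a minimal sketch that uses only the JDK's StAX parser rather than Nux or XOM. The class `StreamingRecordSketch`, the `RecordHandler` callback and the flat "fields" map are assumptions made for illustration; they are not part of this library. The sketch assumes flat records whose fields are direct children containing plain text.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustrative sketch only: uses the JDK's StAX parser, not Nux or XOM.
final class StreamingRecordSketch {

    /** Hypothetical callback that plays the role of a StreamingTransform. */
    interface RecordHandler {
        void onRecord(Map<String, List<String>> fields);
    }

    /**
     * Reports each element whose path from the root equals recordPath (child axis only).
     * Only the fields of the current record are held in memory at any time.
     */
    static void stream(Reader in, List<String> recordPath, RecordHandler handler)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        List<String> path = new ArrayList<>();    // local names from root to current element
        Map<String, List<String>> fields = null;  // non-null only while inside a record
        StringBuilder text = new StringBuilder();
        int depth = recordPath.size();
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                path.add(reader.getLocalName());
                text.setLength(0);
                if (path.equals(recordPath)) {
                    fields = new LinkedHashMap<>();  // a new record begins
                }
            } else if (event == XMLStreamConstants.CHARACTERS
                    || event == XMLStreamConstants.CDATA) {
                text.append(reader.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                if (fields != null && path.size() == depth + 1) {
                    // a direct child of the record has ended: store its text
                    String name = path.get(depth);
                    List<String> values = fields.get(name);
                    if (values == null) {
                        values = new ArrayList<>();
                        fields.put(name, values);
                    }
                    values.add(text.toString().trim());
                } else if (fields != null && path.size() == depth) {
                    handler.onRecord(fields);  // record complete: hand it to the callback
                    fields = null;             // then forget it
                }
                path.remove(path.size() - 1);
            }
        }
        reader.close();
    }

    /** Returns the titles of /books/book records that have more than three authors. */
    static List<String> titlesWithManyAuthors(String xml) {
        final List<String> titles = new ArrayList<>();
        try {
            stream(new StringReader(xml), Arrays.asList("books", "book"), fields -> {
                List<String> none = Collections.<String>emptyList();
                if (fields.getOrDefault("author", none).size() > 3) {
                    titles.addAll(fields.getOrDefault("title", none));
                }
            });
        } catch (XMLStreamException e) {
            throw new IllegalStateException("malformed XML input", e);
        }
        return titles;
    }

    public static void main(String[] args) {
        String xml = "<books>"
            + "<book><title>Streaming XML</title><author>A</author><author>B</author>"
            + "<author>C</author><author>D</author></book>"
            + "<book><title>Solo</title><author>E</author></book>"
            + "</books>";
        System.out.println(titlesWithManyAuthors(xml));
    }
}
```

Because a record is dropped as soon as the handler returns, memory use is bounded by the largest single record rather than by the size of the input.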

The StreamingTransform can operate on the fully built element (subtree) in arbitrary ways. For example, it can simply print the element to screen or disk and then forget about it. Or it can add the element (subtree) to the document currently being built by the {@link nu.xom.Builder}. In addition, a transform can check conditions such as "does this book have more than three authors?" A transform can also replace the element with a different element or a list of arbitrary generated nodes. For example, if a book has more than three authors, just the book title with an authorCount attribute can be added to the document, instead of the entire book element subtree.
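The drop-or-replace behaviour of a transform can be sketched with plain Java collections. Everything in this block (`TransformContractSketch`, `Book`, `BookTransform`, and the use of strings in place of nodes) is a hypothetical stand-in chosen so the example runs without Nux or XOM; it is not the library's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: plain Java stand-ins, not the Nux or XOM types.
final class TransformContractSketch {

    /** Hypothetical stand-in for a fully built "book" element. */
    static final class Book {
        final String title;
        final List<String> authors;

        Book(String title, String... authors) {
            this.title = title;
            this.authors = Arrays.asList(authors);
        }
    }

    /** Hypothetical stand-in for a transform: the returned nodes replace the record. */
    interface BookTransform {
        List<String> transform(Book book);
    }

    /** Keeps only a compact title element for books with more than three authors. */
    static final BookTransform COMPACT = book -> {
        if (book.authors.size() <= 3) {
            return Collections.<String>emptyList();  // empty result: the record is dropped
        }
        // Replacement result: one small node instead of the whole subtree.
        // No XML escaping is done here; real code would build nodes, not strings.
        return Collections.singletonList(
            "<title authorCount=\"" + book.authors.size() + "\">" + book.title + "</title>");
    };

    /** Applies the transform record by record and collects whatever it returns. */
    static List<String> apply(List<Book> records, BookTransform transform) {
        List<String> document = new ArrayList<>();
        for (Book book : records) {
            document.addAll(transform.transform(book));
        }
        return document;
    }

    public static void main(String[] args) {
        List<Book> books = Arrays.asList(
            new Book("Streaming XML", "A", "B", "C", "D"),
            new Book("Solo", "E"));
        System.out.println(apply(books, COMPACT));
    }
}
```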

Typically, simple StreamingTransforms are formulated in custom Java code, whereas complex ones are formulated as an {@link XQuery}.

Streaming Location Path Syntax

    locationPath := {'/'step}...
    step         := [prefix':']localName
    prefix       := '*' | '' | XMLNamespacePrefix
    localName    := '*' | XMLLocalName

A location path consists of zero or more location steps separated by "/". A step consists of an optional XML namespace prefix followed by a local name. The wildcard symbol '*' means: Match anything. An empty prefix ('') means: Match if in no namespace (i.e. null namespace).

Example legal location steps are:

    book       (Match elements named "book" in no namespace)
    :book      (Match elements named "book" in no namespace)
    bib:book   (Match elements named "book" in "bib" namespace)
    bib:*      (Match elements with any name in "bib" namespace)
    *:book     (Match elements named "book" in any namespace, including no namespace)
    *:*        (Match elements with any name in any namespace, including no namespace)
    :*         (Match elements with any name in no namespace)
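The step grammar and the matching rules listed above can be captured in a few lines. The following is a minimal sketch, not the parser this class actually uses: the class name `LocationStepSketch` and its methods are hypothetical, and it assumes that an element in no namespace reports the empty string as its namespace URI.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: not the parser used by StreamingPathFilter.
final class LocationStepSketch {
    private final String namespaceURI;  // null = any namespace, "" = no namespace
    private final String localName;     // null = any local name

    private LocationStepSketch(String namespaceURI, String localName) {
        this.namespaceURI = namespaceURI;
        this.localName = localName;
    }

    /** Parses one step of the form [prefix':']localName, resolving prefixes via the map. */
    static LocationStepSketch parse(String step, Map<String, String> prefixes) {
        int colon = step.indexOf(':');
        String prefix = colon < 0 ? "" : step.substring(0, colon);
        String local = step.substring(colon + 1);
        if (local.isEmpty() || local.indexOf(':') >= 0) {
            throw new IllegalArgumentException("illegal location step: " + step);
        }
        String uri;
        if (prefix.equals("*")) {
            uri = null;                   // wildcard: any namespace, including none
        } else if (prefix.isEmpty()) {
            uri = "";                     // missing or empty prefix: no namespace
        } else {
            uri = prefixes.get(prefix);
            if (uri == null) {
                throw new IllegalArgumentException("undeclared prefix: " + prefix);
            }
        }
        return new LocationStepSketch(uri, local.equals("*") ? null : local);
    }

    /** Returns true if an element with the given namespace URI and local name matches. */
    boolean matches(String elementNamespaceURI, String elementLocalName) {
        String uri = elementNamespaceURI == null ? "" : elementNamespaceURI;
        return (namespaceURI == null || namespaceURI.equals(uri))
            && (localName == null || localName.equals(elementLocalName));
    }

    public static void main(String[] args) {
        Map<String, String> prefixes = new HashMap<>();
        prefixes.put("bib", "http://www.example.org/bookshelve/records");
        String bib = prefixes.get("bib");
        System.out.println(parse("bib:*", prefixes).matches(bib, "book"));  // true
        System.out.println(parse(":book", prefixes).matches(bib, "book"));  // false
    }
}
```

Resolving each prefix to a namespace URI once, at parse time, keeps the per-element check down to two string comparisons.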

Obviously, the location path language is quite simplistic, supporting the "child" axis only. For example, axes such as descendant ("//"), ancestor, following and preceding, as well as predicates and other XPath features, are not supported. Typically, this does not matter though, because a full XQuery can still be used on each element (subtree) matching the location path, as follows:

Example Usage

The following is complete and efficient code for parsing and iterating through millions of "person" records in a database-like XML document, printing all residents of "San Francisco", while never allocating more memory than needed to hold one person element:

    StreamingTransform myTransform = new StreamingTransform() {
        public Nodes transform(Element person) {
            Nodes results = XQueryUtil.xquery(person, "name[../address/city = 'San Francisco']");
            if (results.size() > 0) {
                System.out.println("name = " + results.get(0).getValue());
            }
            return new Nodes(); // mark current element as subject to garbage collection
        }
    };

    // parse document with a filtering Builder
    Builder builder = new Builder(new StreamingPathFilter("/persons/person", null)
        .createNodeFactory(null, myTransform));
    builder.build(new File("/tmp/persons.xml"));

To find the title of all books that have more than three authors and have 'Monterey' and 'Aquarium' somewhere in the title:

    String path = "/books/book";
    Map prefixes = new HashMap();
    prefixes.put("bib", "http://www.example.org/bookshelve/records");
    prefixes.put("xsd", "http://www.w3.org/2001/XMLSchema");

    StreamingTransform myTransform = new StreamingTransform() {
        private Nodes NONE = new Nodes();

        // execute XQuery against each element matching location path
        public Nodes transform(Element subtree) {
            Nodes results = XQueryUtil.xquery(subtree,
                "title[matches(., 'Monterey') and matches(., 'Aquarium') and count(../author) > 3]");

            for (int i = 0; i < results.size(); i++) {
                // do something useful with query results; here we just print them
                System.out.println(XOMUtil.toPrettyXML(results.get(i)));
            }

            // returning empty node list removes current subtree from document being built.
            // returning new Nodes(subtree) retains the current subtree.
            // returning new Nodes(some other nodes) replaces the current subtree with
            // some other nodes.
            // if you want (SAX) parsing to terminate at this point, simply throw an exception
            return NONE; // current subtree becomes subject to garbage collection
        }
    };

    // parse document with a filtering Builder
    StreamingPathFilter filter = new StreamingPathFilter(path, prefixes);
    Builder builder = new Builder(filter.createNodeFactory(null, myTransform));
    Document doc = builder.build(new File("/tmp/books.xml"));
    System.out.println("doc.size()=" + doc.getRootElement().getChildElements().size());
    System.out.println(XOMUtil.toPrettyXML(doc));

Here is a similar snippet that takes a filtering Builder from a thread-safe pool with optimized parser configuration:

    // ... same as above; myTransform must be declared final to be used below ...
    final StreamingPathFilter filter = new StreamingPathFilter(path, prefixes);

    BuilderPool pool = new BuilderPool(100, new BuilderFactory() {
        protected Builder newBuilder(XMLReader parser, boolean validate) {
            return new Builder(parser, validate, filter.createNodeFactory(null, myTransform));
        }
    });

    Builder builder = pool.getBuilder(false);
    Document doc = builder.build(new File("/tmp/books.xml"));
    System.out.println("doc.size()=" + doc.getRootElement().getChildElements().size());

Applicability

This class is well suited for a P2P XML content messaging router, network transducer, transcoder, proxy or message queue that continuously filters, transforms, routes and dispatches messages from infinitely long streams.

However, this class is less suited for classic database-oriented use cases. Here, scalability is limited because the input stream is scanned sequentially, without exploiting the indexing and random access properties typical of (relational) database environments. For such database-oriented use cases, consider using the Saxon SQL extension functions to XQuery, or consider building your own mixed relational/XQuery integration layer, or consider using a database technology with native XQuery support.

@author whoschek.AT.lbl.DOT.gov
@author $Author: hoschek3 $
@version $Revision: 1.63 $, $Date: 2005/08/12 21:26:30 $