README for the XML::Essex library distribution
==============================================

    ALPHA CODE ALERT: This document promises a bit more than XML::Essex
    delivers at the moment, but not much.  XML::Essex passes about 150
    tests covering all the basics (including thread support, if you have
    a threaded perl) other than that shown in Example 3.

XML::Essex is a combined push/pull XML processing environment.  It
implements 3 SAX processors and a scripting environment:

 XML::Generator::Essex    Generate XML using shorthand subs
 XML::Handler::Essex      Handle XML using pull and matching APIs
 XML::Filter::Essex       Process XML; it's Handler + Generator

 XML::Essex               Write a simple script that uses
                          XML::Filter::Essex under the hood.

These four modules provide:

    - Shorthand event constructors like "start_elt()", "elt()"
    - EventPath (an XPath superset) matching of incoming events
    - Rule based processing: $event_path_expr => sub { ... }
    - Procedural processing: put get while 1;
    - A very perlish SAX oriented DOM
    - Support for large documents using perl threads

Before getting into the details, here are a couple of examples.  The
first two work; the third should work one day soon.

Example 1: Ad hoc filters
=========================

Here's how to write an ad-hoc filter to count the number of elements
in a file:

    use XML::Handler::Essex;

    my $count = 0;

    my $h =  XML::Handler::Essex->new(

        Main => sub {                     ## the main filter
            while (1) {                   ##
                get "start-element::*";   ##
                ++$count;                 ##
            }                             ##
        },                                ##

    );

    ... feed a document to $h ...

    print $count, "\n";

The incoming SAX stream is consumed by the while(1) loop.  get() reads
only the start_element SAX events, which we ignore, other than to count
them.  When get() runs out of SAX events, it throws an exception that
causes the main filter sub to be exited.  This exception is silently
caught and the SAX pipeline exits normally.
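
The control flow described above can be sketched in plain Perl, without
XML::Essex.  This is only an illustration of the idea, not the library's
implementation: get_event(), the @events list and the END_OF_STREAM
marker are all hypothetical names invented for the sketch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an incoming SAX event stream.
my @events = (
    { type => 'start_document' },
    { type => 'start_element', name => 'a' },
    { type => 'start_element', name => 'b' },
    { type => 'end_element',   name => 'b' },
    { type => 'end_element',   name => 'a' },
    { type => 'end_document' },
);

# Hypothetical get(): skip ahead to the next event of the wanted type,
# or die with a marker once the stream is exhausted.
sub get_event {
    my ($wanted) = @_;
    while ( my $event = shift @events ) {
        return $event if $event->{type} eq $wanted;
    }
    die "END_OF_STREAM\n";
}

my $count = 0;

# The "main filter": an endless loop that only the exception can end.
my $ok = eval {
    while (1) {
        get_event('start_element');
        ++$count;
    }
    1;
};

# Only the end-of-stream marker is swallowed; anything else is rethrown.
die $@ if !$ok && $@ ne "END_OF_STREAM\n";

print $count, "\n";    # prints 2
```

The point of the design is that the loop body never has to test for
end of input; running out of events is handled once, outside the loop.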

Here it is in an actual script (see the example/ directory):

    use XML::Handler::Essex;
    use XML::SAX::Machines qw( Pipeline );

    my $count = 0;

    Pipeline(

        XML::Handler::Essex->new(

            Main => sub {                     ## the main filter
                while (1) {                   ##
                    get "start-element::*";   ##
                    ++$count;                 ##
                }                             ##
            },                                ##

        ),

    )->parse_file( \*STDIN );

    print $count, "\n";


Example 2: Subclassing.  Oh, and rule based processing.
=======================================================

Here's how to write the above filter as a subclass and also how to use
rule based processing to handle the same task (these two demonstrations
are independent; we could have used rule based processing in Example 1):


    package My::Counter;

    use base 'XML::Filter::Essex';
    use XML::Filter::Essex;  ## Import some helpful items.

    sub main {
        my $count = 0;

        on(
            "start-element::*" => sub { ++$count },
            "end-document::*"  => sub { put [ "count" => $count ] }
        );

        get while 1;
    }

    1;

This can be used like so:

    use My::Counter;
    use XML::SAX::Machines qw( Pipeline );

    Pipeline( My::Counter => \*STDOUT )->parse_file( \*STDIN );


Example 3: As a standalone script
=================================

NOTE: Still working on this.

Here's how to write the above as a standalone script:


    use XML::Essex;

    my $count = 0;

    while (1) {
        get "start-element::*";
        ++$count;
    }

    print $count;

and here's how to do it with rules:


    use XML::Essex;

    my $count = 0;

    on "start-element::*" => sub { ++$count },
       "end-document::*"  => sub { print $count, "\n" };

    get while 1;


The Problem
===========

XML::Essex aims to provide a combined procedural and pattern matching
scripting and SAX programming environment for XML processing.  It uses
SAX as its infrastructure so that Essex may be used in conjunction with
other processing technologies like XSLT, Perl, XML::Generator::PerlData,
XMLDriver::Excel, etc.  This also enables large document support,
although that can require Essex to cache the document or use Perl's
threading support.

One of the difficulties in using SAX to process documents is keeping
state.  In other words, if you want to work with chunks of the document
that aren't right next to each other in the document, you need to wait
until the first desirable chunk floats down the SAX stream, save some
kind of value in $self, and then wait until the next desirable chunk
happens to float down the event stream.  Lather, rinse, and repeat until
all the desirable chunks have floated by; you might even have to wait
for an end_element or end_document event to be sure you've seen all the
desirable chunks.

Basically, because SAX is totally event driven, you need to catch the
correct events, test them to figure out what they are, and react to them
by modifying some data member in your filter.  XML::Essex lets you keep
state in plain old variables, like $count, above.
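
For contrast, here is a rough sketch of the $self bookkeeping that a
hand-written, purely event driven handler needs for the same counting
task.  It uses no SAX library at all: Sketch::CountingHandler is a
made-up class, and the events are fed to it by hand.

```perl
use strict;
use warnings;

package Sketch::CountingHandler;

sub new {
    my ($class) = @_;
    return bless { count => 0 }, $class;
}

# Called once per opening tag; the running total has to live in $self
# because nothing else survives from one callback to the next.
sub start_element {
    my ( $self, $elt ) = @_;
    $self->{count}++;
    return;
}

# Only at the end of the document is the total known to be complete.
sub end_document {
    my ($self) = @_;
    return $self->{count};
}

package main;

my $handler = Sketch::CountingHandler->new;
$handler->start_element( { Name => $_ } ) for qw( doc item item );
my $total = $handler->end_document( {} );

print "$total\n";    # prints 3
```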

Essex also helps you recognize the desirable chunks using EventPath in
on( pattern => action ) rules, also shown above.  And you can mix the
two styles as necessary using closures.
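
The pattern => action idea with closures can be sketched in plain Perl
as a dispatch table.  This is a simplification made for illustration:
the keys here are bare event type names rather than EventPath
expressions, and %rules and the event list are hypothetical.

```perl
use strict;
use warnings;

my $count = 0;
my @report;

# Hypothetical rule table: event type => action.  Both actions close
# over the lexicals above, so no handler object is needed for state.
my %rules = (
    start_element => sub { ++$count },
    end_document  => sub { push @report, "count=$count" },
);

# Hypothetical event stream, dispatched one event at a time.
for my $type (qw( start_document start_element start_element end_document )) {
    my $action = $rules{$type} or next;    # no rule: ignore the event
    $action->();
}

print "$report[0]\n";    # prints count=2
```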

You can also keep some state in $self if need be.


The DOM
=======

To support all these environments, a SAX oriented DOM (XML::Essex::Model)
is provided that encapsulates SAX events and composite SAX events like
elements in objects that provide terse access to data in them.

For instance:

    my $start_elt = get "start-element::*";
    my $elt       = get "*";

    print $start_elt->{$attr_name}, "\n";
    print $elt->{$attr_name}, "\n";
    print $elt->[$node_number], "\n";

prints the named attribute or the content node referred to by
$node_number.
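
One way a single Perl object can answer to both ->{...} and ->[...] is
dereference overloading.  The sketch below shows that technique only;
it is an assumption about how such an interface could be built, not a
description of how XML::Essex::Model is implemented, and
Sketch::Element is a made-up class.

```perl
use strict;
use warnings;

package Sketch::Element;

# The object is a blessed scalar ref, so the hash and array
# dereferences are free to be overloaded without recursion.
use overload
    '%{}'    => sub { ${ $_[0] }->{attrs} },      # $elt->{name}
    '@{}'    => sub { ${ $_[0] }->{content} },    # $elt->[n]
    fallback => 1;

sub new {
    my ( $class, %args ) = @_;
    my $data = {
        attrs   => $args{attrs}   || {},
        content => $args{content} || [],
    };
    return bless \$data, $class;
}

package main;

my $elt = Sketch::Element->new(
    attrs   => { id => 'x1' },
    content => [ 'hello', 'world' ],
);

print $elt->{id}, "\n";    # prints x1
print $elt->[1],  "\n";    # prints world
```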

    (This part of Essex is likely to change.  In particular, some
    support for EventPath may replace $elt->{$attr_name}, so
    something like $elt->{"@$attr_name"} might be used instead.)


Large Document Support: caching and threading support
=====================================================

For small documents, where small is in relation to a system's available
memory, it's acceptable to let XML::Essex buffer all the events in memory
and then call the main filter when it sees the end_document event.
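
The buffering strategy can be sketched as follows.  This is a toy
written for illustration, not XML::Essex's code: Sketch::BufferingHandler
and its _record() helper are hypothetical, and here the main sub is
simply handed the buffered events as an array reference.

```perl
use strict;
use warnings;

package Sketch::BufferingHandler;

sub new {
    my ( $class, %args ) = @_;
    return bless { main => $args{Main}, buffer => [] }, $class;
}

# Record an event instead of acting on it.
sub _record {
    my ( $self, $type, $data ) = @_;
    push @{ $self->{buffer} }, { type => $type, data => $data };
    return;
}

sub start_document { $_[0]->_record( start_document => $_[1] ) }
sub start_element  { $_[0]->_record( start_element  => $_[1] ) }
sub end_element    { $_[0]->_record( end_element    => $_[1] ) }

# Only now, with the whole document in memory, run the main sub.
sub end_document {
    my ( $self, $data ) = @_;
    $self->_record( end_document => $data );
    return $self->{main}->( $self->{buffer} );
}

package main;

my $h = Sketch::BufferingHandler->new(
    Main => sub {
        my ($events) = @_;
        return scalar grep { $_->{type} eq 'start_element' } @$events;
    },
);

$h->start_document( {} );
$h->start_element( { Name => 'doc' } );
$h->start_element( { Name => 'item' } );
$h->end_element(   { Name => 'item' } );
$h->end_element(   { Name => 'doc' } );
my $count = $h->end_document( {} );

print "$count\n";    # prints 2
```

Memory use grows with the number of events, which is why this approach
only suits documents that are small relative to available memory.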

For larger documents, you can run in a threaded perl and

    use threads;

early on in the main program/script.  XML::Essex will shift gears and
put the main filter in a thread.  For now, all downstream filters also
go in a thread, but this limitation will be addressed.

In the future, XML::Essex will try to use XML::LibXML's push parser
to handle large documents.  This will only be available to XML::Essex
scripts (see Example 3 above) and to XML::Filter::Essex and
XML::Handler::Essex objects that are the first ones in a SAX chain.

Also in the future, XML::Essex will be able to cache large documents on
disk instead of threading.  This approach is needed for two reasons.
The first is that it's unnecessarily limiting to require threading just
to handle large documents.

The second is that Perl's threading support is not designed to handle
this use case efficiently and does a lot of extra work, slowing
XML::Essex down.