package Text::Scraper;
use strict;
use Carp;
our $VERSION = '0.02';
=pod
=head1 NAME
Text::Scraper - Structured data from (un)structured text
=head1 SYNOPSIS
use Text::Scraper;
use LWP::Simple;
use Data::Dumper;
#
# 1. Get our template and source text
#
my $tmpl = Text::Scraper->slurp(\*DATA);
my $src = get('http://search.cpan.org/recent') || die $!;
#
# 2. Extract data from source
#
my $obj = Text::Scraper->new(tmpl => $tmpl);
my $data = $obj->scrape($src);
#
# 3. Do something really neat...(left as excercise)
#
print "Newest Submission: ", $data->[0]{submissions}[0]{name}, "\n\n";
print "Scraper model:\n", Dumper($obj), "\n\n";
print "Parsed model:\n", Dumper($data) , "\n\n";
__DATA__
=head1 ABSTRACT
Text::Scraper provides a fully functional base-class to quickly develop
I and other text extraction tools. Programmatically
generated text such as dynamic webpages are trivially reversed engineered.
Using templates, the programmer is freed from staring at fragile, heavily
escaped regular expressions, mapping capture groups to named variables or
wrestling with the DOM and badly formed HTML. In addition, extracted data
can be hierarchical, which is beyond the capabilities of vanilla regular
expressions.
Text::Scraper's functionality overlaps some existing CPAN modules -
L and L.
Text::Scraper is much more lightweight than either and has a
more general application domain than the latter. It has no dependencies on
other frameworks, modules or design-decisions. On average, Text::Scraper
benchmarks around I<250% faster> than Template::Extract - and uses
significantly less memory.
Unlike both existing modules, Text::Scraper generalizes its functionality
to allow the programmer to refine template capture groups beyond C<(.*?)>,
fully redefine the template syntax and introduce new template constructs
bound to custom classes.
=head1 BACKGROUND
Using templates is a popular method of seperating visual presentation from
programming logic - particularly popular in programs generating dynamic webpages.
Text::Scraper reverses this process, using templates to I the data
back out of the surrounding presentation.
If you are familiar with templating concepts, then the L should be sufficient
to get you started. If not, I would recommend reading the documentation for
L - a module thats syntax and terminology is very
similar to Text::Scraper's.
=head1 DESCRIPTION
Template Tags are classed as I or I. Like XML, Branches must
have an associated closing tag, Leaves must not. By default, Leaf nodes return
SCALARs and Branch nodes return ARRAYs of HASHes - each array element mapping
to a matched sub-sequence. Blessing or filtering this data is left as an
exercise for subclasses.
The default syntax is based on the XML preprocessor syntax:
and for Branches:
...
By default, Tags I be named and any closing tag I include the name of the
opening tag it is closing. Attributes have the same syntax as XML attributes -
but (similar to Perl regular expressions) can use any non-bracket punctuation character
as quotation delimiters:
The only attribute acted on by the default tag classes is C - used to refine how
the Tag is translated into a regular-expression capture group:
This can be used to further filter the parsed data - similar to using grep:
Each tag should create I capture group - but it is fine to make the outer
group non-capturing:
I C<02 July 1979>.
=head2 Default Tags
The default tags provided by Text::Scraper are typical for basic scraping but can be
subclassed for additional functionality. All the default tags are demonstrated in the
L:
=over 4
=item B
Vars represent strings of text in a template. They are instances of
C.
=item B
Stuff tags represent spans of text that are of no interest in the
extracted data, but can ease parsing in certain situations. They are instances
of C - a subclass of C.
=item B
Loops represent repeated information in a template and are extracted as an
array of hashes. They are instances of C.
=item B
A conditional region in the template. If not present, the parent scope
will contain a false value under the tags name. Otherwise the value will be true
and any tags inside the if's scope will be exported to its parent scope also.
These are instances of C.
=back
=head1 User API
These methods alone are sufficient for a basic scraping session:
=cut
my $null = bless \$0, "NULL";
my %protos = ();
sub TRACE () {0;}
=pod
=head2 C<< my $string = Text::Scraper->slurp( STRING|GLOBREF ) >>
Static utility method to return either a filename or filehandle as a string
=cut
sub slurp
{
my $class = shift;
my $file = shift;
my $data = undef;
local $/ = undef;
if(!ref $file){
open my $f, $file or Carp::croak("$class\::slurp: '$file' $!");
$data = <$f>;
close $f;
}
elsif(ref $file eq 'GLOB'){
$data = <$file>;
}
else{
Carp::croak("$class\::slurp: bad argument '$file'\n");
}
return $data;
}
=pod
=head2 C<< my $object = Text::Scraper->new(HASH) >>
Returns a new Text::Scraper object. Optional parameters are:
=over 4
=item B
A template as a string
=item B
A Text::Scraper::Syntax instance. See L.
=back
=cut
sub new
{
my $proto = shift;
my $class = ref($proto) || $proto;
Carp::croak("Bad key/value arguments to $class::new") if @_ % 2;
my $self = bless {@_}, $class;
$protos{$self} = $proto
unless $proto eq $class;
$self->parse_attr(delete $self->{attributes})
if exists $self->{attributes};
$self->compile(delete $self->{tmpl})
if $self->{tmpl};
$self->on_create();
return $self;
}
sub DESTROY
{
my $self = shift;
$self->on_destroy();
delete $protos{$self};
return;
}
=pod
=head2 C<< $obj->compile(STRING) >>
Only required for recompilation or if no B parameter is passed to the constructor.
=cut
sub compile
{
my $self = shift;
my $tmpl = shift;
my $syntax = $self->{syntax} || Text::Scraper::Syntax->new();
if($tmpl && $syntax)
{
$self->{tmpl} = $tmpl;
$self->{syntax}= $syntax;
$self->{nodes} = [];
my $rex_leaf = $syntax->{regex}{leaf};
my $rex_open = $syntax->{regex}{open};
my $rex_close = $syntax->{regex}{close};
my $rex_escape = $syntax->{regex}{escape};
1 while $tmpl =~ s#$rex_open(?!=$rex_open.*?$rex_close)(.*?)$rex_close#$self->de_branch($1,$2,$3,$7)#sge;
1 while $tmpl =~ s#$rex_leaf#$self->de_leaf($1,$2,$3)#sge;
# TODO: Can this third substitution on escape sequences be removed?
# May requires double escape on all above regex...slower?
$tmpl = $syntax->quote($tmpl);
$tmpl =~ s/$rex_escape/$self->{nodes}[$1]->to_regex()/esg;
$self->{compiled} = $tmpl;
$self->{nodes} = [ grep { $_ != $null } @{$self->{nodes}} ];
}
}
#
# Compile scopes and replace with internal leafs
#
sub de_branch
{
my($self, $type, $name, $args, $body) = @_;
my $nodes = $self->{nodes};
my $idx = scalar @$nodes;
my $types = $self->{syntax}{branches};
Carp::croak("Invalid branch-type '$type'")
unless $types->{$type};
my $node = $types->{$type}->new(tmpl => $body, syntax => $self->{syntax}, type => $type, class => $types->{$type}, name => $name, attributes => $args);
push @$nodes, $node;
return $self->{syntax}->create_internal_leaf_string( $node, $idx );
}
#
# Insert leafs and branches in correct order (use $null to maintain indexes)
#
sub de_leaf
{
my($self,$type,$name,$args) = @_;
my $nodes = $self->{nodes};
my $idx = scalar @$nodes;
if($type =~ /^\d+$/o){
push @$nodes, splice(@$nodes, $type, 1, $null);
}
else{
my $types = $self->{syntax}{leaves};
Carp::croak("Invalid leaf-type '$type'")
unless $types->{$type};
push @$nodes, $types->{$type}->new(syntax => $self->{syntax}, type => $type, class => $types->{$type}, name => $name, attributes => $args);
}
return $self->{syntax}->create_escape_string($idx);
}
#
# NB: Prepends '$' to user attributes to seperate from private
#
sub parse_attr
{
my $self = shift;
my $args = shift;
if(defined $args && length $args){
while($args =~ /(\w+)\s*=\s*(\W)(.*?)\2/sg){
$self->{"\$$1"} = $3;
}
}
}
=pod
=head2 C<< my $data = $obj->scrape(STRING) >>
Extract data from STRING based on compiled template.
=cut
# NB: $parent and $scope arguments are used internally to allow
# nodes to modify their parent, such as Text::Scraper::Conditional
sub scrape
{
my ($self, $text, $parent, $scope) = @_;
my $tmpl = $self->{compiled};
my $nodes = $self->{nodes};
return $self->on_data($text)
if($self->isa('Text::Scraper::Leaf'));
Carp::croak("$self->{name}: Cannot scrape without a compiled template!")
unless $tmpl;
$text =~ s/\s+/ /sg
unless $parent;
my @matches = ($text =~ /$tmpl/gs);
my $symbols = undef;
my $returns = [];
TRACE && print STDERR "$self matches: ",scalar @matches,"\n";
for(my $i=0; $i<@matches; $i++)
{
my $mod = $i % scalar @$nodes;
my $node = $nodes->[$mod];
my $name = $node->{name};
if($mod==0)
{
push @$returns, $symbols if $symbols;
$symbols = {};
}
next if $node->ignore();
$symbols->{$name} = $node->scrape($matches[$i], $self, $symbols);
}
push @$returns, $symbols if $symbols;
return $self->on_data($returns);
}
=pod
=head1 Subclass API
Text::Scraper allows its users to define custom tags and bless captured
data into custom classes. Because Text::Scraper objects are prototype
based, a subclass can both inherit the scraping logic and also encapsulate
any particular instance of the scraped data.
During template compilation, a single instance of each tag type is created
as the I. Its attributes will be related to the tag, any
supplied tag attributes, etc. During scraping, each prototype is invoked
to scrape the relevent I against its I.
=head2 C<< $subclass->on_create() >>
General construction callback. Text::Scraper objects are prototype based so
overriding the constructor is not recommended. Objects are hash based; any
constructor arguments become attributes of the new instance before invoking
this method.
=head2 C<< $subclass->on_destroy() >>
General destruction callback. Text::Scraper uses the DESTROY hook so any
custom functionality is best implemented here.
=head2 C<< $subclass->on_data(SCALAR) >>
This is the subclasses opportunity to bless or otherwise process any parsed
data. The return value from C is added to the generated output
data-structure. By default these values are just returned unblessed.
The SCALAR argument depends on the class of tag. For C
subclasses, SCALAR will be the matched text. For C
subclasses, SCALAR will be a reference to an array of hashes. Below is an
example of two custom tag classes that bless captured data into the same
class:
package Myleaf;
use base "Text::Scraper::Leaf";
sub on_data
{
my ($self, $match) = @_;
return $self->new(value => $match);
}
package MyBranch;
use base "Text::Scraper::Branch";
sub on_data
{
my ($self, $matches) = @_;
@$matches = map { $self->new(%$_) } @$matches;
return $matches;
}
=head2 C<< my $regex = $subclass->to_regex() >>
Returns this nodes representation as a regular expression, to be used
in a compiled template. If you find yourself using a particular regex
attribute a lot, it might be easier to define a custom tag that overloads
this method.
=head2 C<< my $boolean = $subclass->ignore() >>
Returns a boolean value stating whether the parser should ignore the data
captured by this object.
=head2 C<< $subclass->proto() $subclass->proto(SCALAR) >>
Utility method to allow Tag instances to access (attributes of) their prototype.
This can be safely called from a prototype object, which just points to itself.
=head2 C<< my @children = $subclass->nodes() >>
Returns instance data I, including any present conditional data.
=cut
sub on_data
{
return $_[1];
}
sub on_create
{
my $self = shift;
}
sub on_destroy
{
my $self = shift;
}
sub to_regex
{
my $self = shift;
return $self->{"\$regex"} || '(.*?)';
}
sub ignore
{
return 0;
}
sub nodes
{
my $self = shift;
my $for = shift || $self;
my $proto = $self->proto();
return @{$self->{nodes}}
if($proto == $self);
my @vals = ();
foreach my $n ( @{$proto->{nodes}} )
{
my $val = $for->{$n->{name}};
next unless $val;
push @vals, $val;
push @vals, $val->nodes($for)
if $val->isa('Text::Scraper::Conditional');
}
return @vals;
}
sub proto
{
my $self = shift;
my $attr = shift;
my $proto = $protos{$self} || $self;
return ($attr == undef) ? $proto :
(defined $proto->{"\$$attr"}) ? $proto->{"\$$attr"} :
(defined $proto->{$attr}) ? $proto->{$attr} : undef;
}
#
# Inherits all behaviour from Text::Scraper
#
package Text::Scraper::Branch;
our @ISA = ('Text::Scraper');
#
#
#
package Text::Scraper::Leaf;
our @ISA = ('Text::Scraper');
#
#
#
package Text::Scraper::Conditional;
our @ISA = ('Text::Scraper');
sub scrape
{
my ($self, $text, $parent, $scope) = @_;
my $data = $self->SUPER::scrape($text, $parent, $scope);
my $tag;
$data = shift @$data;
unless($data){
$tag = 0;
}
else{
$scope->{$_} = $data->{$_} foreach keys %$data;
$tag = 1;
}
return $tag;
}
sub to_regex
{
my $self = shift;
return $self->SUPER::to_regex()."?";
}
package Text::Scraper::Ignorable;
our @ISA = ('Text::Scraper::Leaf');
# TODO: Currently ignorables still capture their text...
# which makes for a more elegent algorithm over efficiency.
sub ignore
{
1;
}
=pod
=head2 Defining a custom syntax
The two areas of customization are Tag Syntax and Tag Classes. The defaults are
encapsulated in the I class.
The interested reader is encouraged to copy the source of the default syntax class
and play around with changes. All the over-ridable methods begin with B
and are fairly self explanatory or well commented.
Any new Tag classes should be subclassed from either I,
I, I or I.
=cut
package Text::Scraper::Syntax;
#
# Map tag types to classes
#
sub define_class_leaves
{
return (var => 'Text::Scraper::Leaf', stuff => 'Text::Scraper::Ignorable');
}
sub define_class_branches
{
return (loop => 'Text::Scraper::Branch', if => 'Text::Scraper::Conditional');
}
#
# Tag Syntax:
# TYPE, NAME, ATTRIBUTES, BACKREF, and ESCAPE are special
# markers that are substituted with either regular
# expressions or values.
#
sub define_syntax_leaf
{
'';
}
sub define_syntax_branch_open
{
'';
}
sub define_syntax_branch_close
{
'';
}
#
# Escape sequences must never appear in input text
#
sub define_syntax_escape
{
"$;ESCAPE$;";
}
#
# BACKREF must be able to match 2 unique identifiers
# in nested branch nodes, hence \2\5. If you change
# the order of TYPE and NAME, this will need updated.
#
sub define_backref
{
'(?:\2|\5)';
}
#
# The methods below should NOT be overridden in custom Syntax subclasses
#
sub new
{
my $class = shift;
my $self = bless {}, $class;
my $bref = $self->define_backref();
my %tokens = (NAME => '(\w+)',TYPE => '((?:\w+|\d+))', ATTRIBUTES => '(.*?)?', ESCAPE => '(\d+?)', BACKREF => $bref );
my $tokes = join('|', keys %tokens);
# Load valid types:
$self->{branches} = { $self->define_class_branches() };
$self->{leaves} = { $self->define_class_leaves() };
my $syn =
{
leaf => $self->define_syntax_leaf(),
open => $self->define_syntax_branch_open(),
close => $self->define_syntax_branch_close(),
escape => $self->define_syntax_escape()
};
# Create regexen from syntax:
# 'escape' is a special case as it is invoked as a regex AFTER
# whole tmpl has been escaped - requiring double "escapation"
my $rex = {};
$rex->{$_} = $self->quote($syn->{$_}) foreach keys %$syn;
$rex->{escape} = $self->quote(quotemeta($syn->{escape}));
# Insert token regexes into compiled regex
$_ =~ s/($tokes)/$tokens{$1}/sg
foreach values %$rex;
$self->{syntax} = $syn;
$self->{regex} = $rex;
return $self;
}
#
# Compact and escape a template
# TODO: needs knowledge of preserver_whitespace options
#
sub quote
{
my $self = shift;
my $tmpl = shift;
$tmpl =~ s/\s+/ /sgo;
$tmpl = qr/\Q$tmpl\E/;
$tmpl =~ s/\\\s/\\s*/sgo;
return $tmpl;
}
#
# Create the syntax specific escaped index (CANNOT clash with template data)
#
sub create_escape_string
{
my $self = shift;
my $num = shift;
my $str = $self->define_syntax_escape();
$str =~ s/ESCAPE/$num/;
return $str;
}
#
# Create syntax for an internal leaf referencing an already parsed branch
#
sub create_internal_leaf_string
{
my $self = shift;
my $node = shift;
my $idx = shift;
my $str = $self->define_syntax_leaf();
$str =~ s#TYPE#$idx#;
$str =~ s#NAME#$node->{name}#;
$str =~ s#ATTRIBUTES##;
return $str;
}
=pod
=head1 BUGS & CAVEATS
Rather than write a slow parser in pure Perl, Text::Scraper
farms a lot of the work out to Perl's optimized regular-expression engine.
This works well in general but, unfortunately, doesn't allow for a lot of
error feedback during scraping. A fair understanding of the pros and cons
of using regular expressions in this manner can be beneficial, but is outside
the scope of this documentation.
L can be indespensible in following the success of
your scraping. It can be safely applied to a Text::Scraper instance to analyze
the parser's object model, or to the return value from a C invokation
to analyze what was parsed.
Bug reports and suggestions welcome.
=head1 AUTHOR
Copyright (C) 2005 Chris McEwan - All rights reserved.
Chris McEwan
=head1 LICENSE
This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
=cut
1;