package Text::Scraper; use strict; use Carp; our $VERSION = '0.02'; =pod =head1 NAME Text::Scraper - Structured data from (un)structured text =head1 SYNOPSIS use Text::Scraper; use LWP::Simple; use Data::Dumper; # # 1. Get our template and source text # my $tmpl = Text::Scraper->slurp(\*DATA); my $src = get('http://search.cpan.org/recent') || die $!; # # 2. Extract data from source # my $obj = Text::Scraper->new(tmpl => $tmpl); my $data = $obj->scrape($src); # # 3. Do something really neat...(left as excercise) # print "Newest Submission: ", $data->[0]{submissions}[0]{name}, "\n\n"; print "Scraper model:\n", Dumper($obj), "\n\n"; print "Parsed model:\n", Dumper($data) , "\n\n"; __DATA__
=head1 ABSTRACT Text::Scraper provides a fully functional base-class to quickly develop I and other text extraction tools. Programmatically generated text such as dynamic webpages are trivially reversed engineered. Using templates, the programmer is freed from staring at fragile, heavily escaped regular expressions, mapping capture groups to named variables or wrestling with the DOM and badly formed HTML. In addition, extracted data can be hierarchical, which is beyond the capabilities of vanilla regular expressions. Text::Scraper's functionality overlaps some existing CPAN modules - L and L. Text::Scraper is much more lightweight than either and has a more general application domain than the latter. It has no dependencies on other frameworks, modules or design-decisions. On average, Text::Scraper benchmarks around I<250% faster> than Template::Extract - and uses significantly less memory. Unlike both existing modules, Text::Scraper generalizes its functionality to allow the programmer to refine template capture groups beyond C<(.*?)>, fully redefine the template syntax and introduce new template constructs bound to custom classes. =head1 BACKGROUND Using templates is a popular method of seperating visual presentation from programming logic - particularly popular in programs generating dynamic webpages. Text::Scraper reverses this process, using templates to I the data back out of the surrounding presentation. If you are familiar with templating concepts, then the L should be sufficient to get you started. If not, I would recommend reading the documentation for L - a module thats syntax and terminology is very similar to Text::Scraper's. =head1 DESCRIPTION Template Tags are classed as I or I. Like XML, Branches must have an associated closing tag, Leaves must not. By default, Leaf nodes return SCALARs and Branch nodes return ARRAYs of HASHes - each array element mapping to a matched sub-sequence. Blessing or filtering this data is left as an exercise for subclasses. The default syntax is based on the XML preprocessor syntax: and for Branches: ... By default, Tags I be named and any closing tag I include the name of the opening tag it is closing. Attributes have the same syntax as XML attributes - but (similar to Perl regular expressions) can use any non-bracket punctuation character as quotation delimiters: The only attribute acted on by the default tag classes is C - used to refine how the Tag is translated into a regular-expression capture group: This can be used to further filter the parsed data - similar to using grep: Each tag should create I capture group - but it is fine to make the outer group non-capturing: I C<02 July 1979>. =head2 Default Tags The default tags provided by Text::Scraper are typical for basic scraping but can be subclassed for additional functionality. All the default tags are demonstrated in the L: =over 4 =item B Vars represent strings of text in a template. They are instances of C. =item B Stuff tags represent spans of text that are of no interest in the extracted data, but can ease parsing in certain situations. They are instances of C - a subclass of C. =item B Loops represent repeated information in a template and are extracted as an array of hashes. They are instances of C. =item B A conditional region in the template. If not present, the parent scope will contain a false value under the tags name. Otherwise the value will be true and any tags inside the if's scope will be exported to its parent scope also. These are instances of C. =back =head1 User API These methods alone are sufficient for a basic scraping session: =cut my $null = bless \$0, "NULL"; my %protos = (); sub TRACE () {0;} =pod =head2 C<< my $string = Text::Scraper->slurp( STRING|GLOBREF ) >> Static utility method to return either a filename or filehandle as a string =cut sub slurp { my $class = shift; my $file = shift; my $data = undef; local $/ = undef; if(!ref $file){ open my $f, $file or Carp::croak("$class\::slurp: '$file' $!"); $data = <$f>; close $f; } elsif(ref $file eq 'GLOB'){ $data = <$file>; } else{ Carp::croak("$class\::slurp: bad argument '$file'\n"); } return $data; } =pod =head2 C<< my $object = Text::Scraper->new(HASH) >> Returns a new Text::Scraper object. Optional parameters are: =over 4 =item B A template as a string =item B A Text::Scraper::Syntax instance. See L. =back =cut sub new { my $proto = shift; my $class = ref($proto) || $proto; Carp::croak("Bad key/value arguments to $class::new") if @_ % 2; my $self = bless {@_}, $class; $protos{$self} = $proto unless $proto eq $class; $self->parse_attr(delete $self->{attributes}) if exists $self->{attributes}; $self->compile(delete $self->{tmpl}) if $self->{tmpl}; $self->on_create(); return $self; } sub DESTROY { my $self = shift; $self->on_destroy(); delete $protos{$self}; return; } =pod =head2 C<< $obj->compile(STRING) >> Only required for recompilation or if no B parameter is passed to the constructor. =cut sub compile { my $self = shift; my $tmpl = shift; my $syntax = $self->{syntax} || Text::Scraper::Syntax->new(); if($tmpl && $syntax) { $self->{tmpl} = $tmpl; $self->{syntax}= $syntax; $self->{nodes} = []; my $rex_leaf = $syntax->{regex}{leaf}; my $rex_open = $syntax->{regex}{open}; my $rex_close = $syntax->{regex}{close}; my $rex_escape = $syntax->{regex}{escape}; 1 while $tmpl =~ s#$rex_open(?!=$rex_open.*?$rex_close)(.*?)$rex_close#$self->de_branch($1,$2,$3,$7)#sge; 1 while $tmpl =~ s#$rex_leaf#$self->de_leaf($1,$2,$3)#sge; # TODO: Can this third substitution on escape sequences be removed? # May requires double escape on all above regex...slower? $tmpl = $syntax->quote($tmpl); $tmpl =~ s/$rex_escape/$self->{nodes}[$1]->to_regex()/esg; $self->{compiled} = $tmpl; $self->{nodes} = [ grep { $_ != $null } @{$self->{nodes}} ]; } } # # Compile scopes and replace with internal leafs # sub de_branch { my($self, $type, $name, $args, $body) = @_; my $nodes = $self->{nodes}; my $idx = scalar @$nodes; my $types = $self->{syntax}{branches}; Carp::croak("Invalid branch-type '$type'") unless $types->{$type}; my $node = $types->{$type}->new(tmpl => $body, syntax => $self->{syntax}, type => $type, class => $types->{$type}, name => $name, attributes => $args); push @$nodes, $node; return $self->{syntax}->create_internal_leaf_string( $node, $idx ); } # # Insert leafs and branches in correct order (use $null to maintain indexes) # sub de_leaf { my($self,$type,$name,$args) = @_; my $nodes = $self->{nodes}; my $idx = scalar @$nodes; if($type =~ /^\d+$/o){ push @$nodes, splice(@$nodes, $type, 1, $null); } else{ my $types = $self->{syntax}{leaves}; Carp::croak("Invalid leaf-type '$type'") unless $types->{$type}; push @$nodes, $types->{$type}->new(syntax => $self->{syntax}, type => $type, class => $types->{$type}, name => $name, attributes => $args); } return $self->{syntax}->create_escape_string($idx); } # # NB: Prepends '$' to user attributes to seperate from private # sub parse_attr { my $self = shift; my $args = shift; if(defined $args && length $args){ while($args =~ /(\w+)\s*=\s*(\W)(.*?)\2/sg){ $self->{"\$$1"} = $3; } } } =pod =head2 C<< my $data = $obj->scrape(STRING) >> Extract data from STRING based on compiled template. =cut # NB: $parent and $scope arguments are used internally to allow # nodes to modify their parent, such as Text::Scraper::Conditional sub scrape { my ($self, $text, $parent, $scope) = @_; my $tmpl = $self->{compiled}; my $nodes = $self->{nodes}; return $self->on_data($text) if($self->isa('Text::Scraper::Leaf')); Carp::croak("$self->{name}: Cannot scrape without a compiled template!") unless $tmpl; $text =~ s/\s+/ /sg unless $parent; my @matches = ($text =~ /$tmpl/gs); my $symbols = undef; my $returns = []; TRACE && print STDERR "$self matches: ",scalar @matches,"\n"; for(my $i=0; $i<@matches; $i++) { my $mod = $i % scalar @$nodes; my $node = $nodes->[$mod]; my $name = $node->{name}; if($mod==0) { push @$returns, $symbols if $symbols; $symbols = {}; } next if $node->ignore(); $symbols->{$name} = $node->scrape($matches[$i], $self, $symbols); } push @$returns, $symbols if $symbols; return $self->on_data($returns); } =pod =head1 Subclass API Text::Scraper allows its users to define custom tags and bless captured data into custom classes. Because Text::Scraper objects are prototype based, a subclass can both inherit the scraping logic and also encapsulate any particular instance of the scraped data. During template compilation, a single instance of each tag type is created as the I. Its attributes will be related to the tag, any supplied tag attributes, etc. During scraping, each prototype is invoked to scrape the relevent I against its I. =head2 C<< $subclass->on_create() >> General construction callback. Text::Scraper objects are prototype based so overriding the constructor is not recommended. Objects are hash based; any constructor arguments become attributes of the new instance before invoking this method. =head2 C<< $subclass->on_destroy() >> General destruction callback. Text::Scraper uses the DESTROY hook so any custom functionality is best implemented here. =head2 C<< $subclass->on_data(SCALAR) >> This is the subclasses opportunity to bless or otherwise process any parsed data. The return value from C is added to the generated output data-structure. By default these values are just returned unblessed. The SCALAR argument depends on the class of tag. For C subclasses, SCALAR will be the matched text. For C subclasses, SCALAR will be a reference to an array of hashes. Below is an example of two custom tag classes that bless captured data into the same class: package Myleaf; use base "Text::Scraper::Leaf"; sub on_data { my ($self, $match) = @_; return $self->new(value => $match); } package MyBranch; use base "Text::Scraper::Branch"; sub on_data { my ($self, $matches) = @_; @$matches = map { $self->new(%$_) } @$matches; return $matches; } =head2 C<< my $regex = $subclass->to_regex() >> Returns this nodes representation as a regular expression, to be used in a compiled template. If you find yourself using a particular regex attribute a lot, it might be easier to define a custom tag that overloads this method. =head2 C<< my $boolean = $subclass->ignore() >> Returns a boolean value stating whether the parser should ignore the data captured by this object. =head2 C<< $subclass->proto() $subclass->proto(SCALAR) >> Utility method to allow Tag instances to access (attributes of) their prototype. This can be safely called from a prototype object, which just points to itself. =head2 C<< my @children = $subclass->nodes() >> Returns instance data I, including any present conditional data. =cut sub on_data { return $_[1]; } sub on_create { my $self = shift; } sub on_destroy { my $self = shift; } sub to_regex { my $self = shift; return $self->{"\$regex"} || '(.*?)'; } sub ignore { return 0; } sub nodes { my $self = shift; my $for = shift || $self; my $proto = $self->proto(); return @{$self->{nodes}} if($proto == $self); my @vals = (); foreach my $n ( @{$proto->{nodes}} ) { my $val = $for->{$n->{name}}; next unless $val; push @vals, $val; push @vals, $val->nodes($for) if $val->isa('Text::Scraper::Conditional'); } return @vals; } sub proto { my $self = shift; my $attr = shift; my $proto = $protos{$self} || $self; return ($attr == undef) ? $proto : (defined $proto->{"\$$attr"}) ? $proto->{"\$$attr"} : (defined $proto->{$attr}) ? $proto->{$attr} : undef; } # # Inherits all behaviour from Text::Scraper # package Text::Scraper::Branch; our @ISA = ('Text::Scraper'); # # # package Text::Scraper::Leaf; our @ISA = ('Text::Scraper'); # # # package Text::Scraper::Conditional; our @ISA = ('Text::Scraper'); sub scrape { my ($self, $text, $parent, $scope) = @_; my $data = $self->SUPER::scrape($text, $parent, $scope); my $tag; $data = shift @$data; unless($data){ $tag = 0; } else{ $scope->{$_} = $data->{$_} foreach keys %$data; $tag = 1; } return $tag; } sub to_regex { my $self = shift; return $self->SUPER::to_regex()."?"; } package Text::Scraper::Ignorable; our @ISA = ('Text::Scraper::Leaf'); # TODO: Currently ignorables still capture their text... # which makes for a more elegent algorithm over efficiency. sub ignore { 1; } =pod =head2 Defining a custom syntax The two areas of customization are Tag Syntax and Tag Classes. The defaults are encapsulated in the I class. The interested reader is encouraged to copy the source of the default syntax class and play around with changes. All the over-ridable methods begin with B and are fairly self explanatory or well commented. Any new Tag classes should be subclassed from either I, I, I or I. =cut package Text::Scraper::Syntax; # # Map tag types to classes # sub define_class_leaves { return (var => 'Text::Scraper::Leaf', stuff => 'Text::Scraper::Ignorable'); } sub define_class_branches { return (loop => 'Text::Scraper::Branch', if => 'Text::Scraper::Conditional'); } # # Tag Syntax: # TYPE, NAME, ATTRIBUTES, BACKREF, and ESCAPE are special # markers that are substituted with either regular # expressions or values. # sub define_syntax_leaf { ''; } sub define_syntax_branch_open { ''; } sub define_syntax_branch_close { ''; } # # Escape sequences must never appear in input text # sub define_syntax_escape { "$;ESCAPE$;"; } # # BACKREF must be able to match 2 unique identifiers # in nested branch nodes, hence \2\5. If you change # the order of TYPE and NAME, this will need updated. # sub define_backref { '(?:\2|\5)'; } # # The methods below should NOT be overridden in custom Syntax subclasses # sub new { my $class = shift; my $self = bless {}, $class; my $bref = $self->define_backref(); my %tokens = (NAME => '(\w+)',TYPE => '((?:\w+|\d+))', ATTRIBUTES => '(.*?)?', ESCAPE => '(\d+?)', BACKREF => $bref ); my $tokes = join('|', keys %tokens); # Load valid types: $self->{branches} = { $self->define_class_branches() }; $self->{leaves} = { $self->define_class_leaves() }; my $syn = { leaf => $self->define_syntax_leaf(), open => $self->define_syntax_branch_open(), close => $self->define_syntax_branch_close(), escape => $self->define_syntax_escape() }; # Create regexen from syntax: # 'escape' is a special case as it is invoked as a regex AFTER # whole tmpl has been escaped - requiring double "escapation" my $rex = {}; $rex->{$_} = $self->quote($syn->{$_}) foreach keys %$syn; $rex->{escape} = $self->quote(quotemeta($syn->{escape})); # Insert token regexes into compiled regex $_ =~ s/($tokes)/$tokens{$1}/sg foreach values %$rex; $self->{syntax} = $syn; $self->{regex} = $rex; return $self; } # # Compact and escape a template # TODO: needs knowledge of preserver_whitespace options # sub quote { my $self = shift; my $tmpl = shift; $tmpl =~ s/\s+/ /sgo; $tmpl = qr/\Q$tmpl\E/; $tmpl =~ s/\\\s/\\s*/sgo; return $tmpl; } # # Create the syntax specific escaped index (CANNOT clash with template data) # sub create_escape_string { my $self = shift; my $num = shift; my $str = $self->define_syntax_escape(); $str =~ s/ESCAPE/$num/; return $str; } # # Create syntax for an internal leaf referencing an already parsed branch # sub create_internal_leaf_string { my $self = shift; my $node = shift; my $idx = shift; my $str = $self->define_syntax_leaf(); $str =~ s#TYPE#$idx#; $str =~ s#NAME#$node->{name}#; $str =~ s#ATTRIBUTES##; return $str; } =pod =head1 BUGS & CAVEATS Rather than write a slow parser in pure Perl, Text::Scraper farms a lot of the work out to Perl's optimized regular-expression engine. This works well in general but, unfortunately, doesn't allow for a lot of error feedback during scraping. A fair understanding of the pros and cons of using regular expressions in this manner can be beneficial, but is outside the scope of this documentation. L can be indespensible in following the success of your scraping. It can be safely applied to a Text::Scraper instance to analyze the parser's object model, or to the return value from a C invokation to analyze what was parsed. Bug reports and suggestions welcome. =head1 AUTHOR Copyright (C) 2005 Chris McEwan - All rights reserved. Chris McEwan =head1 LICENSE This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself. =cut 1;