package Search::Tokenizer;
use warnings;
use strict;
use Unicode::CaseFold ();

our $VERSION = '1.01';

sub new {
  my $class = shift;

  # defaults
  my $regex           = qr/\w+/;
  my $lower           = 1;
  my $filter          = undef;
  my $filter_in_place = undef;
  my $stopwords       = undef;

  # parse arguments
  unshift @_, "regex" if @_ == 1; # positional API
  while (my $arg = shift) {
    my $val = shift;
    $arg .= "=>" . (ref($val) || "NOREF");
    for ($arg) {
      /^regex=>Regexp$/         and do { $regex = $val;           last};
      /^lower=>NOREF$/          and do { $lower = !!$val;         last};
      /^filter=>CODE$/          and do { $filter = $val;          last};
      /^filter_in_place=>CODE$/ and do { $filter_in_place = $val; last};
      /^stopwords=>HASH$/       and do { $stopwords = $val;       last};
      die "Invalid option or invalid operand: $arg";
    }
  }

  # check that the regex does not match the empty string
  not "" =~ $regex
    or die "regex $regex matches the empty string: cannot tokenize";

  # return tokenizer factory: closure
  return sub { 
    my $string = shift;
    my $term_index = -1;

    # return tokenizer: additional closure on $string and $term_index
    return sub {

      # get next occurrence of $regex in $string (thanks to the /g flag)
      while ($string =~ /$regex/g) {

        # index of this term within the input string
        $term_index += 1;

        # boundaries for the match
        my ($start, $end) = ($-[0], $+[0]);

        # extract matched substring (more efficient than $&)
        my $term = substr($string, $start, $end-$start);

        # apply filtering and stopwords, if any
        $term = Unicode::CaseFold::fc($term) if $lower;
        $term = $filter->($term)             if $filter;
        $filter_in_place->($term)            if $filter_in_place;
        undef $term if defined $term and $stopwords and $stopwords->{$term};

        # if $term was not cancelled by filters above, return it
        if ($term) {
          return wantarray ? ($term, length($term), $start, $end, $term_index)
                           : $term;
        }
      } # otherwise, loop again to extract next term

      # otherwise, no more terms in the input string: return undef or empty list
      return;
    };
  };
}

sub word {
  __PACKAGE__->new(regex => qr/\w+/, @_);
}

sub word_locale {
  use locale;
  __PACKAGE__->new(regex => qr/\w+/, @_);
}

sub word_unicode {
  __PACKAGE__->new(regex => qr/\p{Word}+/, @_);
}

sub unaccent {
  require Text::Transliterator::Unaccent;
  my %args = @_;
  my $want_lower      = !exists $args{lower} || $args{lower};
  # if terms are case-folded, uppercase mappings are unnecessary
  my %unaccenter_args = $want_lower ? (upper => 0) : ();
  my $unaccenter      = Text::Transliterator::Unaccent->new(%unaccenter_args);
  __PACKAGE__->new(regex           => qr/\p{Word}+/,
                   filter_in_place => $unaccenter,
                   %args);
}


1; # End of Search::Tokenizer


__END__

=head1 NAME

Search::Tokenizer - Decompose a string into tokens (words)

=head1 SYNOPSIS

  # generic usage
  use Search::Tokenizer;
  my $tokenizer = Search::Tokenizer->new(
     regex     => qr/.../,
     filter    => sub { ... },
     stopwords => {word1 => 1, word2 => 1, ... },
     lower     => 1,
   );
  my $iterator = $tokenizer->($string);
  while (my ($term, $len, $start, $end, $index) = $iterator->()) {
    ...
  }

  # usage for DBD::SQLite (with builtin tokenizers: word, word_locale,
  #   word_unicode, unaccent)
  use Search::Tokenizer;
  $dbh->do("CREATE VIRTUAL TABLE t "
          ."  USING fts3(tokenize=perl 'Search::Tokenizer::unaccent')");


=head1 DESCRIPTION

This module builds an iterator function that will progressively
extract terms from a given input string. Terms are defined by a
regular expression (for example C<\w+>).  Term matching relies on the
builtin "global match" operator of Perl (the 'g' flag), and therefore
is quite efficient.

Before being returned to the caller, terms may be filtered by an
auxiliary function, for performing tasks such as stemming or stopword
elimination.

A tokenizer returned from the L<new|/"new"> method is a code
reference, I<not> a regular Perl object. To use the tokenizer, just
call it with a string to parse: this will return another code
reference, which works as an iterator. Each call to the iterator
will return the next term from the string, until the string is exhausted.
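
For instance, with the default options (C<\w+> and case-folding), a
complete round trip looks like this:

  use Search::Tokenizer;
  my $tokenizer = Search::Tokenizer->new;
  my $iterator  = $tokenizer->("Hello, World!");
  while (my $term = $iterator->()) {
    print "$term\n";    # prints "hello", then "world"
  }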

This API was explicitly designed for integrating Perl with the
FTS3 fulltext search engine in L<DBD::SQLite>; however, the API
is general enough to be useful for other purposes, which is why
it is published in its own, separate distribution.

=head1 METHODS

=head2 Creating a tokenizer

  my $tokenizer = Search::Tokenizer->new($regex);
  my $tokenizer = Search::Tokenizer->new(%options);

Builds a new tokenizer, returned as a code reference. 
The first syntax with a single Regexp argument is a shorthand
for C<< ->new(regex => $regex) >>. The second syntax, with
named arguments, has the following available options:

=over

=item C<< regex => $regex >>

C<$regex> is a compiled regular expression that
specifies how to match a term; that regular expression should I<not>
match the empty string (otherwise the tokenizer would enter an
infinite loop). The default is C<qr/\w+/>. Here are some examples of more
advanced regexes:

  # take 'locale' into account
  $regex = do {use locale; qr/\w+/}; 

  # rely on Unicode's definition of "word characters"
  $regex = qr/\p{Word}+/;

  # words like "don't", "it's" are treated as a single term
  $regex = qr/\w+(?:'\w+)?/;

  # same thing but also with internal hyphens like "fox-trot"
  $regex = qr/\w+(?:[-']\w+)?/;

=item C<< lower => $bool >>

If true, the term returned by the C<$regex> is 
converted to lowercase (or more precisely: is 
"case-folded" through L<Unicode::CaseFold/fc>).
This option is activated by default.
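
For example, to preserve the original case of terms:

  my $tokenizer = Search::Tokenizer->new(lower => 0);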

=item C<< filter => $filter >>

C<$filter> is a reference to a function that may modify or cancel
a term before it is returned to the caller. The filter takes a
single argument (the term) and returns a scalar (the modified term).
If the filter returns an empty or false value, the term is canceled.
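
For example, a minimal filter that cancels terms shorter than
3 characters:

  my $tokenizer = Search::Tokenizer->new(
    filter => sub { length($_[0]) >= 3 ? $_[0] : undef },
  );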

=item C<< filter_in_place => $filter >>

Like C<filter>, except that the filtering function directly
modifies the term in its C<< $_[0] >> argument instead of returning
a new term. This is useful for example when building a filter
from L<Lingua::Stem::Snowball|Lingua::Stem::Snowball>
or from L<Text::Transliterator::Unaccent|Text::Transliterator::Unaccent>.
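
As a self-contained sketch, here is an in-place filter that strips a
possessive ending directly from C<< $_[0] >>, together with a regex
that keeps the apostrophe inside the term:

  my $tokenizer = Search::Tokenizer->new(
    regex           => qr/\w+(?:'\w+)?/,
    filter_in_place => sub { $_[0] =~ s/'s\z// },  # "john's" becomes "john"
  );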

=item C<< stopwords => $hashref >>

The keys in C<$hashref> are terms to cancel (usually: common terms
for which indexing would consume lots of resources with little 
added value). Values in the hash should evaluate to true.
Lists of stopwords for various languages may be found in
the L<Lingua::StopWords|Lingua::StopWords> module.
Stopwords filtering is applied after the C<filter> or
C<filter_in_place> function (if any).

=back

Whenever a term is canceled through the filter or stopwords options,
the tokenizer does not return that term to the client, but nevertheless
remembers the canceled position: so, for example, when tokenizing
"Once upon a time" with

 $tokenizer = Search::Tokenizer->new(
    stopwords => Lingua::StopWords::getStopWords('en')
 );

we get the term sequence

  ("upon", 4,  5,  9, 1)
  ("time", 4, 12, 16, 3)

where terms "once" and "a" in positions 0 and 2 have been canceled.

=head2 Creating an iterator

  my $iterator = $tokenizer->($text);

  # loop over terms ..
  while (my $term = $iterator->()) { 
    work_with_term($term); 
  }

  # .. or loop over terms with detailed information
  while (my @term_details = $iterator->()) { 
    work_with_details(@term_details); # ($term, $len, $start, $end, $index) 
  }

The tokenizer takes one string argument and returns an iterator.  The
iterator takes no argument; each call returns the next term from the
string, until the string is exhausted, at which point the iterator
returns an empty result.

If called in a scalar context, the iterator returns just a string; if
called in a list context, it returns a tuple composed of

=over

=item $term

the term (after filtering)

=item $len

the term length

=item $start

the starting offset in the string where this term was found

=item $end

the end offset (where the search for the next term will start)

=item $index

the index of this term within the string, starting at 0

=back

Length and start/end offsets are computed in characters, not in bytes
(note for SQLite users: the C layer in SQLite needs byte values, but
the conversion will be automatically taken care of by the C
implementation in L<DBD::SQLite>).

Beware that ($end - $start) is the length of the original
term extracted by the regex, while $len is the length
of the final $term, after filtering; both
may differ, especially if stemming is being applied.
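
For example, with a crude suffix-stripping filter (a stand-in for a
real stemmer):

  my $tokenizer = Search::Tokenizer->new(
    filter => sub { my $t = shift; $t =~ s/ing\z//; $t },
  );
  my $iterator = $tokenizer->("running");
  my ($term, $len, $start, $end, $index) = $iterator->();
  # $term is "runn", $len is 4, but $end - $start is 7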

=head1 BUILTIN TOKENIZERS

For convenience, the following tokenizers are builtin:

=over

=item C<Search::Tokenizer::word>

Terms are "words" according to Perl's notion of C<\w+>.

=item C<Search::Tokenizer::word_locale>

Terms are "words" according to Perl's notion of C<\w+>
under C<use locale>.

=item C<Search::Tokenizer::word_unicode>

Terms are "words" according to Unicode's notion of 
C<\p{Word}+>.

=item C<Search::Tokenizer::unaccent>

Like C<Search::Tokenizer::word_unicode>, but filtered
through L<Text::Transliterator::Unaccent|Text::Transliterator::Unaccent>
to replace all accented characters with their base characters.

=back


These builtin tokenizers may take the same arguments
as C<new()>: for example

  use Search::Tokenizer;
  my $tokenizer = Search::Tokenizer::unaccent(lower => 0, stopwords => ...);

=head1 SEE ALSO

=over

=item *

Other tokenizers on CPAN:
L<KinoSearch::Analysis::Tokenizer|KinoSearch::Analysis::Tokenizer>
and
L<Search::Tools::Tokenizer|Search::Tools::Tokenizer>.

=item * 

Stopwords:
L<Lingua::StopWords|Lingua::StopWords>

=item *

Stemming:
L<Lingua::Stem::Snowball|Lingua::Stem::Snowball>

=item *

Removing accented characters:
L<Text::Transliterator::Unaccent|Text::Transliterator::Unaccent>

=back


=head1 AUTHOR

Laurent Dami, C<< <lau.....da..@justice.ge.ch> >>

=head1 BUGS

Please report any bugs or feature requests to C<bug-search-tokenizer
at rt.cpan.org>, or through the web interface at
L<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Search-Tokenizer>.  I
will be notified, and then you'll automatically be notified of
progress on your bug as I make changes.


=head1 SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Search::Tokenizer


You can also look for information at:

=over 4

=item * RT: CPAN's request tracker

L<http://rt.cpan.org/NoAuth/Bugs.html?Dist=Search-Tokenizer>

=item * AnnoCPAN: Annotated CPAN documentation

L<http://annocpan.org/dist/Search-Tokenizer>

=item * CPAN Ratings

L<http://cpanratings.perl.org/d/Search-Tokenizer>

=item * Search CPAN

L<http://search.cpan.org/dist/Search-Tokenizer/>

=back


=head1 LICENSE AND COPYRIGHT

Copyright 2010 Laurent Dami.

This program is free software; you can redistribute it and/or modify it
under the terms of either: the GNU General Public License as published
by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.


=cut