The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.
=head1 NAME

  Text::Document - a text document subject to statistical analysis

=head1 SYNOPSIS

  my $t = Text::Document->new();
  $t->AddContent( 'foo bar baz' );
  $t->AddContent( 'foo barbaz; ' );

  my @freqList = $t->KeywordFrequency();
  my $u = Text::Document->new();
  ...
  my $sj = $t->JaccardSimilarity( $u );
  my $sc = $t->CosineSimilarity( $u );
  my $wsc = $t->WeightedCosineSimilarity( $u, \&MyWeight, $rock );


=head1 DESCRIPTION

C<Text::Document> allows to perform simple
Information-Retrieval-oriented statistics on pure-text documents.

Text can be added in chunks, so that the document may be
incrementally built, for instance by a class like
C<HTML::Parser>.

A simple algorithm splits the text into terms; the algorithm
may be redefined by subclassing and redefining C<ScanV>.

The C<KeywordFrequency> function computes term frequency
over the whole document.

=head1 FORESEEN REUSE

The package may be {re}used either by simple instantiation,
or by subclassing (defining a descendant package).  In the
latter case the methods which are foreseen to be redefined are
those ending with a C<V> suffix.  Redefining other methods
will require greater attention.

=head1 CLASS METHODS

=head2 new

The creator method.  The optional arguments are in the
I<(key,value)> form and allow to specify whether
all keywords are trasformed to lowercase (default) and
whether the string representation (C<WriteToString>)
will be compressed (default).

  my $d = Text::Document->new();
  my $dNotCompressed = Text::Document( compressed => 0 );
  my $dPreserveCase = Text::Document( lowercase => 0 );

=head2 NewFromString

Take a string written by C<WriteToString> (see below)
and create a new C<Text::Document> with the same contents;
call C<die> whenever the restore is impossible or ill-advised,
for instance when the current version of the package is different
from the original one, or the compression library in unavailable.

  my $b = Text::Document::NewFromString( $str );

The return value is a blessed reference; put in another way,
this is an alternative contructor.

The string should have been written by C<WriteToString>; 
you may of course tweak the string contents, but
at this point you're entirely on you own.

=head1 INSTANCE METHODS

=head2 AddContent

Used as

  $d->AddContent( 'foo bar baz foo9' );
  $d->AddContent( 'mary had a little lamb' );

Successive calls accumulate content; there is currently no way
of resetting the content to zero.

=head2 Terms

Returns a list of all distinct terms in the document, in no
particular order.

=head2 Occurrences

Returns the number of occurrences of a given term.

  $d->AddContent( 'foo baz bar foo foo');
  my $n = $d->Occurrences( 'foo' ); # now $n is 3

=head2 ScanV

Scan a string and return a list of terms.

Called internally as:

  my @terms = $self->ScanV( $text );

=head2 KeywordFrequency

Returns a reference list of pairs I<[term,frequency]>, sorted by
ascending frequency.

  my $listRef = $d->KeywordFrequency();
  foreach my $pair (@{$listRef}){
  	my ($term,$frequency) = @{$pair};
	...
  }

Terms in the document are sampled and their frequencies of occurrency
are sorted in ascending order;
finally, the list is returned to the user.

=head2 WriteToString

Convert the document (actually, some parameters
and the term counters) into a string which can be saved and
later restored with C<NewFromString>.

  my $str = $d->WriteToString();

The string begins with a header which encodes the
originating package, its version, the parameters
of the current instance.

Whenever possible, C<Compress::Zlib> is used in order to
compress the bit vector in the most efficient way.
On systems without C<Compress::Zlib>, the bit string is
saved uncompressed.

=head2 JaccardSimilarity

Compute the Jaccard measure of document similarity, which is defined
as follows: given two documents I<D> and I<E>, let I<Ds> and I<Es> be the set
of terms occurring in I<D> and  I<E>, respectively. Define I<S> as the
intersection of I<Ds> and I<Es>, and I<T> as their union. Then
the Jaccerd  similarity is the the number of  elements
of I<S> divided by the number of elements of I<T>.

It is called as follows:

  my $sim = $d->JaccardSimilarity( $e );

If neither document has any terms the result is undef (a rare evenience).
Otherwise the similarity is a real number between 0.0 (no terms in common)
and 1.0 (all terms in common).

=head2 CosineSimilarity

Compute the cosine similarity between two documents I<D> and
I<E>.

Let I<Ds> and I<Es> be the set
of terms occurring in I<D> and  I<E>, respectively. Define I<T> as the
union of I<Ds> and I<Es>, and let I<ti> be the I<i>-th element of I<T>.

Then the term vectors of I<D> and  I<E> are

  Dv = (nD(t1), nD(t2), ..., nD(tN))
  Ev = (nE(t1), nE(t2), ..., nE(tN))

where nD(ti) is the  number of occurrences of term ti in I<D>,
and nE(ti) the same for I<E>.

Now we are at last ready to define the cosine similarity I<CS>:

  CS = (Dv,Ev) / (Norm(Dv)*Norm(Ev))

Here (... , ...) is the scalar product and Norm is the Euclidean
norm (square root of the sum of squares).

C<CosineSimilarity> is called as

   $sim = $d->CosineSimilarity( $e );

It is C<undef> if either I<D> or I<E> have no occurrence of any term.
Otherwise, it is a number between 0.0 and 1.0. Since term occurrences
are always non-negative, the cosine is obviously always non-negative.

=head2 WeightedCosineSimilarity

Compute the weighted cosine similarity between two documents I<D> and
I<E>.

In the setting of C<CosineSimilarity>, the 
term vectors of I<D> and  I<E> are

  Dv = (nD(t1)*w1, nD(t2)*w2, ..., nD(tN)*wN)
  Ev = (nE(t1)*w1, nE(t2)*w2, ..., nE(tN)*wN)

The weights are nonnegative real values; each term has associated
a weight. To achieve generality, weights may be defined
using a function, like:

  my $wcs = $d->WeightedCosineSimilarity(
  	$e,
	\&function,
	$rock
  );

The C<function> will be called as follows:

  my $weight = function( $rock, 'foo' );

C<$rock> is a 'constant' object used for passing a I<context>
to the function.

For instance, a common way of defining weights is the IDF (inverse
document frequency), which is defined in L<Text::DocumentCollection>.
In this context, you can weigh terms with their IDF as
follows:

  $sim = $c->WeightedCosineSimilarity(
  	$d,
	\&Text::DocumentCollection::IDF,
	$collection
  );

C<WeightedCosineSimilarity> will call

  $collection->IDF( 'foo' );

which is what we expect.

Actually, we should return the square root of IDF, but this
detail is not necessary here.

=head1 AUTHORS

  spinellia@acm.org (Andrea Spinelli)
  walter@humans.net (Walter Vannini)

=head1 HISTORY

  2001-11-02 - initial revision

  2001-11-20 - added WeightedCosineSimilarity, suggested by JP Mc Gowan <jp.mcgowan@ucd.ie>

=head DISCARDED CHOICES

We did not use C<Storable>, because we wanted to fine-tune
compression and version compatibility.  However, this
choice may be easily reversed redefining WriteToString and
NewFromString.