=head1 NAME
Text::Document - a text document subject to statistical analysis
=head1 SYNOPSIS
my $t = Text::Document->new();
$t->AddContent( 'foo bar baz' );
$t->AddContent( 'foo barbaz; ' );
my @freqList = $t->KeywordFrequency();
my $u = Text::Document->new();
...
my $sj = $t->JaccardSimilarity( $u );
my $sc = $t->CosineSimilarity( $u );
my $wsc = $t->WeightedCosineSimilarity( $u, \&MyWeight, $rock );
=head1 DESCRIPTION
C<Text::Document> lets you perform simple
Information-Retrieval-oriented statistics on pure-text documents.
Text can be added in chunks, so that the document may be
incrementally built, for instance by a class like
C.
A simple algorithm splits the text into terms; the algorithm
may be changed by subclassing and overriding C<ScanV>.
The C<KeywordFrequency> method computes term frequencies
over the whole document.
=head1 FORESEEN REUSE
The package may be (re)used either by simple instantiation,
or by subclassing (defining a descendant package). In the
latter case the methods intended to be overridden are
those ending with a C<V> suffix. Overriding other methods
requires greater care.
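As a rough illustration of the subclassing route, the sketch below uses a small stand-in base class instead of Text::Document itself, so that it runs without the module installed. The package names C<Sketch::Document> and C<Sketch::HyphenDocument>, the splitting rules, and the internals of the stand-in are all assumptions; only the idea that term splitting is delegated to an overridable C<ScanV> comes from this documentation.

```perl
use strict;
use warnings;

# Stand-in base class (NOT the real Text::Document): it only mimics the
# idea that AddContent hands term splitting to an overridable ScanV.
package Sketch::Document;

sub new {
    my ($class) = @_;
    return bless { terms => {} }, $class;
}

sub AddContent {
    my ($self, $text) = @_;
    $self->{terms}{$_}++ for $self->ScanV($text);
    return;
}

# Assumed default: lowercase, then split on runs of non-alphanumerics.
sub ScanV {
    my ($self, $text) = @_;
    return grep { length } split /[^a-z0-9]+/, lc $text;
}

sub Occurrences {
    my ($self, $term) = @_;
    return $self->{terms}{$term} // 0;
}

# Hypothetical descendant: keeps hyphenated words together.
package Sketch::HyphenDocument;
our @ISA = ('Sketch::Document');

sub ScanV {
    my ($self, $text) = @_;
    return grep { length } split /[^a-z0-9-]+/, lc $text;
}

package main;

my $plain  = Sketch::Document->new();
my $hyphen = Sketch::HyphenDocument->new();
$_->AddContent('state-of-the-art text') for $plain, $hyphen;

my $plain_count  = $plain->Occurrences('state');               # 1
my $hyphen_count = $hyphen->Occurrences('state-of-the-art');   # 1
```

Only C<ScanV> differs between the two packages; counting and lookup are inherited unchanged, which is the kind of reuse the C<V> suffix is meant to signal.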
=head1 CLASS METHODS
=head2 new
The constructor. The optional arguments are given in
I<(key,value)> form and specify whether
all keywords are transformed to lowercase (the default) and
whether the string representation (see C<WriteToString>)
is compressed (the default).
my $d = Text::Document->new();
my $dNotCompressed = Text::Document->new( compressed => 0 );
my $dPreserveCase = Text::Document->new( lowercase => 0 );
=head2 NewFromString
Take a string written by C<WriteToString> (see below)
and create a new C<Text::Document> with the same contents;
call C<die> whenever the restore is impossible or ill-advised,
for instance when the current version of the package is different
from the original one, or the compression library is unavailable.
my $b = Text::Document::NewFromString( $str );
The return value is a blessed reference; put another way,
this is an alternative constructor.
The string should have been written by C<WriteToString>;
you may of course tweak the string contents, but
at that point you are entirely on your own.
=head1 INSTANCE METHODS
=head2 AddContent
Used as
$d->AddContent( 'foo bar baz foo9' );
$d->AddContent( 'mary had a little lamb' );
Successive calls accumulate content; there is currently no way
of resetting the content to zero.
=head2 Terms
Returns a list of all distinct terms in the document, in no
particular order.
=head2 Occurrences
Returns the number of occurrences of a given term.
$d->AddContent( 'foo baz bar foo foo');
my $n = $d->Occurrences( 'foo' ); # now $n is 3
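The way counts accumulate over successive chunks can be pictured with a plain hash. This is only a sketch: C<add_chunk> is a hypothetical helper, and splitting on whitespace is an assumption made for brevity rather than the module's actual scanning rule.

```perl
use strict;
use warnings;

# Hypothetical helper: add the terms of one chunk to a running counter.
# Splitting on whitespace is an assumption made for brevity.
sub add_chunk {
    my ($counts, $text) = @_;
    $counts->{$_}++ for grep { length } split /\s+/, $text;
    return $counts;
}

my %counts;
add_chunk(\%counts, 'foo baz bar foo foo');
add_chunk(\%counts, 'mary had a little lamb foo');

my @terms = keys %counts;    # distinct terms, in no particular order
my $n     = $counts{foo};    # 4: three from the first chunk, one from the second
```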
=head2 ScanV
Scan a string and return a list of terms.
Called internally as:
my @terms = $self->ScanV( $text );
=head2 KeywordFrequency
Returns a reference to a list of I<[term,frequency]> pairs, sorted by
ascending frequency.
my $listRef = $d->KeywordFrequency();
foreach my $pair (@{$listRef}){
my ($term,$frequency) = @{$pair};
...
}
The frequency of occurrence of each term in the document is computed,
the pairs are sorted by ascending frequency,
and the resulting list is returned to the caller.
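The ordering can be sketched with ordinary Perl data structures. The counters below are made up, raw counts stand in for frequencies, and the alphabetical tie-break is an assumption added to keep the output stable; none of this is taken from the module's source.

```perl
use strict;
use warnings;

# Hypothetical term counters, as they might look after adding some text.
my %counts = ( foo => 3, bar => 1, baz => 2 );

# Build [term, count] pairs and sort them by ascending count;
# ties are broken alphabetically (an assumption, for a stable order).
my @pairs = sort { $a->[1] <=> $b->[1] or $a->[0] cmp $b->[0] }
            map  { [ $_, $counts{$_} ] } keys %counts;

for my $pair (@pairs) {
    my ($term, $count) = @{$pair};
    print "$term: $count\n";    # bar: 1, then baz: 2, then foo: 3
}
```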
=head2 WriteToString
Convert the document (actually, some parameters
and the term counters) into a string which can be saved and
later restored with C<NewFromString>.
my $str = $d->WriteToString();
The string begins with a header which encodes the
originating package, its version, and the parameters
of the current instance.
Whenever the compression library is available, it is used to
compress the bit vector in the most efficient way.
On systems without it, the bit string is
saved uncompressed.
=head2 JaccardSimilarity
Compute the Jaccard measure of document similarity, which is defined
as follows: given two documents I<D> and I<E>, let I<Ds> and I<Es> be the sets
of terms occurring in I<D> and I<E>, respectively. Define I<S> as the
intersection of I<Ds> and I<Es>, and I<T> as their union. Then
the Jaccard similarity is the number of elements
of I<S> divided by the number of elements of I<T>.
It is called as follows:
my $sim = $d->JaccardSimilarity( $e );
If neither document has any terms, the result is C<undef> (a rare occurrence).
Otherwise the similarity is a real number between 0.0 (no terms in common)
and 1.0 (all terms in common).
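The definition translates almost directly into set operations on hash keys. The following is a minimal sketch, not the module's implementation: C<jaccard> is a hypothetical helper working on plain term-count hashes.

```perl
use strict;
use warnings;

# Hypothetical helper: Jaccard similarity of two term-count hashes.
# Only the keys (the distinct terms) matter, not the counts.
sub jaccard {
    my ($d, $e) = @_;
    my %union = map { $_ => 1 } keys %{$d}, keys %{$e};
    return undef unless %union;    # neither document has any terms
    my $common = grep { exists $e->{$_} } keys %{$d};
    return $common / scalar( keys %union );
}

my %d = ( foo => 2, bar => 1, baz => 1 );
my %e = ( foo => 1, bar => 3, qux => 1 );

my $sim = jaccard(\%d, \%e);    # 2 common terms / 4 distinct terms = 0.5
```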
=head2 CosineSimilarity
Compute the cosine similarity between two documents I<D> and
I<E>.
Let I<Ds> and I<Es> be the sets
of terms occurring in I<D> and I<E>, respectively. Define I<T> as the
union of I<Ds> and I<Es>, and let I<ti> be the I<i>-th element of I<T>.
Then the term vectors of I<D> and I<E> are
Dv = (nD(t1), nD(t2), ..., nD(tN))
Ev = (nE(t1), nE(t2), ..., nE(tN))
where nD(ti) is the number of occurrences of term ti in I<D>,
and nE(ti) the same for I<E>.
We can now define the cosine similarity I<CS>:
CS = (Dv,Ev) / (Norm(Dv)*Norm(Ev))
Here (... , ...) is the scalar product and Norm is the Euclidean
norm (square root of the sum of squares).
C<CosineSimilarity> is called as
$sim = $d->CosineSimilarity( $e );
The result is C<undef> if either I<D> or I<E> has no occurrences of any term.
Otherwise, it is a number between 0.0 and 1.0: since term occurrences
are never negative, the cosine is never negative either.
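The formula can be checked by hand on small term-count hashes. The sketch below is an illustration only: C<cosine> is a hypothetical helper, not the module's code, and the two documents are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: cosine similarity of two term-count hashes.
sub cosine {
    my ($d, $e) = @_;
    my ($dot, $norm_d, $norm_e) = (0, 0, 0);
    $norm_d += $_ ** 2 for values %{$d};
    $norm_e += $_ ** 2 for values %{$e};
    # Terms missing from either document contribute 0 to the scalar product.
    $dot += $d->{$_} * ( $e->{$_} // 0 ) for keys %{$d};
    return undef if $norm_d == 0 or $norm_e == 0;
    return $dot / sqrt( $norm_d * $norm_e );
}

my %d = ( foo => 3, bar => 4 );    # Norm = sqrt(9 + 16) = 5
my %e = ( foo => 3, baz => 4 );    # Norm = sqrt(9 + 16) = 5

my $sim = cosine(\%d, \%e);        # scalar product 9, so 9 / 25 = 0.36
```

Only the shared term C<foo> contributes to the scalar product, while every term contributes to the norms.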
=head2 WeightedCosineSimilarity
Compute the weighted cosine similarity between two documents I<D> and
I<E>.
In the setting of C<CosineSimilarity>, the
term vectors of I<D> and I<E> are
Dv = (nD(t1)*w1, nD(t2)*w2, ..., nD(tN)*wN)
Ev = (nE(t1)*w1, nE(t2)*w2, ..., nE(tN)*wN)
The weights are nonnegative real values; each term has an associated
weight. For generality, the weights are supplied through
a function, like:
my $wcs = $d->WeightedCosineSimilarity(
$e,
\&function,
$rock
);
The C<function> will be called as follows:
my $weight = function( $rock, 'foo' );
C<$rock> is a 'constant' object used for passing a I<context>
to the function.
For instance, a common way of defining weights is the IDF (inverse
document frequency), which is defined in L<Text::DocumentCollection>.
In this context, you can weigh terms with their IDF as
follows:
$sim = $c->WeightedCosineSimilarity(
$d,
\&Text::DocumentCollection::IDF,
$collection
);
C<WeightedCosineSimilarity> will call
$collection->IDF( 'foo' );
which is what we expect.
Actually, we should return the square root of IDF, but this
detail is not necessary here.
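The callback convention can be sketched without the module. Everything here is illustrative: C<weighted_cosine> is a hypothetical helper, the rock is a plain hash of made-up weights, and returning C<undef> for a zero norm is an assumption carried over from the unweighted case.

```perl
use strict;
use warnings;

# Hypothetical helper: cosine similarity with per-term weights.
# The weight function is called as $weight_fn->($rock, $term),
# mirroring the calling convention described above.
sub weighted_cosine {
    my ($d, $e, $weight_fn, $rock) = @_;
    my %terms = map { $_ => 1 } keys %{$d}, keys %{$e};
    my ($dot, $norm_d, $norm_e) = (0, 0, 0);
    for my $term (keys %terms) {
        my $w  = $weight_fn->($rock, $term);
        my $wd = ( $d->{$term} // 0 ) * $w;
        my $we = ( $e->{$term} // 0 ) * $w;
        $dot    += $wd * $we;
        $norm_d += $wd ** 2;
        $norm_e += $we ** 2;
    }
    # Assumption: behave like the unweighted case when a norm is zero.
    return undef if $norm_d == 0 or $norm_e == 0;
    return $dot / sqrt( $norm_d * $norm_e );
}

# Here the rock is a plain hash of made-up weights; unknown terms weigh 1.
my %weights = ( foo => 2, bar => 0 );
my $lookup  = sub { my ($rock, $term) = @_; return $rock->{$term} // 1 };

my %d = ( foo => 1, bar => 5 );
my %e = ( foo => 3, bar => 1 );

# With 'bar' weighted 0 the vectors become (2, 0) and (6, 0): similarity 1.
my $sim = weighted_cosine(\%d, \%e, $lookup, \%weights);
```

Passing the rock explicitly keeps the weight function free of global state, which is why a method such as C<IDF> can be handed over as a plain code reference with its object as the rock.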
=head1 AUTHORS
spinellia@acm.org (Andrea Spinelli)
walter@humans.net (Walter Vannini)
=head1 HISTORY
2001-11-02 - initial revision
2001-11-20 - added WeightedCosineSimilarity, suggested by JP Mc Gowan
=head1 DISCARDED CHOICES
We did not use C, because we wanted to fine-tune
compression and version compatibility. However, this
choice may be easily reversed by redefining C<WriteToString> and
C<NewFromString>.