NAME
Text::Ngram - Basis for n-gram analysis
SYNOPSIS
use Text::Ngram qw(ngram_counts add_to_counts);
my $text = "abcdefghijklmnop";
my $hash_r = ngram_counts($text, 3); # Window size = 3
# $hash_r => { abc => 1, bcd => 1, ... }
add_to_counts($more_text, 3, $hash_r);
DESCRIPTION
N-gram analysis is a technique in textual analysis which slides a
fixed-size window over the characters of a text and counts the resulting
sequences, in order to aid topic analysis, language determination and
so on. The n-gram spectrum of a document can be used
to compare and filter documents in multiple languages, prepare word
prediction networks, and perform spelling correction.
The neat thing about n-grams, though, is that they're really easy to
determine. For n=3, for instance, we compute the n-gram counts like so:
    the cat sat on the mat
    ---                     $counts{"the"}++;
     ---                    $counts{"he "}++;
      ---                   $counts{"e c"}++;
       ...
This module provides an efficient XS-based implementation of n-gram
spectrum analysis.
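For comparison, the sliding window described above can be sketched in a
few lines of pure Perl. This is only an illustration of the idea, not the
module's XS code: the name "naive_ngram_counts" is hypothetical, and the
sketch does none of the text preparation described later in this document.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Ngram: count every window of
# $n consecutive characters in $text, with no text preparation at all.
sub naive_ngram_counts {
    my ($text, $n) = @_;
    my %counts;
    # The last full window starts at offset length($text) - $n; if the
    # text is shorter than $n the range is empty and nothing is counted.
    for my $i (0 .. length($text) - $n) {
        $counts{ substr($text, $i, $n) }++;
    }
    return \%counts;
}

my $counts = naive_ngram_counts("the cat sat on the mat", 3);
# "the" occurs twice in the sample sentence, so its count is 2.
printf "%-5s %d\n", "'$_'", $counts->{$_} for sort keys %$counts;
```

A 22-character string yields 20 windows of size 3, so the counts in the
hash sum to 20.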
There are two functions which can be imported:
$href = ngram_counts($text[, $window]);
This first function returns a hash reference with the n-gram histogram
of the text for the given window size. If the window size is omitted,
then 5-grams are used, which seems to be a fairly standard choice.
add_to_counts($more_text, $window, $href);
This adds the n-gram counts of $more_text to the supplied hash, so a
histogram can be built up incrementally; if $window is zero or
undefined, then the window size is computed from the hash keys.
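The incremental behaviour can be pictured with the following pure-Perl
sketch. It is an assumption-laden stand-in, not the module's code: the name
"add_to_counts_sketch" is hypothetical, the window is inferred by taking the
length of an arbitrary existing key (assuming all keys share one length),
and dying on an empty hash is a choice made for this sketch only.

```perl
use strict;
use warnings;

# Hypothetical pure-Perl stand-in for add_to_counts; it performs no text
# preparation and only illustrates the "infer the window" rule.
sub add_to_counts_sketch {
    my ($text, $window, $counts) = @_;
    if (!$window) {
        # Zero or undefined window: take the length of any existing key.
        # Assumption: every key in the hash has the same length.
        my ($any_key) = keys %$counts;
        die "cannot infer window size from an empty hash\n"
            unless defined $any_key;
        $window = length $any_key;
    }
    # Add the new text's n-grams to the existing counts in place.
    for my $i (0 .. length($text) - $window) {
        $counts->{ substr($text, $i, $window) }++;
    }
    return $counts;
}

my %counts = (abc => 1, bcd => 1);
add_to_counts_sketch("abcd", undef, \%counts);   # window inferred as 3
print "$_ => $counts{$_}\n" for sort keys %counts;
```

After the call both "abc" and "bcd" have a count of 2, since the new text
contributes one occurrence of each.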
Important note on text preparation
Most of the published algorithms for textual n-gram analysis assume that
the only characters you're interested in are alphabetic characters and
spaces. So before the text is counted, the following preparation is
made.
All characters are lowercased (most papers use upper-casing, but that
just feels so 1970s); punctuation and numerals are replaced by stop
characters flanked by blanks; and multiple spaces are compressed into a
single space.
After the counts are made, n-grams containing stop characters are
dropped from the hash.
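The preparation steps above might look roughly like this in pure Perl. This
is a sketch under stated assumptions rather than the module's actual
routine: the helper names are hypothetical, "\xff" is an assumed stand-in
for the stop character, and only ASCII letters a-z are treated as
alphabetic.

```perl
use strict;
use warnings;

# Assumption: "\xff" stands in for the stop character; this document
# does not say which character the module really uses.
my $STOP = "\xff";

# Hypothetical helper: lowercase, mark non-letters, compress spaces.
sub prepare_text_sketch {
    my ($text) = @_;
    $text = lc $text;
    # Anything that is not a-z or whitespace (punctuation, numerals, and
    # in this ASCII-only sketch any other character) becomes a stop
    # character flanked by blanks.
    $text =~ s/[^a-z\s]/ $STOP /g;
    # Compress runs of whitespace into a single space.
    $text =~ s/\s+/ /g;
    return $text;
}

# Hypothetical helper: count n-grams of the prepared text, then drop
# every n-gram that contains the stop character.
sub prepared_ngram_counts {
    my ($text, $n) = @_;
    my $clean = prepare_text_sketch($text);
    my %counts;
    for my $i (0 .. length($clean) - $n) {
        $counts{ substr($clean, $i, $n) }++;
    }
    delete @counts{ grep { index($_, $STOP) >= 0 } keys %counts };
    return \%counts;
}

my $counts = prepared_ngram_counts("The cat, the CAT!", 3);
print "'$_' => $counts->{$_}\n" for sort keys %$counts;
```

Because the comma is turned into a stop character, no surviving trigram
spans the boundary between the two clauses, while "the" and "cat" are each
counted twice regardless of their original case.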
If you prefer to do your own text preparation, use the internal routines
"process_text" and "process_text_incrementally" instead of
"ngram_counts" and "add_to_counts" respectively.
SEE ALSO
Cavnar, W. B. (1993). N-gram-based text filtering for TREC-2. In D.
Harman (Ed.), *Proceedings of TREC-2: Text Retrieval Conference 2*.
Washington, DC: National Bureau of Standards.
Shannon, C. E. (1951). Prediction and entropy of printed English. *The
Bell System Technical Journal, 30*. 50-64.
Ullmann, J. R. (1977). A binary n-gram technique for automatic correction
of substitution, deletion, insertion and reversal errors in words.
*Computer Journal, 20*. 141-147.
SUPPORT
Beep... beep... this is a recorded announcement:
I've released this software because I find it useful, and I hope you
might too. But I am a being of finite time and I'd like to spend more of
it writing cool modules like this and less of it answering email, so
please excuse me if the support isn't as great as you'd like.
Nevertheless, there is a general discussion list for users of all my
modules, to be found at
http://lists.netthink.co.uk/listinfo/module-mayhem
If you have a problem with this module, someone there will probably have
it too.
AUTHOR
Simon Cozens, "simon@cpan.org"
COPYRIGHT AND LICENSE
Copyright 2003 by Simon Cozens
This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.