package Text::SpeedyFx; # ABSTRACT: tokenize/hash large amount of strings efficiently use strict; use utf8; use warnings; use base q(Exporter); our $VERSION = '0.010'; # VERSION require XSLoader; XSLoader::load('Text::SpeedyFx', $VERSION); 1; __END__ =pod =encoding utf8 =head1 NAME Text::SpeedyFx - tokenize/hash large amount of strings efficiently =head1 VERSION version 0.010 =head1 SYNOPSIS use Data::Dumper; use Text::SpeedyFx; my $sfx = Text::SpeedyFx->new; my $words_bag = $sfx->hash('To be or not to be?'); print Dumper $words_bag; #$VAR1 = { # '1422534433' => '1', # '4120516737' => '2', # '1439817409' => '2', # '3087870273' => '1' # }; my $feature_vector = $sfx->hash_fv("thats the question", 8); print unpack('b*', $feature_vector); # 01001000 =head1 DESCRIPTION XS implementation of a very fast combined parser/hasher which works well on a variety of I problems. L is in Java and was adapted for a better Unicode compliance. =head1 METHODS =head2 new([$seed, $bits]) Initialize parser/hasher, can be customized with the options: =over 4 =item C<$seed> Hash seed (default: 1). =item C<$bits> How many bits do represent one character. The default value, 8, sacrifices Unicode handling but is fast and low on memory footprint. The value of 18 encompasses I, I and I planes. See also L =back =head2 hash($octets) Parses C<$octets> and returns a hash reference (not exactly; see L) where keys are the hashed tokens and values are their respective count. C<$octets> are assumed to represent UTF-8 string unless L is instantiated with L == 8 (which forces Latin-1 mode, see L). Note that this is the slowest form due to the (computational) complexity of the L data structure itself: C/C variants are up to 260% faster! =head2 hash_fv($octets, $n) Parses C<$octets> and returns a feature vector (string of bits) with length C<$n>. C<$n> is supposed to be a multiplier of 8, as the length of the resulting feature vector is C. See the included utilities L and L. =head2 hash_min($octets) Parses C<$octets> and returns the hash with the lowest value. Useful in L implementation. See also the included L utility. =head1 UNICODE SUPPORT Due to the nature of Perl, Unicode support is handled differently from the original implementation. By default, L recognizes UTF-8 encoded code points in the range I<00000-2FFFF>: =over 4 =item * B, the B (BMP, I<0000–FFFF>) =item * B, the B (SMP, I<10000–1FFFF>) =item * B, the B (SIP, I<20000–2FFFF>) =item * There are planes up to 16; however, as in Perl v5.16.2, there are no code points matching C there (so it's irrelevant for proper algorithm operation). =back Although, there is a major drawback: in this mode, B allocates up to 1 MB of memory. If the application doesn't need to support code points beyond the B (like the original SpeedyFx implementation) it is possible to constraint the address space to 16 bits, which lowers memory allocation to up to 256 KB. In fact, L constructor accepts bit range between 8 and 18 to address code points. =head2 LATIN-1 SUPPORT 8 bit address space has one special meaning: it completely disables multibyte support. In 8 bit mode, each instance will only allocate 256 bytes and hashing will run up to 340% faster! Tokenization will fallback to I character definitions. =head1 BENCHMARK The test platform configuration: =over 4 =item * Intel® Xeon® E5620 CPU @ 2.40GHz (similar to the one cited in the reference paper); =item * Debian 6.0.6 (Squeeze, 64-bit); =item * Perl v5.16.1 (installed via L); =item * F from the L. =back Rate murmur_utf8 hash_utf8 hash_min_utf8 hash hash_fv hash_min murmur_utf8 6 MB/s -- -79% -86% -89% -97% -97% hash_utf8 30 MB/s 376% -- -35% -47% -84% -85% hash_min_utf8 47 MB/s 637% 55% -- -18% -76% -77% hash 58 MB/s 803% 90% 23% -- -70% -72% hash_fv 194 MB/s 2946% 541% 313% 237% -- -6% hash_min 206 MB/s 3143% 582% 340% 259% 6% -- All the tests except the ones with C<_utf8> suffix were made in L. For comparison, C was implemented using L hasher and native regular expression tokenizer: ++$fv->{murmur_hash(lc $1)} while $data =~ /(\w+)/gx; See also the F script. =head1 CAVEAT For performance reasons, C method returns a L which is an interface to L. While this seems controversal as L interface increases the overhead, the overall result is positive, since the I generation step overrides the slow interface (B). The interesting property of a L is that the keys are "nearly sorted" (and the first key is guaranteed to be the lowest), so: # This: $fv = $sfx->hash($data); ($min) = each %$fv; # Is the same as this: ($min) = $sfx->hash_min($data); # (albeit the later being 2x faster) The hardcoded limit is 524288 unique keys per result, which consumes ~25MB of RAM on a 64-bit architecture. Exceeding this will C with the message I<"too many unique tokens in a single data chunk"> (dynamic allocation not yet supported). The only way to raise this limit is by recompilation of the XS module: perl Makefile.PL DEFINE=-DMAX_TRIE_SIZE=2097152 make make test make install =head1 REFERENCES =over 4 =item * L by L and L =item * L =item * L<Фильтр Блума|http://habrahabr.ru/post/112069/> =item * L, L and L utilities from this distribution =back =head1 AUTHOR Stanislaw Pusep =head1 COPYRIGHT AND LICENSE This software is copyright (c) 2013 by Stanislaw Pusep. This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself. =cut