package Search::Glimpse::Index;

use warnings;
use strict;

use Search::Glimpse::ConfigData;
use File::Path qw(make_path);

our $VERSION = '0.01';

=encoding UTF-8

=head1 NAME

Search::Glimpse::Index - Interface to glimpseindex

=head1 SYNOPSIS

  use Search::Glimpse::Index;

  my %opt = (
      timeindex   => 1,
      dryrun      => 0,
      indexall    => 0,
      indexnum    => 0,
      incremental => 0,
      structural  => 0,
      destdir     => "$ENV{HOME}/myindexes",
      stopword    => 90,  # must appear in 90% of files
  );

  my $indexer = Search::Glimpse::Index->new( %opt );

  $indexer->index("/path/to/folder/to/index");

=head1 DESCRIPTION

This module is a Perl interface to the glimpseindex binary. It
(hopefully) makes it easier to use the application from within Perl
scripts or modules.

=head2 Available Methods

=over 4

=item C<new>

The constructor receives a hash with the indexing options to use. Note
that all these values have sensible defaults (mostly, the glimpseindex
defaults). Although I briefly describe what each option represents, I
suggest reading the complete manpage for C<glimpseindex>.

Known options are:

=over 4

=item C<destdir> (C<-H> option)

This is the folder where C<glimpseindex> will store its index
files. This is also the path where you should put your exclude/include
files. Future versions of this module might include an interface for
those files.

=item C<dryrun> (C<-I> option)

This option is a boolean value, and sets whether C<glimpseindex>
should really index the files or just output the files that would be
indexed in a real run.

=item C<bigindex> (C<-b> option)

C<glimpseindex> has three different index sizes. By default the medium
index is used (C<-o>). Use this option for bigger indexes and
(hopefully) faster results.

=item C<smallindex>

C<glimpseindex> has three different index sizes. By default the medium
index is used (C<-o>). Use this option for smaller indexes (selected by
passing no index size switch to C<glimpseindex>).

=item C<indexnum> (C<-n> option)

By default, tokens with digits are not indexed. Therefore, tokens such
as dates will not be indexed. Use this option to force tokens with
digits to be indexed.

=item C<indexall> (C<-E> option)

Makes C<glimpseindex> index all files, independently of their file type.
Note that C<glimpseindex> will honor C<.glimpse_exclude> files.

=item C<timeindex> (C<-t> option)

This option is only available for C<glimpseindex> version 3.5 or
newer. It changes the order in which files are indexed. By default
files are indexed in a mostly arbitrary order. With this option (which
doesn't work in C<smallindex> mode), the index will store files in
reverse order of modification time (recent files first). Therefore,
query results are returned in this order, and glimpse is able to
filter results by age.

=item C<incremental> (C<-f> option)

Useful if you have already built an index and need to reindex. This
option will perform an incremental indexing. If there is no current
index or if this procedure fails, glimpseindex automatically reverts to
the default mode (which is to index everything from scratch).

=item C<structural> (C<-s> option)

Use this option if you want to support structured queries.

=item C<stopword> (C<-S> option)

This option is used to control the amount of stop words to be
considered. For further details on how the values of this option
behave, please check the C<glimpseindex> manpage.

=back

=cut

sub new {
    my $class = shift;
    my %ops   = @_;

    # Only prepare the index folder when one was requested; without
    # destdir no -H switch is passed and glimpseindex uses its default.
    if (defined $ops{destdir}) {
        make_path $ops{destdir} unless -d $ops{destdir};
        die "Can't use $ops{destdir}. Permission denied?"
          unless -d $ops{destdir};
    }

    # The documented option name is "stopword"; "swsize" is still
    # accepted because earlier code read that key.
    my $stopword = defined $ops{stopword} ? $ops{stopword} : $ops{swsize};

    my $self = bless {
        timeindex   => $ops{timeindex}   ? "-t" : "",
        dryrun      => $ops{dryrun}      ? "-I" : "",
        indexall    => $ops{indexall}    ? "-E" : "",
        indexnum    => $ops{indexnum}    ? "-n" : "",
        incremental => $ops{incremental} ? "-f" : "",
        structural  => $ops{structural}  ? "-s" : "",
        destdir     => $ops{destdir}     ? "-H $ops{destdir}" : "",
        stopword    => $stopword         ? "-S $stopword" : "",
        # no switch = small index, -b = big index, -o = medium (default)
        indexsize   => $ops{smallindex}  ? "" : ($ops{bigindex} ? "-b" : "-o"),
        bin         => Search::Glimpse::ConfigData->config('glimpseindex')
    } => $class;

    return $self;
}

# check how to support .glimpse files

#  -z   Allow customizable filtering, using the file .glimpse_filters to
#       perform the programs listed there for each match.  The best
#       example is compress/decompress.
#       If .glimpse_filters include the line
#            *.Z   uncompress <
#       (separated by tabs) then before indexing any file that matches
#       the pattern "*.Z" (same syntax as the one for .glimpse_exclude)
#       the command listed is executed first (assuming input is from
#       stdin, which is why uncompress needs <) and its output (assuming
#       it goes to stdout) is indexed.  The file itself is not changed
#       (i.e., it stays compressed).  Then if glimpse -z is used, the
#       same program is used on these files on the fly.  Any program can
#       be used (we run 'exec').  For example, one can filter out parts
#       of files that should not be indexed.  Glimpseindex tries to
#       apply all filters in .glimpse_filters in the order they are
#       given.  For example, if you want to uncompress a file and then
#       extract some part of it, put the compression command (the
#       example above) first and then another line that specifies the
#       extraction.  Note that this can slow down the search because the
#       filters need to be run before files are searched.

#  -B   uses a hash table that is 4 times bigger (256k entries instead
#       of 64K) to speed up indexing.  The memory usage will increase
#       typically by about 2 MB.  This option is only for indexing
#       speed; it does not affect the final index.

#  -i   Make .glimpse_include (SEE GLIMPSEINDEX FILES) take precedence
#       over .glimpse_exclude, so that, for example, one can exclude
#       everything (by putting *) and then explicitly include files.

#  -M x Tells glimpseindex to use x MB of memory for temporary tables.
#       The more memory you allow the faster glimpseindex will run.  The
#       default is x=2.  The value of x must be a positive integer.
#       Glimpseindex will need more memory than x for other things, and
#       glimpseindex may perform some 'forks', so you'll have to
#       experiment if you want to use this option.  WARNING: If x is too
#       large you may run out of swap space.

# sub index_files { }
#
#  -F   Glimpseindex receives the list of files to index from standard
#       input.

=item C<index>

Use with a path to be indexed.

=cut

sub index {
    my $self = shift;
    my $path = shift;

    # Options that are switched off are stored as empty strings.
    my $commandline = join(" ",
                           $self->{bin},
                           $self->{timeindex},
                           $self->{dryrun},
                           $self->{indexall},
                           $self->{indexnum},
                           $self->{indexsize},
                           $self->{stopword},
                           $self->{destdir},
                           $self->{structural},
                           $self->{incremental},
                           $path);

    # Run glimpseindex under the C locale, for this call only.
    local $ENV{LC_ALL} = 'C';

    # Read everything glimpseindex prints and keep it in the object.
    open my $pipe, "-|", $commandline
      or die "Can't execute glimpseindex: $!";
    my $output = join("" => <$pipe>);
    close $pipe;

    $self->{output} = $output || "";
    return $self;
}

#sub append { #... -a }
#sub delete { -d && -D (force) }

### PROBABLY NOTS

#  -R   Recompute .glimpse_filenames_index from .glimpse_filenames.  The
#       file .glimpse_filenames_index speeds up processing.
#       Glimpseindex usually computes it automatically.
#       However, if for some reason one wants to change the path names
#       of the files listed in .glimpse_filenames, then running
#       glimpseindex -R recomputes .glimpse_filenames_index.  This is
#       useful if the index is computed on one machine, but is used on
#       another (with the same hierarchy).  The names of the files
#       listed in .glimpse_filenames are used in runtime, so changing
#       them can be done at any time in any way (as long as just the
#       names not the content is changed).  This is not really an option
#       in the regular sense; rather, it is a program by itself, and it
#       is meant as a post-processing step.  (Available only from
#       version 3.6.)

#  -w k Glimpseindex does a reasonable, but not a perfect, job of
#       determining which files should not be indexed.  Sometimes a
#       large text file should not be indexed; for example, a dictionary
#       may match most queries.  The -w option stores in a file called
#       .glimpse_messages (in the same directory as the index) the list
#       of all files that contribute at least k new words to the index.
#       The user can look at this list of files and decide which should
#       or should not be indexed.  The file .glimpse_exclude contains
#       files that will not be indexed (see more below).  We recommend
#       to set k to about 1000.  This is not an exact measure.  For
#       example, if the same file appears twice, then the second copy
#       will not contribute any new words to the dictionary (but if you
#       exclude the first copy and index again, the second copy will
#       contribute).

#  -X   (starting at version 4.0B1) Extract titles from HTML pages and
#       add the titles to the index (in .glimpse_filenames).  (This
#       feature was added to improve the performance of WebGlimpse.)
#       Works only on files whose names end with .html, .htm, .shtml,
#       and .shtm.  (see glimpse.h/EXTRACT_INFO_SUFFIX to add to these
#       suffixes.)  The routine to extract titles is called
#       extract_info, in index/filetype.c.  This feature can be modified
#       in various ways to extract info from many filetypes.
#       The titles are appended to the corresponding filenames with a
#       space separator.  Glimpseindex assumes that filenames don't have
#       spaces in them.

=back

=head1 SEE ALSO

perl(1)

=head1 AUTHOR

Alberto Manuel Brandão Simões, E<lt>ambs@cpan.orgE<gt>

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2011 by Alberto Manuel Brandão Simões

=cut

!0;