package Text::Refer; =head1 NAME Text::Refer - parse Unix "refer" files I =head1 SYNOPSIS Pull in the module: use Text::Refer; Parse a refer stream from a filehandle: while ($ref = input Text::Refer \*FH) { # ...do stuff with $ref... } defined($ref) or die "error parsing input"; Same, but using a parser object for more control: # Create a new parser: $parser = new Text::Refer::Parser LeadWhite=>'KEEP'; # Parse: while ($ref = $parser->input(\*FH)) { # ...do stuff with $ref... } defined($ref) or die "error parsing input"; Manipulating reference objects, using high-level methods: # Get the title, author, etc.: $title = $ref->title; @authors = $ref->author; # list context $lastAuthor = $ref->author; # scalar context # Set the title and authors: $ref->title("Cyberiad"); $ref->author(["S. Trurl", "C. Klapaucius"]); # arrayref for >1 value! # Delete the abstract: $ref->abstract(undef); Same, using low-level methods: # Get the title, author, etc.: $title = $ref->get('T'); @authors = $ref->get('A'); # list context $lastAuthor = $ref->get('A'); # scalar context # Set the title and authors: $ref->set('T', "Cyberiad"); $ref->set('A', "S. Trurl", "C. Klapaucius"); # Delete the abstract: $ref->set('X'); # sets to empty array of values Output: print $ref->as_string; =head1 DESCRIPTION I This module provides routines for parsing in the contents of "refer"-format bibliographic databases: these are simple text files which contain one or more bibliography records. They are usually found lurking on Unix-like operating systems, with the extension F<.bib>. Each record in a "refer" file describes a single paper, book, or article. Users of nroff/troff often employ such databases when typesetting papers. Even if you don't use *roff, this simple, easily-parsed parameter-value format is still useful for recording/exchanging bibliographic information. With this module, you can easily post-process "refer" files: search them, convert them into LaTeX, whatever. =head2 Example Here's a possible "refer" file with three entries: %T Cyberiad %A Stanislaw Lem %K robot fable %I Harcourt/Brace/Jovanovich %T Invisible Cities %A Italo Calvino %K city fable philosophy %X In this surreal series of fables, Marco Polo tells an aged Kublai Khan of the many cities he has visited in his lifetime. %T Angels and Visitations %A Neil Gaiman %D 1993 The lines separating the records must be I; that is, they cannot contain anything but a single newline. See refer(1) or grefer(1) for more information on "refer" files. =head2 Syntax I:> The bibliographic database is a text file consisting of records separated by one or more blank lines. Within each record fields start with a % at the beginning of a line. Each field has a one character name that immediately follows the %. It is best to use only upper and lower case letters for the names of fields. The name of the field should be followed by exactly one space, and then by the contents of the field. Empty fields are ignored. The conventional meaning of each field is as follows: =over 4 =item A The name of an author. If the name contains a title such as Jr. at the end, it should be separated from the last name by a comma. There can be multiple occurrences of the A field. The order is significant. It is a good idea always to supply an A field or a Q field. =item B For an article that is part of a book, the title of the book =item C The place (city) of publication. =item D The date of publication. The year should be specified in full. If the month is specified, the name rather than the number of the month should be used, but only the first three letters are required. It is a good idea always to supply a D field; if the date is unknown, a value such as "in press" or "unknown" can be used. =item E For an article that is part of a book, the name of an editor of the book. Where the work has editors and no authors, the names of the editors should be given as A fields and , (ed) or , (eds) should be appended to the last author. =item G US Government ordering number. =item I The publisher (issuer). =item J For an article in a journal, the name of the journal. =item K Keywords to be used for searching. =item L Label. B Uniquely identifies the entry. For example, "Able94". =item N Journal issue number. =item O Other information. This is usually printed at the end of the reference. =item P Page number. A range of pages can be specified as m-n. =item Q The name of the author, if the author is not a person. This will only be used if there are no A fields. There can only be one Q field. B Thanks to Mike Zimmerman for clarifying this for me: it means a "corporate" author: when the "author" is listed as an organization such as the UN, or RAND Corporation, or whatever. =item R Technical report number. =item S Series name. =item T Title. For an article in a book or journal, this should be the title of the article. =item V Volume number of the journal or book. =item X Annotation. B Basically, a brief abstract or description. =back For all fields except A and E, if there is more than one occurrence of a particular field in a record, only the last such field will be used. If accent strings are used, they should follow the character to be accented. This means that the AM macro must be used with the -ms macros. Accent strings should not be quoted: use one \ rather than two. =head2 Parsing records from "refer" files You will nearly always use the C constructor to create new instances, and nearly always as shown in the L<"SYNOPSIS">. Internally, the records are parsed by a parser object; if you invoke the class method C, a special default parser is used, and this will be good enough for most tasks. However, for more complex tasks, feel free to use L<"class Text::Refer::Parser"> to build (and use) your own fine-tuned parser, and C from that instead. =head1 CLASS Text::Refer Each instance of this class represents a single record in a "refer" file. =cut use strict; use vars (qw($VERSION $QUIET $GroffFields)); #------------------------------ # # GLOBALS # #------------------------------ # The package version, both in 1.23 style *and* usable by MakeMaker: $VERSION = substr q$Revision: 1.106 $, 10; # Suppress warnings? $QUIET = 0; # Legal fields for different situations: $GroffFields = '[A-EGI-LN-TVX]'; # groff # The default parser: my $Parser = new Text::Refer::Parser; #============================== =head2 Construction and input =over 4 =cut #------------------------------------------------------------ =item new I Build an empty "refer" record. =cut sub new { my $type = shift; bless {}, $type; } #------------------------------------------------------------ =item input FILEHANDLE I Input a new "refer" record from a filehandle. The default parser is used: while ($ref = input Text::Refer \*STDIN) { # ...do stuff with $ref... } Do I use this as an instance method; it will not re-init the object you give it. =cut sub input { shift; $Parser->input(@_); } =back =cut #============================== =head2 Getting/setting attributes =over 4 =cut #------------------------------------------------------------ =item attr ATTR, [VALUE] I Get/set the attribute by its one-character name, ATTR. The VALUE is optional, and may be given in a number of ways: =over 4 =item * B, the attribute will be deleted: $ref->attr('X', undef); # delete the abstract =item * B it is used to replace the existing values for the attribute with that I value: $ref->attr('T', "The Police State Rears Its Ugly Head"); $ref->attr('D', 1997); =item * B it is used to replace the existing values for the attribute with I $ref->attr('A', ["S. Trurl", "C. Klapaucius"]); We use an arrayref since an empty array would be impossible to distinguish from the next two cases, where the goal is to "get" instead of "set"... =back This method returns the current (or new) value of the given attribute, just as C does: =over 4 =item * B context,> the method will return the I value (this is to mimic the behavior of I). Hence, given the above, the code: $author = $ref->attr('A'); will set C<$author> to C<"C. Klapaucius">. =item * B context,> the method will return the list of I values, in order. Hence, given the above, the code: @authors = $ref->attr('A'); will set C<@authors> to C<("S. Trurl", "C. Klapaucius")>. =back I this method is used as the basis of all "named" access methods; hence, the following are equivalent in every way: $ref->attr(T => $title) <=> $ref->title($title); $ref->attr(A => \@authors) <=> $ref->author(\@authors); $ref->attr(D => undef) <=> $ref->date(undef); $auth = $ref->attr('A') <=> $auth = $ref->author; @auths = $ref->attr('A') <=> @auths = $ref->author; =cut sub attr { my ($self, $attr, $values) = @_; if (@_ > 2) { # set the "values"... # undef => empty array # non-arrayref => array of one element # arrayref => that array $values = defined($values) ? $values : []; $self->set($attr, (ref($values) ? @$values : ($values))); } $self->get($attr); } #------------------------------------------------------------ =item author, book, city, ... [VALUE] I For every one of the standard fields in a "refer" record, this module has designated a high-level attribute name: A author G govt_no N number S series B book I publisher O other_info T title C city J journal P page V volume D date K keywords Q corp_author X abstract E editor L label R report_no Then, for each field I with high-level attribute name I, the method C works as follows: $ref->attr('F', @args) <=> $ref->FIELDNAME(@args) Which means: $ref->attr(T => $title) <=> $ref->title($title); $ref->attr(A => \@authors) <=> $ref->author(\@authors); $ref->attr(D => undef) <=> $ref->date(undef); $auth = $ref->attr('A') <=> $auth = $ref->author; @auths = $ref->attr('A') <=> @auths = $ref->author; See the documentation of C for the argument list. =cut sub author { shift->attr('A',@_) } sub book { shift->attr('B',@_) } sub city { shift->attr('C',@_) } sub date { shift->attr('D',@_) } sub editor { shift->attr('E',@_) } sub govt_no { shift->attr('G',@_) } sub publisher { shift->attr('I',@_) } sub journal { shift->attr('J',@_) } sub keywords { shift->attr('K',@_) } sub label { shift->attr('L',@_) } sub number { shift->attr('N',@_) } sub other_info { shift->attr('O',@_) } sub page { shift->attr('P',@_) } sub corp_author { shift->attr('Q',@_) } sub report_no { shift->attr('R',@_) } sub series { shift->attr('S',@_) } sub title { shift->attr('T',@_) } sub volume { shift->attr('V',@_) } sub abstract { shift->attr('X',@_) } #------------------------------------------------------------ =item get ATTR I Get an attribute, by its one-character name. In an array context, it returns all values (empty if none): @authors = $ref->get('A'); # returns list of all authors In a scalar context, it returns the I value (undefined if none): $author = $ref->get('A'); # returns the last author =cut sub get { my ($self, $attr) = @_; my $vals = $self->{$attr} || []; (wantarray ? @$vals : $vals->[-1]); } #------------------------------------------------------------ =item set ATTR, VALUES... I Set an attribute, by its one-character name. $ref->set('A', "S. Trurl", "C. Klapaucius"); An empty array of VALUES deletes the attribute: $ref->set('A'); # deletes all authors No useful return value is currently defined. =cut sub set { my $self = shift; my $attr = shift; if (@_) { $self->{$attr} = [@_] } else { delete $self->{$attr} } 1; } =back =cut #============================== =head2 Output =over 4 =cut #------------------------------------------------------------ # # _wrap STRING # # Split string into lines not exceeding 80 chars in length. my $SMIN = 50; # don't split at nonwords before this position my $SMAX = 75; # max line length sub _wrap { pos($_[0]) = 0; $_[0] =~ s{\G ( # from current position... (.{1,$SMAX})(?:\n|\Z) # next line (if of legal length), plus EOL | # or, (.{$SMIN,$SMAX}\W) # longest prefx of MIN-MAX chars endng in nonword | # or, (.{$SMAX}) # the first MAX chars ) }{ (defined($2) ? $2 : $1) . "\n" # replace with text followed by \n }gexo; chop $_[0] if (substr($_[0], -1, 1) eq "\n"); # get rid of final \n 1; } #-----------------------------------------------------q------- =item as_string [OPTSHASH] I Return the "refer" record as a string, usually for printing: print $ref->as_string; The options are: =over 4 =item Quick If true, do it quickly, but unsafely. I they are output as-is. That means if you used parser-options which destroyed any of the formatting whitespace (e.g., C with C), there is a risk that the output object will be an invalid "refer" record. =back The fields are output with %L first (if it exists), and then the remaining fields in alphabetical order. The following "safety measures" are normally taken: =over 4 =item * Lines longer than 76 characters are wrapped (if possible, at a non-word character a reasonable length in, but there is a chance that they will simply be "split" if no such character is available). =item * Any occurences of '%' immediately after a newline are preceded by a single space. =back These safety measures are slightly time-consuming, and are silly if you are merely outputting a "refer" object which you have read in verbatim (i.e., using the default parser-options) from a valid "refer" file. In these cases, you may want to use the B option. =cut sub as_string { my ($self, %opts) = @_; my ($key, $val); # Figure out the keys to use, and put them in order: my @keys = sort grep {(length == 1) && ($_ ne 'L')} (keys %$self); defined($self->{'L'}) && unshift(@keys, 'L'); # Output: my @lines; foreach $key (@keys) { foreach $val (@{$self->{$key}}) { unless ($opts{Quick}) { ### print "UNWRAPPED = [$val]\n"; _wrap($val); # make sure no line exceeds 80 chars ### print "WRAPPED = [$val]\n"; $val =~ s/\n%/\n %/g; # newlines must NOT be followed by % $val =~ s/\n+\Z//; # strip trailing newlines } push @lines, join('', '%', $key, ' ', $val, "\n"); } } join '', @lines; } =back =cut #============================== # package Text::Refer::Parser; # #============================== =head1 CLASS Text::Refer::Parser Instances of this class do the actual parsing. =head2 Parser options The options you may give to C are as follows: =over 4 =item ForgiveEOF Normally, the last record in a file must end with a blank line, or else this module will suspect it of being incomplete and return an error. However, if you give this option as true, it will allow the last record to be terminated by an EOF. =item GoodFields By default, the parser accepts any (one-character) field name that is a printable ASCII character (no whitespace). Formally, this is: [\041-\176] However, when compiling parser options, you can supply your own regular expression for validating (one-character) field names. (I you must supply the square brackets; they are there to remind you that you should give a well-formed single-character expression). One standard expression is provided for you: $Text::Refer::GroffFields = '[A-EGI-LN-TVX]'; # legal groff fields Illegal fields which are encounterd during parsing result in a syntax error. B You really shouldn't use this unless you absolutely need to. The added regular expression test slows down the parser. =item LeadWhite In many "refer" files, continuation lines (the 2nd, 3rd, etc. lines of a field) are written with leading whitespace, like this: %T Incontrovertible Proof that Pi Equals Three (for Large Values of Three) %A S. Trurl %X The author shows how anyone can use various common household objects to obtain successively less-accurate estimations of pi, until finally arriving at a desired integer approximation, which nearly always is three. This leading whitespace serves two purposes: (1) it makes it impossible to mistake a continuation line for a field, since % can no longer be the first character, and (2) it makes the entries easier to read. The C option controls what is done with this whitespace: KEEP - default; the whitespace is untouched KILLONE - exactly one character of leading whitespace is removed KILLALL - all leading whitespace is removed See the section below on "using the parser options" for hints and warnings. =item Newline The C option controls what is done with the newlines that separate adjacent lines in the same field: KEEP - default; the newlines are kept in the field value TOSPACE - convert each newline to a single space KILL - the newlines are removed See the section below on "using the parser options" for hints and warnings. =back Default values will be used for any options which are left unspecified. =head2 Notes on the parser options The default values for C and C will preserve the input text exactly. The C option, when used in conjunction with the C option, effectively "word-wraps" the text of each field into a single line. B If you use the C option with either the C or the C option, you could end up eliminating all whitespace that separates the word at the end of one line from the word at the beginning of the next line. =head2 Public interface =over 4 =cut use strict; use Carp; #------------------------------------------------------------ sub error { my $self = shift; warn "refer: l.$.: ".join('',@_)."\n" unless $Text::Refer::QUIET; return (wantarray ? () : undef); } #------------------------------------------------------------ =item new PARAMHASH I Create and return a new parser. See above for the L<"parser options"> which you may give in the PARAMHASH. =cut sub new { my ($class, %params) = @_; my $self = \%params; $self->{Class} ||= 'Text::Refer'; $self->{Newline} ||= 'KEEP'; $self->{LeadWhite} ||= 'KEEP'; $self->{GoodFields} ||= '[\041-\176]'; # Compile allowed fields: my $gf = substr($self->{GoodFields}, 1); ($self->{Fields} = join('', map {chr($_)} 0..255)) =~ s{[^$gf}{}g; # The EOL character: if ($self->{Newline} eq 'KILL') { $self->{EOL} = "" } elsif ($self->{Newline} eq 'TOSPACE') { $self->{EOL} = " " } else { $self->{EOL} = "\n" } bless $self, $class; } #------------------------------------------------------------ =item create [CLASS] I What class of objects to create. The default is C. =cut sub create { my ($self, $class) = @_; $self->{Class} = $class if $class; $self->{Class}; } #------------------------------------------------------------ =item input FH I Create a new object from the next record in a "refer" stream. The actual class of the object is given by the C method. Returns the object on success, '0' on I end-of-file, and undefined on error. Having two false values makes parsing very simple: just C records until the result is false, then check to see if that last result was 0 (end of file) or undef (failure). =cut sub input { my ($self, $fh) = @_; my $line; # the next line my $field; # last key read in, or undef local($/) = "\n"; # in case our caller has been naughty # Get options into scalars for faster usage: my $LeadWhite = $self->{LeadWhite}; my $EOL = $self->{EOL}; # Skip blank lines until (legal) EOF or record: while (1) { defined($_ = <$fh>) or return 0; chomp; last if length($_); # break if we hit a nonblank line } # Start new object: my $ref = $self->create->new; $ref->{LineNo} = $.; # Read record lines until (unexpected) EOF or done: while (1) { if (/^%(.)\s?(.*)$/) { # start new field... (index($self->{Fields}, ($field = $1)) >= 0) or return $self->error("bad record field '$field' in <$_>"); push @{$ref->{$field} ||= []}, $2; } elsif (defined($field)) { # add line to previous field... # Muck about with leading whitespace (implicit else is KEEP): if ($LeadWhite eq 'KILLONE') { # kill first leading white: s/^\s//; } elsif ($LeadWhite eq 'KILLALL') { # kill all leading white: s/^\s+//; } # Add separator and new line to existing value: $ref->{$field}[-1] .= ($EOL . $_); } else { # yow! line not inside record! return $self->error("line outside record: <$_>"); } } continue { defined($_ = <$fh>) or do { # unexpected EOF... forgive it? $self->{ForgiveEOF}? last : return $self->error("unexpected EOF")}; chomp; last if ($_ eq ''); # blank line means end of record } # Done! $ref; } =back =cut #------------------------------------------------------------ =head1 NOTES =head2 Under the hood Each "refer" object has instance variables corresponding to the actual field names (C<'T'>, C<'A'>, etc.). Each of these is a reference to an array of the actual values. Notice that, for maximum flexibility and consistency (but at the cost of some space and access-efficiency), the semantics of "refer" records do not come into play at this time: since everything resides in an array, you can have as many %K, %D, etc. records as you like, and given them entirely different semantics. For example, the Library Of Boring Stuff That Everyone Reads (LOBSTER) uses the unused %Y as a "year" field. The parser accomodates this case by politely not choking on LOBSTER .bibs (although why you would want to eat a lobster bib instead of the lobster is beyond me...). =head2 Performance Tolerable. On my 90MHz/32 MB RAM/I586 box running Linux 1.2.13 and Perl5.002, it parses a typical 500 KB "refer" file (of 1600 records) as follows: 8 seconds of user time for input and no output 10 seconds of user time for input and "quick" output 16 seconds of user time for input and "safe" output So, figure the individual speeds are: input: 200 records ( 60 KB) per second. "quick" output: 800 records (240 KB) per second. "safe" output: 200 records ( 60 KB) per second. By contrast, a C program which does the same work is about 8 times as fast. But of course, the C code is 8 times as large, and 8 times as ugly... C<:-)> =head2 Note to serious bib-file users I actually do not use "refer" files for *roffing... I used them as a quick-and-dirty database for WebLib, and that's where this code comes from. If you're a serious user of "refer" files, and this module doesn't do what you need it to, please contact me: I'll add the functionality in. =head1 BUGS Some combinations of parser-options are silly. =head1 CHANGE LOG $Id: Refer.pm,v 1.106 1997/04/22 18:41:41 eryq Exp $ =over 4 =item Version 1.101 Initial release. Adapted from Text::Bib. =back =head1 AUTHOR Copyright (C) 1997 by Eryq, F, F. =head1 NO WARRANTY This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. For a copy of the GNU General Public License, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. =cut 1;