#!/usr/local/bin/perl -w =head1 NAME wordvec.pl - Construct word vectors from bigram or co-occurrence matrices =head1 SYNOPSIS wordvec.pl [OPTIONS] WORD_PAIRS =head1 DESCRIPTION Constructs word vectors from the given WORD_PAIRS. =head1 INPUT =head2 Required Arguments: =head4 WORD_PAIRS WORD_PAIRS should be a bigram or co-occurrence pair file as created by programs count.pl, statistic.pl or combig.pl from the N-gram Statistics package. Various ways to create WORD_PAIRS are - =over =item 1. Run count.pl alone (WORD_PAIRS show bigram frequency counts) =item 2. Run count.pl followed by combig.pl (WORD_PAIRS show co-occurrence pair frequency counts) =item 3. Run count.pl followed by statistic.pl (WORD_PAIRS show test of association scores of bigrams) =item 4. Run count.pl followed by combig.pl followed by statistic.pl (WORD_PAIRS show test of association scores of co-occurrence pairs) =back Cases 1 and 2 will create WORD_PAIRS in format - word1<>word2<>n11 n1p np1 where n11 shows the joint bigram or co-occurrence frequency count Cases 3 and 4 will create WORD_PAIRS in format - word1<>word2<>rank score n11 n1p np1 where 'score' shows the test of association score of a bigram/co-occurrence pair. =head2 Optional Arguments: =head4 --wordorder WORDORD Allows to retain or ignore the order of the words in the WORD_PAIRS. The possible options for the value of --wordorder are - =over 4 =item * nocare Select --wordorder = nocare when WORD_PAIRS do not show any particular order of words. This is applicable only when WORD_PAIRS are created using combig.pl as suggested by cases 2 and 4 in the previous section. This tells wordvec that WORD_PAIRS show the joint co-occurrence scores of the word pairs. With wordorder = nocare, wordvec won't allow word pairs in both orders, meaning, if the pair word1<>word2 appears in the WORD_PAIRS file, pair word2<>word1 won't be allowed. =item * follow [default] Set --wordorder = follow if WORD_PAIRS are bigrams as created with cases 1 and 3 shown in the previous section. For every word pair word1<>word2, word1 will be assigned a single feature index and will represent a row in the output word matrix at that index while word2 will be assigned a single dimension index and will represent a column in the output word matrix at that index. Assumming that word1 is assigned a feature index i and represents ith row and word2 is assigned a dimension index j and represents jth column, the matrix cell at [i][j] will show the frequency of the bigram word1<>word2. =item * precede WORD_PAIRS are bigrams same as in --wordorder = follow, however, for every word pair word1<>word2, word1 is assigned a dimension index and represents a column (as against to representing a row/feature when --wordorder=follow) while word2 is assigned a feature index and represents a row (as against to representing a dimension/column in --wordorder=follow). Assumming that word1 is assigned a dimension index i and represents the ith column, while word2 is assigned a feature index j and represents the jth row, frequency score of bigram word1<>word2 is shown in the matrix cell at [j][i]. Thus, the output word matrix created by --wordorder = precede is a transpose of that created by --wordorder = follow. =back =head4 --binary Creates binary word vectors that show mere presence (by 1) or absence (by 0) of the feature-dimension pairs. By default, wordvec creates frequency vectors that show the frequency scores of the word pairs as given in the WORD_PAIRS file. =head4 --dense Creates dense word vectors. By default, output of wordvec will show sparse word vectors. =head4 --feats FEATFILE Specifies the name of the feature file that lists the words that represent the rows of the output word association matrix. If the FEATFILE exists, words listed in this file define the rows of the output word matrix. Thus, the FEATFILE specifies the feature words for which the word vectors are to be created. If the FEATFILE doesn't exist, it is created by wordvec and shows the words that represent the rows of the output word matrix. =head4 --dims DIMFILE DIMFILE is created by wordvec and reports the words that represent the columns/dimensions of the output word matrix. =head4 --target TARGET_REGEX Specifies a file containing Perl regex/s that define the target word. By default, target.regex file is assumed to exist in the current directory. This is only required if --extarget is selected. =head4 --extarget This will ignore WORD_PAIRS in which either of the constituent words is a target word. Target word can be defined by specifying a target regex file via --target option or by copying target.regex file to current directory. =head4 --format FORM Specifies numeric format for representing each word vector entry. Possible values of FORM are iN -> integer format allocating total N bytes/digits for each entry fN.M -> floating point format allocating total N bytes/digits for each entry of which last M digits show fractional part. When --binary is ON, default format is i2 and otherwise default is f16.10. =head3 Other Options : =head4 --help Displays this message. =head4 --version Displays the version information. =head1 OUTPUT Consider the following illustration - Sample WORD_PAIRS input => stir<>soup<>5 21 64 soup<>plate<>8 70 14 hot<>soup<>12 173 64 hot<>plate<>9 173 29 salt<>pepper<>42 124 121 taste<>salt<>18 83 84 add<>salt<>12 157 84 stir<>lemon<>2 21 53 lemon<>juice<>2 10 2 add<>lemon<>3 157 53 lemon<>pepper<>3 67 120 stir<>juice<>2 21 27 =over =item 1. --wordorder = follow or default Given WORD_PAIRS are treated as bigrams and the order of the words is retained such that the 1st word in the bigrams becomes a feature and is assigned a unique row index while the 2nd word becomes a dimension and is assigned a single column index in the output word matrix. case (1) Feature file provided via --feats FEATFILE doesn't exist Feature file is automatically created by wordvec and lists all the word types that appear as the 1st words in the given bigrams. i.e. - stir<> soup<> hot<> salt<> taste<> add<> lemon<> The dimension file created with --dims option will list all the word types that appear as the 2nd words in the given bigrams. i.e. - soup<> plate<> pepper<> salt<> lemon<> juice<> Thus, the bigrams listed in the given WORD_PAIRS file can be viewed in a matrix form as - soup<> plate<> pepper<> salt<> lemon<> juice<> stir<> 5 0 0 0 2 2 soup<> 0 8 0 0 0 0 hot<> 12 9 0 0 0 0 salt<> 0 0 42 0 0 0 taste<> 0 0 0 18 0 0 add<> 0 0 0 12 3 0 lemon<> 0 0 3 0 0 2 whose rows represent the feature words and columns represent the dimension words. =over =item a. --dense not used By default, the output word matrix is created in sparse format in which the first line shows #rows #cols #nnz i.e. number of rows, number of columns and total number of non-zero entries separated by space. Each line thereafter shows a sparse word vector of the feature shown on the corresponding line in the feature file. A sparse word vector lists pairs of numbers separated by space such that the first number in a pair indicates the column index of a non-zero value and the second number is the value itself that appears at that index. Thus, the output of wordvec for the above example, created with --wordorder = follow or un-specified will be => 7 6 12 1 5 5 2 6 2 2 8 1 12 2 9 3 42 4 18 4 12 5 3 3 3 6 2 where the 1st line "7 6 12" shows that there are total 7 word vectors represented using 6 dimensions with 12 non-zero entries. Each row thereafter indicates a word vector in sparse format. e.g. 2nd line shows the word vector of feature stir<>. This vector has total 3 non-zero values 5, 2, 2 that occur at indices 1, 5, 6 resp. Column index counting starts from 1 to be consistent with Cluto's matrix format. =item b. --dense used When --dense is used, output will show the word matrix in dense format as 7 6 5 0 0 0 2 2 0 8 0 0 0 0 12 9 0 0 0 0 0 0 42 0 0 0 0 0 0 18 0 0 0 0 0 12 3 0 0 0 3 0 0 2 where the first line shows that there are 7 word vectors represented using 6 dimensions. =item c. --binary used When --binary is used, all non-zero bigram scores will be set to 1. Thus, when --dense is used, output will show 7 6 1 0 0 0 1 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 1 Otherwise, binary sparse vectors will look like - 7 6 12 1 1 5 1 6 1 2 1 1 1 2 1 3 1 4 1 4 1 5 1 3 1 6 1 =back case (2) Feature file provided via the '--feats FEATFILE' option exists and lists the features for which the vectors are to be created. Suppose the FEATFILE contains - taste<> hot<> lemon<> salt<> Then, for each bigram word1<>word2, if word1 is one of the above words listed in the FEATFILE, a unique row index say i is assigned to word1 and a unique column index say j is assigned to word2. The matrix entry at [i][j] then indicates the score of the bigram word1<>word2. Thus, for the above example, the word matrix can be viewed as - soup plate pepper salt juice taste 0 0 0 18 0 hot 12 9 0 0 0 lemon 0 0 3 0 2 salt 0 0 42 0 0 The dimension file created with --dims option will show the words that represent the columns - soup<> plate<> pepper<> salt<> juice<> The output of wordvec created with --dense option will look as - 4 5 0 0 0 18 0 12 9 0 0 0 0 0 3 0 2 0 0 42 0 0 where, the first line shows that there are 4 features and 5 dimensions. Each line thereafter shows the word vector of the corresponding feature word. The following shows the sparse representation of the same matrix when --dense is not used - 4 5 6 4 18 1 12 2 9 3 3 5 2 3 42 where the first line indicates that there are total 4 features, 5 dimensions and total 6 non-zero entries in the output matrix. Each row after that shows the 'index value' pair for each non-zero entry at that row, where column indices start with 1s. =item 2. --wordorder = precede Order of words in bigram pairs is retained such that 2nd word becomes a feature and represents a row while the 1st word becomes a dimension and represents a column of the output word matrix. The word matrix thus shows the transpose of the bigram matrix created by --wordorder = follow and the cell values show how frequently a dimension word precedes a feature word. The feature file created with --feats option shows the word types that appear as the 2nd words in the given bigrams. i.e. soup<> plate<> pepper<> salt<> lemon<> juice<> while the dimension file created with dims option shows the word types that appear as the 1st words in the bigrams. i.e. stir<> soup<> hot<> salt<> taste<> add<> lemon<> Thus, the word matrix can be seen as stir<> soup<> hot<> salt<> taste<> add<> lemon<> soup<> 5 0 12 0 0 0 0 plate<> 0 8 9 0 0 0 0 pepper<> 0 0 0 42 0 0 3 salt<> 0 0 0 0 18 12 0 lemon<> 2 0 0 0 0 3 0 juice<> 2 0 0 0 0 0 2 When --dense is selected, word vectors displayed on stdout will look as - 6 7 5 0 12 0 0 0 0 0 8 9 0 0 0 0 0 0 0 42 0 0 3 0 0 0 0 18 12 0 2 0 0 0 0 3 0 2 0 0 0 0 0 2 while by default, output will be sparse as shown by - 6 7 12 1 5 3 12 2 8 3 9 4 42 7 3 5 18 6 12 1 2 6 3 1 2 7 2 If the feature file is provided, vectors are created for the given feature words only and dimensions show the words that precede them. =item 3. --wordorder = nocare When wordorder is nocare, given WORD_PAIRS are treated as co-occurrence pairs and the order of words is ignored. case (1) Feature file provided via '--feats FEATFILE' option doesnt exist. In this case, feature and dimension files will be same and will show all unique word types encountered in the WORD_PAIRS file irrespective of the positions of the words. Each word type in WORD_PAIRS is assigned a unique index and represents the row and column of the output word matrix at that index. Thus, the output word co-occurrence matrix is square and symmetric. Feature and dimension files for above example will show => stir<> soup<> plate<> hot<> salt<> pepper<> taste<> add<> lemon<> juice<> while the word matrix can be seen as stir<> soup<> plate<> hot<> salt<> pepper<> taste<> add<> lemon<> juice<> stir<> 0 5 0 0 0 0 0 0 2 2 soup<> 5 0 8 12 0 0 0 0 0 0 plate<> 0 8 0 9 0 0 0 0 0 0 hot<> 0 12 9 0 0 0 0 0 0 0 salt<> 0 0 0 0 0 42 18 12 0 0 pepper<> 0 0 0 0 42 0 0 0 3 0 taste<> 0 0 0 0 18 0 0 0 0 0 add<> 0 0 0 0 12 0 0 0 3 0 lemon<> 2 0 0 0 0 3 0 3 0 2 juice<> 2 0 0 0 0 0 0 0 2 0 Output word matrix shown on stdout will look as => 10 10 24 2 5 9 2 10 2 1 5 3 8 4 12 2 8 4 9 2 12 3 9 6 42 7 18 8 12 5 42 9 3 5 18 5 12 9 3 1 2 6 3 8 3 10 2 1 2 9 2 Or as 10 10 0 5 0 0 0 0 0 0 2 2 5 0 8 12 0 0 0 0 0 0 0 8 0 9 0 0 0 0 0 0 0 12 9 0 0 0 0 0 0 0 0 0 0 0 0 42 18 12 0 0 0 0 0 0 42 0 0 0 3 0 0 0 0 0 18 0 0 0 0 0 0 0 0 0 12 0 0 0 3 0 2 0 0 0 0 3 0 3 0 2 2 0 0 0 0 0 0 0 2 0 when --dense is ON. case (2) Feature file provided via '--feats FEATFILE' exists and lists the feature words for which the vectors are to be created. In this case, the feature and dimension files won't be same, neither the output matrix will be square and symmetric, unless the FEATFILE is exactly same like the one automatically created by wordvec as in case (1) above. For each bigram word1<>word2 that is encountered in the WORD_PAIRS file, we check if word1 is listed in the given FEATFILE. If so, word2 is assigned a unique dimension index say j and the score of the bigram word1<>word2 is assigned to the matrix entry at [i][j], if word1 occurs at the ith position in the given FEATFILE. Then, we check if word2 is listed in the given FEATFILE and if it is and appears at the kth position in the FEATFILE, we assign a unique dimension (column) index say l to word1 and set the matrix entry at [k][l] to the co-occurrence score of the pair word1<>word2. For example, if the FEATFILE contains - soup<> hot<> salt<> lemon<> pepper<> then, the word matrix with --wordorder = nocare can be viewed as - stir plate soup hot pepper salt taste add juice lemon soup<> 5 8 0 12 0 0 0 0 0 0 hot<> 0 9 12 0 0 0 0 0 0 0 salt<> 0 0 0 0 42 0 18 12 0 0 lemon<> 2 0 0 0 3 0 0 3 2 0 pepper<> 0 0 0 0 0 42 0 0 0 3 Output will display only the word matrix as 5 10 5 8 0 12 0 0 0 0 0 0 0 9 12 0 0 0 0 0 0 0 0 0 0 0 42 0 18 12 0 0 2 0 0 0 3 0 0 3 2 0 0 0 0 0 0 42 0 0 0 3 with --dense ON and 5 10 14 1 5 2 8 4 12 2 9 3 12 5 42 7 18 8 12 1 2 5 3 8 3 9 2 6 42 10 3 without --dense The dimension file created with --dims will show - stir<> plate<> soup<> hot<> pepper<> salt<> taste<> add<> juice<> lemon<> =back =head1 SYSTEM REQUIREMENTS =over =item Ngram Statistics Package - L =back =head1 AUTHORS Amruta Purandare, University of Pittsburgh. Ted Pedersen, University of Minnesota, Duluth tpederse at d.umn.edu =head1 COPYRIGHT Copyright (c) 2002-2008, Amruta Purandare and Ted Pedersen This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. =cut ############################################################################### # THE CODE STARTS HERE ############################################################################### # ================================ # COMMAND LINE OPTIONS AND USAGE # ================================ # command line options use Getopt::Long; GetOptions ("help","version","wordorder=s","binary","dense","dims=s","feats=s","format=s","target=s","extarget"); # show help option if(defined $opt_help) { $opt_help=1; &showhelp(); exit; } # show version information if(defined $opt_version) { $opt_version=1; &showversion(); exit; } # show minimal usage message if less arguments if($#ARGV<0) { &showminimal(); exit; } # default wordorder is follow # word vectors show bigram scores feature<>dimension if(!defined $opt_wordorder) { $opt_wordorder="follow"; } ############################################################################# # ================================ # INITIALIZATION AND INPUT # ================================ #$0 contains the program name along with #the complete path. Extract just the program #name and use in error messages $0=~s/.*\/(.+)/$1/; if($opt_wordorder !~ /(precede)|(follow)|(nocare)/) { print STDERR "ERROR($0): --wordorder must be precede/follow/nocare.\n"; exit; } # ---------------- # WORD_PAIRS file # ---------------- if(!defined $ARGV[0]) { print STDERR "ERROR($0): Please specify the WORD_PAIRS file ...\n"; exit; } $pairsfile=$ARGV[0]; if(!-e $pairsfile) { print STDERR "ERROR($0): WORD_PAIRS file <$pairsfile> doesn't exist...\n"; exit; } open(PAIRS,$pairsfile) || die "Error($0): Error(code=$!) in opening <$pairsfile> file.\n"; # format for printing output if(defined $opt_format) { # integer if($opt_format=~/^i(\d+)$/) { $format_string="%$1d"; $lower_format="-"; while(length($lower_format)<($1-1)) { $lower_format.="9"; } if($lower_format eq "-") { $lower_format="0"; } $upper_format=""; while(length($upper_format)<($1-1)) { $upper_format.="9"; } } # float elsif($opt_format=~/^f(\d+)\.(\d+)$/) { $format_string="%$1\.$2f"; $lower_format="-"; while(length($lower_format)<($1-$2-2)) { $lower_format.="9"; } $lower_format.="."; while(length($lower_format)<($1-1)) { $lower_format.="9"; } $upper_format=""; while(length($upper_format)<($1-$2-2)) { $upper_format.="9"; } $upper_format.="."; while(length($upper_format)<($1-1)) { $upper_format.="9"; } } else { print STDERR "ERROR($0): Wrong format value --format=$opt_format.\n"; exit; } } # default is f16.10 for non-binary # and i2 for binary else { if(defined $opt_binary) { $format_string="%2d"; $lower_format="0"; $upper_format="1"; } else { $format_string="%16.10f"; $lower_format="-999.9999999999"; $upper_format="9999.9999999999"; } } # ------------------- # Target Word regex # ------------------- #file containing regex/s that define the target word if(defined $opt_extarget) { if(defined $opt_target) { $target_file=$opt_target; if(!(-e $target_file)) { print STDERR "ERROR($0): Target regex file <$target_file> doesn't exist.\n"; exit; } } else { $target_file="target.regex"; if(!-e $target_file) { print STDERR "ERROR($0): Please copy the target.regex file into the current directory or specify the target regex file via --target option.\n"; exit; } } # ------------------------ # creating target regex # ------------------------ open(REG,$target_file) || die "ERROR($0): Error(error code=$!) in opening the target regex file <$target_file>.\n"; while() { chomp; s/^\s+//g; s/\s+$//g; if(/^\s*$/) { next; } if(/^\//) { s/^\///; } else { print STDERR "ERROR($0): Regular Expression <$_> should start with '/'\n"; exit; } if(/\/$/) { s/\/$//; } else { print STDERR "ERROR($0): Regular Expression <$_> should end with '/'\n"; exit; } $target.="(".$_.")|"; } if(!defined $target) { print STDERR "ERROR($0): No valid Perl regular expression found in the target regex file <$target_file>.\n"; exit; } else { chop $target; } } if(defined $opt_feats) { $featfile=$opt_feats; if(-e $featfile) { open(FEAT,$featfile) || die "Error($0): Error(code=$!) in opening Feature file <$featfile>.\n"; $line_num=0; while() { $line_num++; # trimming extra spaces chomp; # handling non-unigram lines if(/^[\s\d]*$/ || ($_=~/^@/ && $_!~/<>/)) { next; } # -------------------------------- # Checking for valid unigram file # -------------------------------- $check_unigram=$_; $cnt=0; #count how many times <> occurs while($check_unigram=~/<>/) { $cnt++; $check_unigram=$'; } #should be 1 for unigrams if($cnt!=1) { print STDERR "ERROR($0): Given Feature file <$featfile> is not a valid Unigram output of NSP at line <$line_num>.\n"; exit; } # storing feature words if(/^(.*)<>\d*\s*$/) { push @features,$1; $feature_index{$1}=scalar(@features); } else { print STDERR "ERROR($0): Given Feature file <$featfile> is not a valid Unigram output of NSP at line <$line_num>.\n"; exit; } } } } ############################################################################## # ===================================== # Construct a Co-occurrence Table # from the Bigram file # ===================================== # read each entry in bigram file $line_num=0; while() { $line_num++; # trimming extra spaces chomp; # handling non-bigram lines if(/^[\s\d]*$/ || ($_=~/^@/ && $_!~/<>/)) { next; } # ------------------------------ # Checking for Valid bigram file # ------------------------------ $check_bigram=$_; $cnt=0; #count how many times <> occurs while($check_bigram=~/<>/) { $cnt++; $check_bigram=$'; } #should be 2 for bigrams if($cnt!=2) { print STDERR "ERROR($0): Given WORD_PAIRS file <$pairsfile> is not a valid Bigram output of NSP at line <$line_num>.\n"; exit; } # -------------------------------------------------- # Extracting words and their Co-occurrence scores # -------------------------------------------------- # output created by count.pl or combig.pl if(/^(.*)<>(.*)<>(\d+)\s+\d+\s+\d+\s*$/) { $word1=$1; $word2=$2; $score=$3; } # output created by statistic.pl elsif(/^(.*)<>(.*)<>\d+\s+(\-?\d*\.?\d+)\s+\d+\s+\d+\s+\d+\s*$/) { $word1=$1; $word2=$2; $score=$3; } else { print STDERR "ERROR($0): Given WORD_PAIRS file <$pairsfile> is not a valid Bigram output of NSP at line <$line_num>.\n"; exit; } # ignore pair if either of the features is a target word if(defined $opt_extarget && ($word1=~/^$target$/ || $word2=~/^$target$/)) { next; } # added by AKK 21st Feb 2005 # skip the word-pairs with 0 score # and the word-pairs whose score become 0 after formatting $value=sprintf $format_string,$score; # for binary representation if(defined $opt_binary && $value != 0) { $value = 1; } if($value<$lower_format) { print STDERR "ERROR($0): Floating point underflow. Value <$value> can't be represented with format $format_string.\n"; exit 1; } if($value>$upper_format) { print STDERR "ERROR($0): Floating point overflow. Value <$value> can't be represented with format $format_string.\n"; exit 1; } if($value==0) { next; } # end by AKK # wordorder = nocare when order of words in WORD_PAIRS file # doesn't matter # every word type is a feature as well as dimension if($opt_wordorder =~ /nocare/) { if(defined $featfile && -e $featfile) { if(defined $feature_index{$word1}) { if(!defined $dimension_index{$word2}) { push @dimensions, $word2; $dimension_index{$word2}=scalar(@dimensions); } $index1=$feature_index{$word1}; $index2=$dimension_index{$word2}; if(defined $coctable{$index1}{$index2}) { print STDERR "ERROR($0): Pair \"$word1<>$word2\" is repeated in the WORD_PAIRS file <$pairsfile>.\n"; exit; } if(defined $opt_binary) { $coctable{$index1}{$index2}=1; } else { $coctable{$index1}{$index2}=$score; } $nnz++; } if(($word1 ne $word2) && defined $feature_index{$word2}) { if(!defined $dimension_index{$word1}) { push @dimensions, $word1; $dimension_index{$word1}=scalar(@dimensions); } $index1=$feature_index{$word2}; $index2=$dimension_index{$word1}; if(defined $coctable{$index1}{$index2}) { print STDERR "ERROR($0): Pair \"$word1<>$word2\" is repeated in the WORD_PAIRS file <$pairsfile>.\n"; exit; } if(defined $opt_binary) { $coctable{$index1}{$index2}=1; } else { $coctable{$index1}{$index2}=$score; } $nnz++; } } else { # assigning numeric index to each feature/dimension if(!defined $index{$word1}) { push @features,$word1; push @dimensions,$word1; $index{$word1}=scalar(@features); } if(!defined $index{$word2}) { push @features,$word2; push @dimensions,$word2; $index{$word2}=scalar(@features); } $index1=$index{$word1}; $index2=$index{$word2}; # pair already seen if(defined $coctable{$index1}{$index2} || defined $coctable{$index2}{$index1}) { print STDERR "ERROR($0): Pair \"$word1<>$word2\" is repeated in the WORD_PAIRS file <$pairsfile>.\n"; exit; } if(!defined $opt_binary) { $coctable{$index1}{$index2}=$score; $coctable{$index2}{$index1}=$score; } else { $coctable{$index1}{$index2}=1; $coctable{$index2}{$index1}=1; } if($word1 ne $word2) { $nnz+=2; } else { $nnz++; } } } # wordorder = precede elsif($opt_wordorder =~ /precede/) { if(defined $featfile && -e $featfile && !defined $feature_index{$word2}) { next; } # with wordorder = precede, # word2 is a feature and # word1 is a dimension if(!defined $feature_index{$word2}) { push @features,$word2; $feature_index{$word2}=scalar(@features); } if(!defined $dimension_index{$word1}) { push @dimensions,$word1; $dimension_index{$word1}=scalar(@dimensions); } $index1=$feature_index{$word2}; $index2=$dimension_index{$word1}; if(defined $coctable{$index1}{$index2}) { print STDERR "ERROR($0): Pair \"$word1<>$word2\" is repeated in WORD_PAIRS file <$pairsfile>.\n"; exit; } if(!defined $opt_binary) { $coctable{$index1}{$index2}=$score; } else { $coctable{$index1}{$index2}=1; } $nnz++; } # wordorder = follow elsif($opt_wordorder =~ /follow/) { if(defined $featfile && -e $featfile && !defined $feature_index{$word1}) { next; } # with wordorder = follow # word1 is a feature and # word2 is a dimension if(!defined $feature_index{$word1}) { push @features,$word1; $feature_index{$word1}=scalar(@features); } if(!defined $dimension_index{$word2}) { push @dimensions,$word2; $dimension_index{$word2}=scalar(@dimensions); } $index1=$feature_index{$word1}; $index2=$dimension_index{$word2}; if(defined $coctable{$index1}{$index2}) { print STDERR "ERROR($0): Pair \"$word1<>$word2\" is repeated in WORD_PAIRS file <$pairsfile>.\n"; exit; } if(!defined $opt_binary) { $coctable{$index1}{$index2}=$score; } else { $coctable{$index1}{$index2}=1; } $nnz++; } } ############################################################################## # ========================= # Printing Word Vectors # ========================= print scalar(@features) . " " . scalar(@dimensions); if(!defined $opt_dense) { print " $nnz"; } print "\n"; # for each feature foreach $row (1..scalar(@features)) { if(defined $opt_dense) { # for each dimension foreach $col (1..scalar(@dimensions)) { # checking if feature-dimension co-occur # according to the bigram file if(defined $coctable{$row}{$col}) { $value=sprintf $format_string,$coctable{$row}{$col}; if($value<$lower_format) { print STDERR "ERROR($0): Floating point underflow. Value <$value> can't be represented with format $format_string.\n"; exit 1; } if($value>$upper_format) { print STDERR "ERROR($0): Floating point overflow. Value <$value> can't be represented with format $format_string.\n"; exit 1; } } else { $value=sprintf($format_string,0); } print $value; } } # print sparse word vectors else { @sparse_cols=keys %{$coctable{$row}}; @sorted_sparse_cols=sort {$a <=> $b} @sparse_cols; foreach $sparse_col (@sorted_sparse_cols) { # added by AKK on 21st Feb 2005 # filter out the 0 valued scores #$val = $coctable{$row}{$sparse_col}; $val=sprintf $format_string,$coctable{$row}{$sparse_col}; # if($val != 0) # { # print "$sparse_col $coctable{$row}{$sparse_col} "; print "$sparse_col $val "; # } } } print "\n"; } undef $opt_extarget; ############################################################################## # ===================================== # Reporting Features and Dimensions # ===================================== # ------------- # Feature file # ------------- if(defined $featfile && !-e $featfile) { open(FEAT,">$featfile") || die "ERROR($0): Error(code=$!) in opening Feature file <$featfile>.\n"; foreach (@features) { print FEAT "$_<>\n"; } } # ---------------- # Dimension file # ---------------- if(defined $opt_dims) { $dimfile=$opt_dims; if(-e $dimfile) { print STDERR "Warning($0): Dimension file <$dimfile> already exists, overwrite (y/n)? "; $ans=; } if(!-e $dimfile || $ans=~/Y|y/) { open(DIM,">$dimfile") || die "ERROR($0): Error(code=$!) in opening Dimension file <$dimfile>.\n"; foreach (@dimensions) { print DIM "$_<>\n"; } } } ############################################################################## # ========================== # SUBROUTINE SECTION # ========================== #----------------------------------------------------------------------------- #show minimal usage message sub showminimal() { print "Usage: wordvec.pl [OPTIONS] WORD_PAIRS"; print "\nTYPE wordvec.pl --help for help\n"; } #----------------------------------------------------------------------------- #show help sub showhelp() { print "Usage: wordvec.pl [OPTIONS] WORD_PAIRS Converts the given NSP output into a word-by-word association matrix. WORD_PAIRS Should be a bigram/co-occurrence score file as created by programs count.pl, combig.pl or statistics.pl from the NSP. OPTIONS: --wordorder WORDORD Specifies whether wordvec should retain or ignore the order of words in the WORD_PAIRS file. The possible values of WORDORD are - nocare - order of words in WORD_PAIRs is ignored. WORD_PAIRS are co-occurrence pairs created using combig.pl program from NSP follow - WORD_PAIRS are bigrams and word vectors show how frequently a dimension word (2nd word) follows a feature word (1st word) [Default] precede - WORD_PAIRS are bigrams and word vectors show how frequently a dimension word (1st word) precedes a feature word (2nd word) --binary Creates binary word vectors. --dense Creates dense word vectors. By default, output word vectors are sparse. --feats FEATFILE If the FEATFILE exists, features are extracted from this file, otherwise, automatically extracted features are written into the FEATFILE. --dims DIMFILE Writes extracted dimensions to DIMFILE. --target TARGET_REGEX Specifies a file containing Perl regex/s that define the target word. By default, target.regex is assumed to exist in current directory. --target is ignored unless --extarget is used. --extarget Ignores WORD_PAIRS in which either of the constituent words is a target word as specified via --target option or default target.regex file. --format FORM Specifies the numeric format for output representation. Default format for binary word vectors is i2 and for non-binary frequency vectors default format is f16.10. --help Displays this message. --version Displays the version information. Type 'perldoc wordvec.pl' to view detailed documentation of wordvec.\n"; } #------------------------------------------------------------------------------ #version information sub showversion() { print '$Id: wordvec.pl,v 1.24 2008/03/30 04:40:58 tpederse Exp $'; print "\nCreate word vectors from Text-NSP output\n"; # print "wordvec.pl - Version 0.4\n"; # print "Builds word vectors.\n"; # print "Copyright (c) 2002-2005, Amruta Purandare, Anagha Kulkarni & Ted Pedersen.\n"; # print "Date of Last Update: 03/04/2005\n"; } #############################################################################