#!/usr/local/bin/perl -w

=head1 NAME

discriminate.pl - Wrapper program to run SenseClusters in a single command

=head1 SYNOPSIS

Discriminates among the given text instances based on their contextual
similarities.

=head1 USAGE

discriminate.pl [OPTIONS] TEST

=head1 INPUT

=head2 Required Arguments:

=head3 TEST

Senseval-2 formatted TEST instance file that contains the instances to be
clustered.

=head2 Optional Arguments:

=head3 DATA OPTIONS :

=head4 --training TRAIN

Training file in plain text format that can be used to select features.
If this is not specified, features are selected from the given TEST file.

=head4 --split N

Splits the given TEST file into two portions, N% for use as the TRAIN data
and (100-N)% as the TEST data. The value for N is a percentage and should be
an integer between 1 and 99 (inclusive). The instances from the original TEST
file are not picked or split in any particular order but are randomly split
into the two portions of TRAIN and TEST data while maintaining the ratio of
N/(100-N).

Note: This option cannot be used when the --training option is also used.

=head4 --token TOKEN

A file containing Perl regex/s that define the tokenization scheme in TRAIN
and TEST files. If --token is not specified, the default token regex file
token.regex is searched for in the current directory.

=head4 --target TARGET

A file containing Perl regex/s for identifying the target word. A sample
target.regex file containing regex: /\w+/ is provided with this distribution.
If --target is not specified, the default target regex file target.regex is
searched for in the current directory. If this file doesn't exist,
target.regex is automatically created by finding all instances of <head>
tags in the TEST data. If there are no instances of <head> tags in TEST, the
given data is assumed to be global and the target word is not searched for
in either TRAIN or TEST.

Note: --target cannot be specified with headless input data i.e. a test file
without head/target word(s).
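The random N/(100-N) split described for --split can be pictured with the
short sketch below. It is only an illustration and not the code that
discriminate.pl runs: the helper name split_instances is hypothetical, and
rounding the TRAIN size down with int() is an assumption.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper (not part of discriminate.pl): randomly assign about
# N% of the instances to TRAIN and the remainder to TEST.
# Rounding the TRAIN size down with int() is an assumption of this sketch.
sub split_instances {
    my ($percent, @instances) = @_;
    die "N must be an integer between 1 and 99\n"
        unless $percent =~ /^\d+$/ && $percent >= 1 && $percent <= 99;

    my @shuffled    = shuffle(@instances);               # random order
    my $train_count = int(@shuffled * $percent / 100);   # size of TRAIN
    my @train = @shuffled[0 .. $train_count - 1];
    my @test  = @shuffled[$train_count .. $#shuffled];
    return (\@train, \@test);
}

my ($train, $test) = split_instances(80, map { "instance.$_" } 1 .. 10);
printf "TRAIN: %d instances, TEST: %d instances\n",
    scalar @$train, scalar @$test;
```

Shuffling first and cutting once keeps the two portions disjoint, so every
instance ends up in exactly one of TRAIN and TEST.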
=head4 --prefix PRE Specify a prefix to be used in all output file names. e.g. context vector file will have name 'PRE.vectors', features file will have name 'PRE.features' and so on ... By default, a random prefix is created using the time stamp. =head4 --format f16.XX The default format for floating point numbers is f16.06. This means that there is room for 6 digits to the right of the decimal, and 9 to the left. You may change XX to any value between 0 and 15, however, the format must remain 16 spaces long due to formatting requirements of SVDPACKC. =head4 --wordclust Discriminates and clusters each word based upon its direct and indirect co-occurrence with other words (when used without the --lsa switch) or clusters words or features based upon their occurrences in different contexts (when used with the --lsa switch). Note: 1. Separate (--training) TRAIN data should not be used with word clustering. 2. Starting with Version 0.93, word clustering is no longer restricted to using only headless data. However, options specific to headed data such as --scope_test and target co-occurrence features (see below) cannot be used. =head4 --lsa Uses Latent Semantic Analysis (LSA) style representation for clustering features or contexts. LSA representation is the transpose of the context-by-feature matrix created using the native SenseClusters order1 context representation. This option can be used only in the following two combinations of the --context and the --wordclust options: =over =item 1. --context o1 --wordclust --lsa Performs feature clustering by grouping together features based on the contexts that they occur in. Features can be unigrams, bigrams or co-occurrences. Feature vectors are the rows of the transposed context-by-feature representation created by order1vec.pl. =item 2. --context o2 --lsa Performs context clustering by creating context vectors by averaging the feature vectors from the transposed context-by-feature representation of order1vec.pl. 
=back =head3 FEATURE OPTIONS : =head4 --feature TYPE Specify the feature type to be used for representing contexts. Possible options for feature type with first order context representation: bi - bigrams [default] tco - target co-occurrences co - co-occurrences uni - unigrams Possible options for feature type with second order context representation: bi - bigrams [default] co - co-occurrences tco - target co-occurrences Note: --tco (target co-occurrences) cannot be used with headless data i.e. test/train file without head/target word(s). =head4 --scope_train S1 Limits the scope of the training contexts to S1 words around (on both sides of) the TARGET word. Thus, it allows selection of local features. If --scope_train is used, each training instance is expected to include the target word as specified by the --target option or default target.regex. Note: --scope_train cannot be used with headless data i.e. train files without head/target word(s). =head4 --scope_test S2 Limits the scope of the test contexts to S2 words around (on both sides of) the TARGET word. Thus, it allows to match and use local features in the context vectors. Note: --scope_test cannot be used with headless data i.e. test files without head/target word(s). =head4 --stop STOPFILE A file of Perl regexes that define the stop list of words to be excluded from the features. STOPFILE could be specified with two modes - AND mode - declared by including '@stop.mode=AND' on the first line of the STOPFILE. - ignores word pairs in which both words are stop words. OR mode - declared by including '@stop.mode=OR' on the first line of the STOPFILE. - ignores word pairs in which either word is a stop word. Both modes exclude stop words from unigram features. Default is OR mode. =head4 --remove F Removes features that occur less than F times in the training corpus. =head4 --window W Specifies the window size for bigram/co-occurrence features. 
Pairs of words that co-occur within the specified window from each other (window W allows at most W-2 intervening words) will form the bigram/co-occurrence features. Default window size is 2 which allows only consecutive word pairs. Not applicable to unigram features. =head4 --stat STAT Bigrams and co-occurrences can be selected based on their statistical scores of association as specified by this option. If --vector = o2 and --stat is used, word association matrix will use the scores computed by the specified statistical test instead of simple joint frequency counts of the word pairs. Available tests of association are : dice - Dice Coefficient ll - Log Likelihood Ratio odds - Odds Ratio phi - Phi Coefficient pmi - Point-Wise Mutual Information tmi - True Mutual Information x2 - Chi-Squared Test tscore - T-Score leftFisher - Left Fisher's Test rightFisher - Right Fisher's Test By default, features are selected and represented using their frequency counts. =head4 --stat_rank N Word pairs ranking below N when arranged in descending order of their test scores are ignored. --stat_rank has no effect unless --stat is specified. =head4 --stat_score S Selects word pairs with scores greater than S after performing the selected test of association. Score could be any real number that will give reasonable number of features for the requested test. --stat_score has no effect unless --stat is specified. =head3 VECTOR OPTIONS : =head4 --context ORD Specifies the context representation to be used. Set ORD to 'o1' to use 1st order context vectors, and to 'o2' to select 2nd order context vectors. Default context representation is o2. =head4 --binary Creates binary feature and context vectors. By default, feature vectors show the joint frequency scores of the associated word pairs while the context vectors show the average of the feature vectors of words that occur in the context. 
With --binary turned ON, feature vectors show mere presence or absence of
the particular word pair (co-occurrence/bigram) in TRAIN, while the context
vectors will represent a binary 'OR' operation on the corresponding vectors
of contextual features.

=head3 SVD OPTIONS :

=head4 --svd

Reduces the feature space dimensions by performing Singular Value
Decomposition (SVD). By default, all feature dimensions are retained.

=head4 --k K

Reduces the dimensions of the feature space to K. Default K = 300

=head4 --rf RF

Specifies the scaling factor for reducing feature space dimensions such that
a feature space with N dimensions is reduced down to N/RF. Default RF = 10.
RF should be an integer greater than 1.

If both --k and --rf are specified, dimensions are reduced to min(k,N/RF).

Note: If the reduced dimensions ( min(k,N/RF) ) turn out to be less than or
equal to 10 then SVD is not performed.

=head4 --iter I

Specifies the number of iterations of SVD. Recommended value is 3 times the
desired K.

=head3 CLUSTER-STOPPING OPTIONS:

=head4 --cluststop CS

Specifies the cluster stopping measure to be used to predict the number of
clusters.

The possible option values:

   pk1 - Use PK1 measure
         [ PK1[m] = (crfun[m] - mean(crfun[1...deltaM]))/std(crfun[1...deltaM]) ]
   pk2 - Use PK2 measure
         [ PK2[m] = (crfun[m]/crfun[m-1]) ]
   pk3 - Use PK3 measure
         [ PK3[m] = ((2 * crfun[m])/(crfun[m-1] + crfun[m+1])) ]
   gap - Use Adapted Gap Statistic.
   pk  - Use all the PK measures.
   all - Use all the four cluster stopping measures.

More about these measures can be found in the documentation of
Toolkit/clusterstop/clusterstopping.pl

NOTE: Options --cluststop and --clusters (described under Clustering options)
cannot be used together.

=head4 --delta INT

NOTE: Delta value can only be a non-negative integer value.

Specify 0 to stop the iterating clustering process when two consecutive crfun
values are exactly equal. This is the default setting when the crfun values
are integer/whole numbers.
Specify non-zero positive integer to stop the iterating clustering process when the difference between two consecutive crfun values is less than or equal to this value. However, note that the integer value specified is internally shifted to capture the difference in the least significant digit of the crfun values when these crfun values are fractional. For example: For crfun = 1.23e-02 & delta = 1 will be transformed to 0.0001 For crfun = 2.45e-01 & delta = 5 will be transformed to 0.005 The default delta value when the crfun values are fractional is 1. However if the crfun values are integer/whole numbers (exponent >= 2) then the specified delta value is internally shifted only until the least significant digit in the scientific notation. For example: For crfun = 1.23e+04 & delta = 2 will be transformed to 200 For crfun = 2.45e+02 & delta = 5 will be transformed to 5 For crfun = 1.44e+03 & delta = 1 will be transformed to 10 =head4 --threspk1 NUM Specifies the threshold value that should be used by the PK1 measure to predict the k value. Default = -0.7 NOTE: This option should be used only when --cluststop option is also used with option value of "all" or "pk1". =head3 CLUSTER-STOPPING: ADAPTED GAP STATISTIC OPTIONS: =head4 --B NUM The number of replicates/references to be generated. Default: 1 =head4 --typeref TYP Specifies whether to generate B replicates from a reference or to generate B references. The possible option values: rep - replicates [Default] ref - references =head4 --percentage NUM Specifies the percentage confidence to be reported in the log file. Since Gap Statistic uses parametric bootstrap method for reference distribution generation, it is critical to understand the interval around the sample mean that could contain the population ("true") mean and with what certainty. Default: 90 =head4 --seed NUM The seed to be used with the random number generator. Default: No seed is set. 
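The way --delta is shifted can be reproduced with a small sketch. The rule
used here is inferred from the five examples given under --delta and is not
taken from the cluster stopping code itself: it assumes crfun is written in
scientific notation with a three-digit mantissa, so the least significant
digit sits two decimal places below the exponent. The helper name
shift_delta is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of discriminate.pl): scale an integer delta
# to the least significant digit of a crfun value written as d.dde+XX.
# The three-digit mantissa is an assumption inferred from the examples.
sub shift_delta {
    my ($crfun, $delta) = @_;
    # Read the decimal exponent from the scientific-notation string.
    my ($exponent) = sprintf("%.2e", $crfun) =~ /e([+-]\d+)$/;
    # The last mantissa digit is two decimal places below the exponent.
    return $delta * 10 ** ($exponent - 2);
}

# The five (crfun, delta) pairs listed under --delta.
for my $case ([1.23e-02, 1], [2.45e-01, 5], [1.23e+04, 2],
              [2.45e+02, 5], [1.44e+03, 1]) {
    my ($crfun, $delta) = @$case;
    printf "crfun = %.2e, delta = %d -> %g\n",
        $crfun, $delta, shift_delta($crfun, $delta);
}
```

With this rule a delta of 0 stays 0, which matches the "exactly equal"
stopping condition described for --delta 0.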
=head3 CLUSTERING OPTIONS :

=head4 --clusters N

Specifies number of clusters to be created. Default is set to 2.

=head4 --space SPACE

Specifies whether clustering is to be performed in vector or similarity
space. Set the value of SPACE to 'vector' to perform clustering in vector
space i.e. to cluster the context vectors directly. To cluster in similarity
space by explicitly finding the pair-wise similarities among the contexts,
set SPACE to 'similarity'. By default, clustering is performed in vector
space.

=head4 --clmethod CL

Specifies the clustering method.

Possible option values are :

   rb     - Repeated Bisections [Default]
   rbr    - Repeated Bisections followed by k-way refinement
   direct - Direct k-way clustering
   agglo  - Agglomerative clustering
   graph  - Graph partitioning-based clustering
   bagglo - Partitional biased Agglomerative clustering

For large amounts of data, 'rb', 'rbr' or 'direct' are recommended.

=head4 --crfun CR

Selects the criteria function for Clustering. The meanings of these criteria
functions are explained in Cluto's manual.

The possible values are:

   i1     - I1 Criterion function
   i2     - I2 Criterion function [default for partitional]
   e1     - E1 Criterion function
   g1     - G1 Criterion function
   g1p    - G1' Criterion function
   h1     - H1 Criterion function
   h2     - H2 Criterion function
   slink  - Single link merging scheme
   wslink - Single link merging scheme weighted w.r.t. cluster sim
   clink  - Complete link merging scheme
   wclink - Complete link merging scheme weighted w.r.t. cluster sim
   upgma  - Group average merging scheme [default for agglomerative]

Note that for cluster stopping, only the i1, i2, e1, h1 and h2 criterion
functions can be used. If a crfun other than these is selected then cluster
stopping uses the default crfun (i2) while the final clustering of contexts
is performed using the crfun specified.

=head4 --sim SIM

Specifies the similarity measure to be used for either vector or similarity
space clustering.
When --space = vector (or default), possible values of SIM are :

   cos  - Cosine [default]
   corr - Correlation Coefficient
   dist - Euclidean distance
   jacc - Extended Jaccard Coefficient

When --space = similarity and --binary is ON, possible values of SIM are -

   cos - Cosine [default]
   mat - Match
   jac - Jaccard
   ovr - Overlap
   dic - Dice

Otherwise, only cosine measure is available and is default.

The following table summarizes availability of similarity measures for 2
clustering approaches - vector(vcl) and similarity(scl) and on 2 different
types of context vectors - binary Vs frequency

          vcl+bin   vcl+freq   scl+bin   scl+freq
   cos       Y         Y          Y         Y
   mat       N         N          Y         N
   jacc      Y         Y          Y         N
   dice      N         N          Y         N
   ovr       N         N          Y         N
   dist      Y         Y          N         N
   corr      Y         Y          N         N

The reasons are purely implementation issues and in future, we plan to
support more consistent measures across these combinations.

=head4 --rowmodel RMOD

The option is used to specify the model to be used to scale every column of
each row. (For further details please refer Cluto manual)

The possible values for RMOD -

   none  - no scaling is performed (default setting)
   maxtf - post scaling the values are between 0.5 and 1.0
   sqrt  - square-root of actual values
   log   - log of actual values

=head4 --colmodel CMOD

The option is used to specify the model to be used to (globally) scale each
column across all rows. (For further details please refer Cluto manual)

The possible values for CMOD -

   none - no scaling is performed (default setting)
   idf  - scaling according to inverse-document-frequency

=head3 LABELING OPTIONS :

Note: Labeling options cannot be used with word-clustering (--wordclust).

=head4 --label_stop LABEL_STOPFILE

A file of Perl regexes that define the stop list of words to be excluded from
the features.

LABEL_STOPFILE could be specified with two modes -

   AND mode - declared by including '@stop.mode=AND' on the first line of
              the LABEL_STOPFILE
            - ignores word pairs in which both words are stop words.
OR mode - declared by including '@stop.mode=OR' on the first line of the LABEL_STOPFILE - ignores word pairs in which either word is a stop word. Default is OR. =head4 --label_remove LABEL_N Removes bigrams that occur less than LABEL_N times. =head4 --label_window LABEL_W Specifies the window size for bigrams. Pairs of words that co-occur within the specified window from each other (window LABEL_W allows at most LABEL_W-2 intervening words) will form the bigram features. Default window size is 2 which allows only consecutive word pairs. =head4 --label_stat LABEL_STAT Specifies the statistical scores of association. Available tests of association are : dice - Dice Coefficient ll - Log Likelihood Ratio odds - Odds Ratio phi - Phi Coefficient pmi - Point-Wise Mutual Information tmi - True Mutual Information x2 - Chi-Squared Test tscore - T-Score leftFisher - Left Fisher's Test rightFisher - Right Fisher's Test =head4 --label_rank LABEL_R Word pairs ranking below LABEL_R when arranged in descending order of their test scores are ignored. =head3 Other Options : =head4 --eval Evaluates clustering performance by computing precision and recall for maximally accurate assignment of sense tags to clusters. Maximal Assignment is when clusters are given sense labels such that maximum number of instances will be attached with their true sense tags. TEST instances tagged with multiple senses are automatically attached with the single sense-tag that is the most frequent among the attached tags. Note: This option can be used only if the answer tags are provided in the TEST file. =head4 --rank_filter R Allows to remove low frequency senses during evaluation. This will remove the senses that rank below R when senses in TEST are arranged in the descending order of their frequencies. In other words, it selects top R most frequent senses. An instance will be removed if it has all sense tags below rank R. 
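The effect of --rank_filter R, keeping the R most frequent senses, can be
sketched as follows. This is an illustration only and not the logic of
filter.pl: the helper name top_senses and the sense tags are hypothetical,
and breaking frequency ties alphabetically is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of filter.pl): return the R most frequent
# sense tags. Ties are broken alphabetically, which is an assumption.
sub top_senses {
    my ($rank, @tags) = @_;
    my %freq;
    $freq{$_}++ for @tags;                       # count each sense tag
    my @ranked = sort { $freq{$b} <=> $freq{$a} or $a cmp $b } keys %freq;
    my $last = $rank < @ranked ? $rank - 1 : $#ranked;
    return @ranked[0 .. $last];                  # senses ranked 1..R
}

# Made-up sense tags, one per instance.
my @kept = top_senses(2, qw(bank1 bank1 bank1 bank2 bank2 bank3));
print "kept senses: @kept\n";
```

An instance whose sense tags all fall outside the returned list would then
be dropped, as described for --rank_filter.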
=head4 --percent_filter P Allows to remove low frequency senses based on their percentage frequencies. This will remove senses whose frequency is below P% in the TEST data. If rank or percent filters are specified, they are applied after removing the multiple sense tags. =head4 --help Displays the quick summary of program options. =head4 --version Displays the version information. =head4 --verbose Displays to STDERR the current program status. =head4 --showargs Displays to STDOUT values of compulsory and required parameters. [NOT SUPPORTED IN THIS VERSION] =head1 OUTPUT discriminate.pl creates several output files. The discrimination of contexts performed by discriminate.pl, (i.e., a cluster assigned to each context) is given by the file $PREFIX.clusters if the number of clusters was set manually, otherwise by the file $PREFIX.clusters.$CLUSTSTOP where the $CLUSTSTOP specifies the cluster stopping measure that was used to predict the number of clusters. In addition, discriminate.pl also creates following files: NOTE: If a cluster stopping measure was used then it is indicated in the names of several output files by appending the cluster stopping measure name with the file name. Represented below as filename[.$CLUSTSTOP] =over =item * $PREFIX.clusters_context[.$CLUSTSTOP] - File containing all the input instances grouped by the cluster-id assigned to them. =item * $PREFIX[.$CLUSTSTOP].cluster.CLUSTERID - All the identified clusters and their instances are separated into different files. The filenames end with the cluster-id. e.g.: File containing instances of cluster 0 will be named as $PREFIX.cluster.0 =item * $PREFIX.report[.$CLUSTSTOP] - Confusion table if --eval is ON =item * $PREFIX.cluster_labels[.$CLUSTSTOP] - List of labels (word-pairs) assigned to each cluster. =item * $PREFIX[.$CLUSTSTOP].dendogram.ps - Dendograms + some information. 
=item * $PREFIX.features - Features file =item * $PREFIX.regex - File containing regular expressions for identifying the features listed in $PREFIX.features file. =item * $PREFIX.testregex - File containing only those regular expressions from the $PREFIX.regex file above, which match at least once in the test contexts, only created in second order context clustering mode (SC native as well as LSA) and LSA feature clustering mode =item * $PREFIX.wordvec - Word Vectors if --context = o2 =item * $PREFIX.vectors - Context Vectors =item * $PREFIX.rlabel - Row Labels of $PREFIX.vectors =item * $PREFIX.clabel - Column Labels of $PREFIX.vectors =item * $PREFIX.rclass - Class Ids of $PREFIX.vectors if --eval is ON =item * $PREFIX.cluster_solution[.$CLUSTSTOP] - Cluster ids of $PREFIX.vectors =item * $PREFIX.cluster_output[.$CLUSTSTOP] - Clustering program output =back =head3 Cluster Stopping related output files: =over =item * $PREFIX.pk1 - crfun[k] values, delta values, PK1[k] values and predicted k value =item * $PREFIX.pk2 - crfun[k] values, delta values, PK2[k] values and predicted k value =item * $PREFIX.pk3 - crfun[k] values, delta values, PK3[k] values and predicted k value =item * $PREFIX.gap - crfun[k] values, delta values and predicted k value =item * $PREFIX.gap.log - Gap(k), Obs(crfun(k)), Exp(crfun(k)) values etc. 
=back

=head3 The following files are created to facilitate creation of plots, if needed:

=over

=item * $PREFIX.cr.dat - value-pairs :- k-value crfun-value

=item * $PREFIX.pk1.dat - value-pairs :- k-value PK1[k] value

=item * $PREFIX.pk2.dat - value-pairs :- k-value PK2[k] value

=item * $PREFIX.pk3.dat - value-pairs :- k-value PK3[k] value

=item * $PREFIX.gap.dat - value-pairs :- k-value Gap[k] value

=item * $PREFIX.exp.dat - value-pairs :- k-value Exp(crfun[k]) value

=back

=head1 AUTHORS

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu

 Amruta Purandare, University of Pittsburgh

 Anagha Kulkarni, Carnegie-Mellon University

 Mahesh Joshi, Carnegie-Mellon University

=head1 COPYRIGHT

Copyright (c) 2002-2008, Ted Pedersen, Amruta Purandare, Anagha Kulkarni,
Mahesh Joshi

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.

=cut

###############################################################################

#                               THE CODE STARTS HERE

#$0 contains the program name along with
#the complete path.
# Extract just the program name and use it in error messages
$0=~s/.*\/(.+)/$1/;

###############################################################################

#                       ================================
#                        COMMAND LINE OPTIONS AND USAGE
#                       ================================

use Math::SparseMatrix;

# use the following perl module for command line options parsing
# Do not allow abbreviations of options i.e. options have to be spelled out completely.
use Getopt::Long qw(:config no_auto_abbrev);

# command line options
# catch, abort and print the message for unknown options specified
eval(GetOptions ("help","version","training=s","token=s","target=s","stop=s",
    "feature=s","remove=i","window=i","scope_train=i","scope_test=i","stat=s",
    "stat_rank=i","stat_score=f","context=s","binary","svd","k=i","rf=i",
    "iter=i","clusters=i","space=s","clmethod=s","crfun=s","sim=s","eval",
    "verbose","showargs","prefix=s","format=s","rank_filter=i",
    "percent_filter=f","label_window=i","label_stop=s","label_remove=i",
    "label_stat=s","label_rank=i","wordclust","split=i","rowmodel=s",
    "colmodel=s","cluststop=s","threspk1=f","delta=i","B=i","typeref=s",
    "percentage=i","seed=i", "lsa")) or die("Please check the above mentioned option(s).\n");

# show help option
if(defined $opt_help)
{
    $opt_help=1;
    &showhelp();
    exit;
}

# show version information
if(defined $opt_version)
{
    $opt_version=1;
    &showversion();
    exit;
}

# show minimal usage message if no arguments
if($#ARGV<0)
{
    &showminimal();
    exit 1;
}

#############################################################################

#                       ================================
#                          INITIALIZATION AND INPUT
#                       ================================

# Note on ERROR message conventions - error and warning messages from
# discriminate.pl should go to STDERR, and should be indented 1 tab.
# Error messages from Toolkit programs should be indented 2 tabs.
# TDP August, 2006 # ---------- # Testfile # ---------- if(!defined $ARGV[0]) { print STDERR "ERROR($0): Please specify the TEST file name...\n"; exit 1; } $testfile=$ARGV[0]; if(!-e $testfile) { print STDERR "ERROR($0): Could not locate the TEST file <$testfile>\n"; exit 1; } # --------------- # Tokenfile # --------------- if(defined $opt_token) { $token=$opt_token; } else { $token="token.regex"; } if(!-e $token) { print STDERR "ERROR($0): Could not locate the TOKEN file <$token>\n"; exit 1; } elsif(-z $token) { print STDERR "ERROR($0): TOKEN file <$token> is empty.\n"; exit 1; } # --------------- # Targetfile # --------------- my $target = ""; if(defined $opt_target) { $target=$opt_target; if(!-e $target) { print STDERR "ERROR($0): Could not locate the TARGET file <$target>\n"; exit 1; } } else { $target="target.regex"; # this will automatically create the target.regex file # in the current dir. if(!-e $target) { $status=system("maketarget.pl -head $testfile"); die "Error while running maketarget.pl on <$testfile>\n" unless $status==0; } } # -------------- # Prefix # -------------- if(defined $opt_prefix) { $prefix=$opt_prefix; } else { $prefix="expr" . 
time();
}

# --------------
# Format
# --------------
if(defined $opt_format)
{
    # the dot is escaped and the pattern is anchored so that only f16.XX
    # with 0 <= XX < 16 is accepted
    if($opt_format =~ /^f16\.(\d\d)$/ && $1 < 16)
    {
        $format=$opt_format;    ## format is defined, has valid form
        $prec = $1;             # precision
    }
    else
    {
        print STDERR "ERROR($0): --format must be of the form f16.XX, where 0 <= XX < 16, not $opt_format\n";
        exit 1;
    }
}
else
{
    $format = "f16.06";     ## format is not defined, use default
    $prec = 6;
}

# --------------
# SVD options
# --------------
if(!defined $opt_k)
{
    $opt_k=300;
}
if(!defined $opt_rf)
{
    $opt_rf=10;
}

# default feature
if(!defined $opt_feature)
{
    $opt_feature = "bi";
}

# initialize the variable for default number of clusters
$default_clusters = 2;

# --------------
# Error checks
# --------------
if(defined $opt_space)
{
    if($opt_space !~/^(vector|simil)/)
    {
        print STDERR "ERROR($0): --space should be either 'vector' or 'similarity'.\n";
        exit 1;
    }
}

if($opt_feature !~/^(bi(gram)?|co(occur|c)?|uni(gram)?|tco(occur|c)?)/)
{
    print STDERR "ERROR($0): Specified Feature type --$opt_feature is not supported.\n";
    exit 1;
}

if($opt_feature=~/^uni(gram)?/ && !defined $opt_lsa && (!defined $opt_context || $opt_context =~ /o2|order2/))
{
    print STDERR "ERROR($0): --feature cannot be $opt_feature when --context is o2, unless --lsa is also specified\n";
    exit 1;
}

if(defined $opt_split && ($opt_split >=100 || $opt_split <= 0))
{
    print STDERR "ERROR($0): The N value for the --split option should be between 1 and 99\n";
    exit 1;
}

# Option validations for Word Clustering and headless input data.

# Find the type (headed/headless) of the Test and Train data
# and then handle the following cases:
# Case 1: train headless / test headed
# Case 2: train headless / test headless
# Case 3: train headed / test headed
# Case 4: train headed / test headless

my $TestType = 0;  # By default headed
my $TrainType = 0; # By default headed

# check the Test data for <head> tag
open (INP,$testfile) || die "Error($0): Error(code=$!)
in opening <$testfile> file.\n"; # read the complete file in single instruction instead of reading line by line. my $temp_delimiter = $/; $/ = undef; my $inp_str = ; $/ = $temp_delimiter; close INP; # If the --eval option specified Then check if answer tags present if(defined $opt_eval) { if($inp_str !~ m/.+<\/head>/i) { $TestType = 1; # headless } # when separate training data specified if(defined $opt_training) { # training data cannot be provided word clustering. if(defined $opt_wordclust) { if (defined $opt_lsa) { print STDERR "ERROR($0): --training option cannot be used with feature-clustering.\n"; } else { print STDERR "ERROR($0): --training option cannot be used with word-clustering.\n"; } exit 1; } # check if the training file exists if(!-e $opt_training) { print STDERR "ERROR($0): Could not locate the TRAIN file <$opt_training>\n"; exit 1; } # check if the training file is a text file. if(!-T $opt_training) { print STDERR "ERROR($0): Training file has to be a plain text file. The provided file is not a text file. \n"; exit 1; } open (INP,$opt_training) || die "Error($0): Error(code=$!) in opening <$opt_training> file.\n"; # read the complete file in single instruction instead of reading line by line. my $temp_delimiter = $/; $/ = undef; my $inp_str = ; $/ = $temp_delimiter; close INP; # check if the training file is senseval2 formatted file - if yes quit. 
if($inp_str =~ m/ tag if($inp_str !~ m/.+<\/head>/i) { $TrainType = 1; } }
else # Test data to be used as Train data thus $TrainType = $TestType
{
    $TrainType = $TestType;
}

# scope cannot be used with headless training data
if (defined $opt_scope_train && $TrainType == 1)
{
    print STDERR "ERROR($0): --scope_train option cannot be used when the Train data is headless.\n";
    exit 1;
}

# scope cannot be used with headless test data, or when word clustering
# is requested
if (defined $opt_scope_test && ($TestType == 1 || defined $opt_wordclust))
{
    print STDERR "ERROR($0): --scope_test option cannot be used when the Test data is headless or when word clustering is requested.\n";
    exit 1;
}

# word-clustering is treated as headless type of clustering thus
# 1. check for target co-occurrence
# 2. target option
if(defined $opt_wordclust)
{
    # we do not allow tco as the feature type, headed data is allowed but
    # ... is simply a normal token in this case
    if(defined $opt_feature && $opt_feature eq "tco")
    {
        print STDERR "ERROR($0): target co-occurrences (tco) cannot be used as the feature type with word-clustering.\n";
        exit 1;
    }

    # headless case which cannot allow target file option
    if(defined $opt_target)
    {
        print STDERR "ERROR($0): --target option cannot be used with word-clustering.\n";
        exit 1;
    }
}

# --lsa cannot be used in o1 context representation, unless word clustering is specified
if (defined $opt_lsa)
{
    # $opt_context is undefined when the default (o2) is in effect
    if(defined $opt_context && $opt_context =~ /o1|order1/ && !defined $opt_wordclust)
    {
        print STDERR "ERROR($0): --lsa option cannot be used with --context o1 without specifying the --wordclust option\n";
        exit 1;
    }

    if((!defined $opt_context || $opt_context =~ /o2|order2/) && defined $opt_wordclust)
    {
        print STDERR "ERROR($0): --lsa option can be used either with \"--context o2\" (the default) or with \"--context o1 --wordclust\" options, but not with \"--context o2 --wordclust\".\n";
        exit 1;
    }
}

# Case 1: train headless / test headed
if($TrainType == 1 && $TestType == 0)
{
    # headed case which cannot allow
    # tco as the feature type
    if(defined $opt_feature && $opt_feature eq "tco")
    {
        print STDERR "ERROR($0): target co-occurrences (tco) cannot be used as the feature type when the Test/Train data is headless.\n";
        exit 1;
    }
}

# Case 2: train headless / test headless And
# Case 4: train headed / test headless
if(($TrainType == 1 && $TestType == 1) || ($TrainType == 0 && $TestType == 1))
{
    # headless case which cannot allow target file option
    if(defined $opt_target)
    {
        print STDERR "ERROR($0): --target option cannot be used with headless clustering.\n";
        exit 1;
    }

    # headless case which cannot allow tco as the feature type
    if(defined $opt_feature && $opt_feature eq "tco")
    {
        print STDERR "ERROR($0): target co-occurrences (tco) cannot be used as the feature type when the Test/Train data is headless.\n";
        exit 1;
    }
}

# Case 3: train headed / test headed
# No Special error checks required

# Check if Test and Train specified by user and --split option is also used
if(defined $opt_training && defined $opt_split)
{
    print STDERR "ERROR($0): Cannot use --split option to split the input data into Test and Train portions if separate Training data (--training) is already specified.\n";
    exit 1;
}

# the jacc and dist similarity measures can be used only
# when space is vector and clmethod is graph.
if((!defined $opt_space || $opt_space eq "vector") && (!defined $opt_clmethod || $opt_clmethod ne "graph") && defined $opt_sim && ($opt_sim eq "dist" || $opt_sim eq "jacc")) { print STDERR "ERROR($0): Similarity Measures (--sim) Euclidean distance and Jaccard can only be used if Clustering Method(--clmethod graph) is Graph and Clustering Space (--space vector) is Vector.\n"; exit 1; } if(defined $opt_space && $opt_space eq "similarity" && !defined $opt_binary && defined $opt_sim && $opt_sim ne "cos") { print STDERR "ERROR($0): Only Cosine Similarity Measure (--sim cos) is a valid option if Clustering space is similarity (--space similarity) and --binary option is not ON.\n"; exit 1; } if(defined $opt_space && $opt_space eq "similarity" && defined $opt_clmethod && $opt_clmethod eq "bagglo") { print STDERR "ERROR($0): Partitional biased Agglomerative clustering (--clmethod bagglo) available only for vector space.\n"; exit 1; } if(defined $opt_clmethod && $opt_clmethod ne "agglo" && defined $opt_crfun && ($opt_crfun eq "slink" || $opt_crfun eq "wslink" || $opt_crfun eq "clink" || $opt_crfun eq "wclink" || $opt_crfun eq "upgma")) { print STDERR "ERROR($0): $opt_crfun Criterion Function (--crfun $opt_crfun) valid only if Clustering Method is agglomerative (--clmethod agglo). \n"; exit 1; } # Error Checks for the rowmodel and colmodel options of Cluto if(defined $opt_rowmodel && $opt_rowmodel !~/^(none|maxtf|sqrt|log)$/) { print STDERR "ERROR($0): Specified rowmodel value: $opt_rowmodel is not supported.\n"; exit 1; } if(defined $opt_space && $opt_space eq "similarity" && defined $opt_rowmodel) { print STDERR "ERROR($0): --rowmodel option can be used only in vector space. 
\n"; exit 1; } if(defined $opt_colmodel && $opt_colmodel !~/^(none|idf)$/) { print STDERR "ERROR($0): Specified colmodel value: $opt_colmodel is not supported.\n"; exit 1; } if(defined $opt_space && $opt_space eq "similarity" && defined $opt_colmodel) { print STDERR "ERROR($0): --colmodel option can be used only in vector space. \n"; exit 1; } # cluster stopping related initializations and error checks # if neither #clusters nor cluster-stopping measure specified if(!defined $opt_clusters && !defined $opt_cluststop) { $opt_clusters = $default_clusters; } if(defined $opt_clusters && defined $opt_cluststop) { print STDERR "ERROR($0): --clusters and --cluststop options cannot be used together. \n"; exit 1; } if(defined $opt_cluststop && $opt_cluststop !~ /^(all|pk|pk1|pk2|pk3|gap)$/i) { print STDERR "ERROR($0): $opt_cluststop not a valid option value for --cluststop. \n"; exit 1; } if(!defined $opt_cluststop && defined $opt_threspk1) { print STDERR "ERROR($0): --threspk1 option can be used only when using --cluststop option. \n"; exit 1; } if(!defined $opt_cluststop && defined $opt_delta) { print STDERR "ERROR($0): --delta option can be used only when using --cluststop option. \n"; exit 1; } if(defined $opt_typeref && $opt_typeref !~ /^(rep|ref)$/i) { print STDERR "ERROR($0): $opt_typeref not a valid option value for --typeref. 
\n"; exit 1; } if(defined $opt_percentage && ($opt_percentage < 0 || $opt_percentage > 100)) { print STDERR "ERROR($0): The value for --percentage must be in the range [0,100] (inclusive).\n"; exit 1; } ############################################################################## # ------------------------- # Preprocessing # ------------------------- if(defined $opt_verbose) { print STDERR "Preprocessing the input data ...\n"; } # if TEST contains actual sense tags, # filter TEST to remove multiple # senses / instance if(defined $opt_eval) { # removing multiple senses of TEST instances $test_report="$prefix.test_report"; $status=system("frequency.pl $testfile > $test_report"); die "Error while running frequency.pl on <$testfile>\n" unless $status==0; $status=system("filter.pl --percent 0 --nomulti $testfile $test_report > $testfile.nomulti"); die "Error while running filter.pl on <$testfile>\n" unless $status==0; # applying filters now if(defined $opt_rank_filter || defined $opt_percent_filter) { if(defined $opt_verbose) { print STDERR "Removing Low Frequency Senses from TEST ...\n"; } if(defined $opt_rank_filter) { $filter_string="--rank $opt_rank_filter "; } else { $filter_string="--percent $opt_percent_filter "; } $status=system("filter.pl $filter_string $testfile.nomulti $test_report > $testfile.filtered"); die "Error while running filter.pl on <$testfile.nomulti>\n" unless $status==0; $testfile="$testfile.filtered"; } else { $testfile="$testfile.nomulti"; } } if(defined $opt_training) { $train_plain=$opt_training; $tmp_testfile = "$testfile.pro"; $status = system("preprocess.pl --token $token --removeNotToken --xml $tmp_testfile --nocount $testfile"); die "Error in running preprocess.pl on <$testfile>\n" unless $status==0; $testfile = $tmp_testfile; } else { if(defined $opt_split) { # convert test in sval2 to plain, process the test file and also split the data $train_plain="$prefix.train_plain"; $tmp_testfile = "$testfile.pro"; $status = 
system("preprocess.pl --token $token --removeNotToken --xml $tmp_testfile --count $train_plain --split $opt_split $testfile"); die "Error in running preprocess.pl on <$testfile>\n" unless $status==0; # delete the unnecessary file that get created by preprocessor.pl when used with the split option unlink "$tmp_testfile-training","$train_plain-test"; # use the appropriate test and train file henceforth $testfile = "$tmp_testfile-test"; $train_plain = "$train_plain-training"; $train_sval2=$testfile; } else { # convert test in sval2 to plain and also clean the test file $train_plain="$prefix.train_plain"; $tmp_testfile = "$testfile.pro"; $status = system("preprocess.pl --token $token --removeNotToken --xml $tmp_testfile --count $train_plain $testfile"); die "Error in running preprocess.pl on <$testfile>\n" unless $status==0; # use the clean test file henceforth $testfile = $tmp_testfile; $train_sval2=$testfile; } } ############################################ # Localizing the Context Scope in Training ############################################ if(defined $opt_scope_train) { if(defined $opt_verbose) { print STDERR "Localizing the Context Scope in TRAIN ...\n"; } if(!defined $train_sval2) { # converting training data to sval2 format $train_sval2="$prefix.train_sval2"; $status=system("text2sval.pl $train_plain > $train_sval2"); die "Could not run text2sval.pl on <$train_plain>\n" unless $status==0; } # running windower $train_context="$prefix.train_context"; if(defined $opt_target) { $status=system("windower.pl --plain --target $target --token $token $train_sval2 $opt_scope_train > $train_context"); die "Error while running windower.pl on <$train_sval2>\n" unless $status ==0; } else { $status=system("windower.pl --plain --token $token $train_sval2 $opt_scope_train > $train_context"); die "Error while running windower.pl on <$train_sval2>\n" unless $status ==0; } $train=$train_context; } else { $train=$train_plain; } ###################### # Selecting Features 
######################

if($opt_feature =~ /^uni(gram)?/)
{
    if(defined $opt_verbose)
    {
        print STDERR "Computing Unigram Counts ...\n";
    }
    $unigrams="$prefix.unigrams";
    $count_string="";
    if(defined $opt_remove)
    {
        $count_string="--remove $opt_remove ";
    }
    if(defined $opt_stop)
    {
        $count_string.="--stop $opt_stop ";
    }
    $status=system("count.pl --ngram 1 --newLine --token $token $count_string $unigrams $train");
    die "Error while running count.pl with --ngram 1 on <$train>\n" unless $status==0;
}
else
{
    ###########################
    # Computing Bigram Counts
    ###########################
    if(defined $opt_verbose)
    {
        print STDERR "Computing Bigram Counts ...\n";
    }
    $bigrams="$prefix.bigrams";
    $count_string="";
    if(defined $opt_remove)
    {
        $count_string="--remove $opt_remove ";
    }
    if(defined $opt_window)
    {
        $count_string.="--window $opt_window ";
    }
    if(defined $opt_stop)
    {
        $count_string.="--stop $opt_stop ";
    }
    $status=system("count.pl --extended --newLine --token $token $count_string $bigrams $train");
    die "Error while running count.pl on <$train>\n" unless $status==0;

    ###################
    # Combining Counts
    ###################
    if($opt_feature =~/^(co(occur|c)?|tco(occur|c)?)/)
    {
        if(defined $opt_verbose)
        {
            print STDERR "Combining Bigrams into Co-occurrence pairs ...\n";
        }

        # check the number of bigram features present
        open (INP,"<$bigrams") || die "Error($0): Error(code=$!) in opening <$bigrams> file\n";
        my $feat_cnt = 0;
        while(<INP>)
        {
            # skip the header
            if(/^@/)
            {
                next;
            }
            # capture the count
            if(/^(\d+)/)
            {
                $feat_cnt = $1;
                last;
            }
        }

        if(!$feat_cnt)
        {
            if($opt_feature =~/^tco(occur|c)?/)
            {
                print STDERR "ERROR($0): 0 FEATURES found in the <$bigrams> file. This will lead to 0 co-occurrence features and 0 target co-occurrence features. Therefore aborting the experiment.\n";
            }
            else
            {
                print STDERR "ERROR($0): 0 FEATURES found in the <$bigrams> file. This will lead to 0 co-occurrence features. 
Therefore aborting the experiment.\n";
            }
            exit 1;
        }

        $pairs="$prefix.cocs";
        $status=system("combig.pl $bigrams > $pairs");
        die "Error while running combig.pl on <$bigrams>\n" unless $status==0;

        if($opt_feature =~ /^tco(occur|c)?/) # target co-occurrences
        {
            if(defined $opt_verbose)
            {
                print STDERR "Finding Target Co-occurrences ...\n";
            }

            # select the target co-occurrences from the *.cocs file
            $target_pairs = "$prefix.target_cocs";
            open (INP,"<$pairs") || die "Error($0): Error(code=$!) in opening <$pairs> file\n";
            open (OUT,">$target_pairs") || die "Error($0): Error(code=$!) in opening <$target_pairs> file.\n";

            # select the word pairs with target word and write to a temp file
            # keep the count of number of such target word-pairs selected

            # extract the total number of features from the cocs file
            # usually the first number in the file.
            $total_feat = 0;
            do
            {
                $sent = <INP>;
                if($sent =~ m/^\s*(\d+)\s*$/)
                {
                    $total_feat = $1;
                }
            } until($total_feat != 0);

            # write the total number of features on the first line of the output file
            print OUT "$total_feat\n";

            while(<INP>)
            {
                # find and write out the target co-occurrences to the output file
                if(m/<head>.+<\/head>/i)
                {
                    print OUT;
                }
            }

            close INP;
            close OUT;

            $pairs=$target_pairs;
        }
    }
    else
    {
        $pairs=$bigrams;
    }

    ######################
    # Running Statistic
    ######################
    if(defined $opt_stat)
    {
        if(defined $opt_verbose)
        {
            print STDERR "Performing Statistics on Word Pairs ...\n";
        }
        $statistic="$prefix.statistic";
        $stat_string="";
        if(defined $opt_stat_rank)
        {
            $stat_string.="--rank $opt_stat_rank ";
        }
        if(defined $opt_stat_score)
        {
            $stat_string.="--score $opt_stat_score ";
        }

        # included statistic.pl's --precision option if format option specified
        $stat_string .= " --precision $prec ";

        $stat_string.="$opt_stat ";
        $status=system("statistic.pl $stat_string $statistic $pairs");
        die "Error while running statistic.pl on <$pairs>\n" unless $status ==0;
        $scores=$statistic;
    }
    else
    {
        $scores=$pairs;
    }
}

#############################
# Creating Context Vectors
#############################

$vectors="$prefix.vectors";

# -------------------------
# defining context scope
# -------------------------
if(defined $opt_scope_test)
{
    if(defined $opt_verbose)
    {
        print STDERR "Localizing the Context Scope in TEST ...\n";
    }
    $test_context="$prefix.test_context";
    if(defined $opt_target)
    {
        $status=system("windower.pl --token $token --target $target $testfile $opt_scope_test > $test_context");
        die "Error while running windower.pl on <$testfile>\n" unless $status==0;
    }
    else
    {
        $status=system("windower.pl --token $token $testfile $opt_scope_test > $test_context");
        die "Error while running windower.pl on <$testfile>\n" unless $status==0;
    }
}
else
{
    $test_context=$testfile;
}

$rlabel="$prefix.rlabel";
if(defined $opt_eval)
{
    $rclass="$prefix.rclass";
    $rclass_string="--rclass $rclass";
}
else
{
    $rclass_string="";
}
$clabel="$prefix.clabel";

# turned ON if svd defined and actually applied
my $svd_flag = 0;

# default context representation is order2
if(!defined $opt_context || $opt_context =~/o2|order2/)
{
    # do not rename any feature file to .features file yet, since
    # wordvec.pl produces a new .features file
    # just decide for now which is the features file
    if ($opt_feature =~ /^uni(gram)?/)
    {
        $featuresfile = $unigrams;
    }
    else
    {
        $featuresfile = $scores;
    }

    # check that at least 10 features are present in the features file.
    open(FEAT,$featuresfile) || die "Error($0): Error(code=$!) while opening the feature file <$featuresfile>\n";

    # read the complete file in a single instruction instead of reading line by line.
    my $temp_delimiter = $/;
    $/ = undef;
    my $inp_str = <FEAT>;
    $/ = $temp_delimiter;
    close FEAT;

    my $feat_cnt = 0;
    while($inp_str =~ m/<>.*\n/g && $feat_cnt < 10)
    {
        $feat_cnt++;
    }

    if($feat_cnt < 10)
    {
        print STDERR "ERROR($0): Only $feat_cnt FEATURES found in the <$featuresfile> file. 
At least 10 FEATURES required to proceed with context representation.\n"; exit 1; } if (defined $opt_lsa) { # we will get feature vectors from a feature-by-context matrix # the extension is maintained to be .wordvec to be # consistent with the web interface interpretation as of now $featvec="$prefix.wordvec"; $features = "$prefix.features"; # move the appropriate feature output file as the .features file if ($opt_feature =~ /^uni(gram)?/) { $status = system("mv $unigrams $features"); die "Error while moving <$unigrams> file to <$features>\n" unless $status==0; } else { $status = system("mv $scores $features"); die "Error while moving <$scores> file to <$features>\n" unless $status==0; } # ----------------------- # finding feature regexs # ----------------------- if(defined $opt_verbose) { print STDERR "Finding Feature Regex/s ...\n"; } $feature_regex="$prefix.regex"; $status=system("nsp2regex.pl $features > $feature_regex"); die "Error while running nsp2regex.pl on <$features>\n" unless $status==0; if(defined $opt_verbose) { print STDERR "Building First Order Vectors for LSA...\n"; } # we are doing context clustering in lsa fashion # binary requested if(defined $opt_binary) { $binary="--binary"; } else { $binary=""; } $o1_presvd="$prefix.o1_presvd"; # do not generate the .rclass file and the .rlabel / .clabel files # generate the .testregex file which corresponds to the features # identified in the test data, this needs to be passed to # order2vec.pl later # Also specify --transpose option, for getting a feature-by-context # representation $testregex = "$prefix.testregex"; $status=system("order1vec.pl --transpose --testregex $testregex $binary $test_context $feature_regex > $o1_presvd"); die "Error while running order1vec.pl on <$test_context>\n" unless $status==0; # the keyfile produced by order1vec.pl should be removed, since later # order2vec.pl will create another one unlink ; # set input file for svd $svdinput = $o1_presvd; # set an output file name for svd 
        $postsvdvectors = $featvec;
    }
    else
    {
        # we are doing either context clustering or word clustering in SC fashion
        if(defined $opt_verbose)
        {
            print STDERR "Building Word Vectors ...\n";
        }
        $wordvec="$prefix.wordvec";

        # creating word vectors from scores file
        $wordvec_presvd="$prefix.wordvec_presvd";
        $features = "$prefix.features";
        $dims="$prefix.dims";
        $wordvec_string="--feats $features --dims $dims ";
        if($opt_feature=~/^co(occur|c)?|tco(occur|c)?/)
        {
            $wordvec_string.="--wordorder nocare ";
        }
        else
        {
            $wordvec_string.="--wordorder follow ";
        }
        if(defined $opt_binary)
        {
            $wordvec_string.="--binary ";
        }

        $status=system("wordvec.pl --format $format $wordvec_string $scores > $wordvec_presvd");
        die "ERROR($0): Error while running wordvec.pl\n" unless $status==0;

        # set input file for svd
        $svdinput = $wordvec_presvd;

        # set an output file name for svd
        $postsvdvectors = $wordvec;
    }

    # SVD
    if(defined $opt_svd)
    {
        # Check if performing svd will reduce the number of features, i.e. the
        # number of columns, to fewer than 10; if so, do not perform svd
        open(INSVD,$svdinput) || die "Error($0): Error(code=$!) in opening Matrix file <$svdinput>\n";

        # line1 in Matrix file should either show the
        # <keyfile> tag or #rows #cols #nnz
        $line1=<INSVD>;
        if($line1=~/keyfile/)
        {
            $line1=<INSVD>;
        }
        if($line1=~/^\s*(\d+)\s+(\d+)\s+(\d+)\s*$/)
        {
            $rows=$1;
            $cols=$2;
            $nnz1=$3;
        }
        else
        {
            print STDERR "ERROR($0): Line $line1 in Matrix file <$svdinput> should show #rows #cols #nnz\n";
            exit 1;
        }
        close INSVD;

        $flag_svd = 0;
        $maxprs=$opt_k > ($cols/$opt_rf) ? int($cols/$opt_rf) : $opt_k;
        if($maxprs >= 10)
        {
            if(defined $opt_verbose)
            {
                print STDERR "Performing SVD ...\n";
            }
            $svd_flag = 1;

            # calling svd(input,output)
            svd($svdinput, $postsvdvectors);
            $flag_svd = 1;
        }
        else
        {
            print STDERR "WARNING($0): SVD could not be performed on SVDINPUT <$svdinput> because svd with reduction factor = $opt_k and scaling factor = $opt_rf would reduce the resultant number of features to = $maxprs, computed via (min($opt_k, $cols/$opt_rf)). 
The minimum number of features required for representing the contexts is 10\n"; $status=system("mv $svdinput $postsvdvectors"); die "Error while creating <$postsvdvectors> file.\n" unless $status==0; } } else { $status=system("mv $svdinput $postsvdvectors"); die "Error while creating <$postsvdvectors> file.\n" unless $status==0; } # If word clustering (synonym finding) do not create context vectors but # instead pass the word vectors to the clustering stage. if(defined $opt_wordclust) { $status=system("mv $wordvec $vectors"); die "Error while creating <$vectors> file.\n" unless $status==0; $status=system("mv $features $rlabel"); die "Error while creating <$rlabel> file.\n" unless $status==0; $status=system("mv $dims $clabel"); die "Error while creating <$clabel>\n" unless $status==0; } else { # -------------------------- # Creating Context Vectors # -------------------------- if (!defined $opt_lsa) { # only in native SC order2 context clustering mode, generate # a regex file from the output of wordvec.pl. 
we don't do # this immediately after calling wordvec.pl above as that # will be unnecessarily created in SC word clustering mode # generate a .testregex file from the $features file created by # wordvec.pl $testregex = "$prefix.testregex"; $status=system("nsp2regex.pl $features > $testregex"); die "Error while running nsp2regex.pl on <$features>\n" unless $status==0; } if(defined $opt_verbose) { print STDERR "Building 2nd Order Context Vectors ...\n"; } $context_string="--rlabel $rlabel "; if(defined $opt_svd && $flag_svd == 1) { $context_string.="--dense "; } if(defined $opt_binary) { $context_string.="--binary "; } $status=system("order2vec.pl --format $format $context_string $rclass_string $test_context $postsvdvectors $testregex > $vectors"); die "Error while running order2vec.pl on <$test_context>\n" unless $status==0; } } # requested context type is order1 else { $features="$prefix.features"; if($opt_feature=~/^uni(gram)?/) { $status=system("mv $unigrams $features"); die "Error while creating Unigram Feature file <$features>\n" unless $status==0; } else { $status=system("mv $scores $features"); die "Error while creating Bigram Feature file <$features>\n" unless $status==0; } # else # target co-occurrences # { # if(defined $opt_verbose) # { # print STDERR "Finding Target Co-occurrences ...\n"; # } # # run kocos to find co-occurrences from scores file # $status=system("kocos.pl --order 1 --regex $target $scores > $features"); # die "Error while running kocos.pl on $scores.\n" unless $status==0; # } # check if atleast 10 feature present in the features file. open(FEAT,$features) || die "Error($0): Error(code=$!) while opening the feature file <$features>\n"; # read the complete file in single instruction instead of reading line by line. 
    my $temp_delimiter = $/;
    $/ = undef;
    my $inp_str = <FEAT>;
    $/ = $temp_delimiter;
    close FEAT;

    my $feat_cnt = 0;
    while($inp_str =~ m/<>.*\n/g && $feat_cnt < 10)
    {
        $feat_cnt++;
    }

    if($feat_cnt < 10)
    {
        print STDERR "ERROR($0): Only $feat_cnt FEATURES found in the <$features> file. At least 10 FEATURES required to proceed with context representation.\n";
        exit 1;
    }

    # -----------------------
    # finding feature regexs
    # -----------------------
    if(defined $opt_verbose)
    {
        print STDERR "Finding Feature Regex/s ...\n";
    }
    $feature_regex="$prefix.regex";
    $status=system("nsp2regex.pl $features > $feature_regex");
    die "Error while running nsp2regex.pl on <$features>\n" unless $status==0;

    # -------------------------
    # creating context vectors
    # -------------------------
    if(defined $opt_verbose)
    {
        print STDERR "Building 1st Order Context Vectors ...\n";
    }

    # binary requested
    if(defined $opt_binary)
    {
        $binary="--binary";
    }
    else
    {
        $binary="";
    }
    $o1_presvd="$prefix.o1_presvd";

    if (defined $opt_lsa)
    {
        # do not create .rclass file and .clabel file in word / feature
        # clustering
        # create the .rlabel file and specify --transpose option to get
        # feature-by-context output
        # MJ - 06/30/2006
        # we also need to specify --testregex option with --transpose,
        # although we don't use it in LSA feature clustering.
        $testregex = "$prefix.testregex";
        $status=system("order1vec.pl --transpose --testregex $testregex --rlabel $rlabel $binary $test_context $feature_regex > $o1_presvd");
    }
    else
    {
        # print STDERR "order1vec.pl $binary --rlabel $rlabel $rclass_string --clabel $clabel $test_context $feature_regex > $o1_presvd\n";
        $status=system("order1vec.pl $binary --rlabel $rlabel $rclass_string --clabel $clabel $test_context $feature_regex > $o1_presvd");
    }
    die "ERROR ($0): Error (code=$!) while running order1vec.pl on <$test_context>\n" unless $status==0;

    $svdinput = $o1_presvd;

    # SVD
    if(defined $opt_svd)
    {
        # Check if performing svd will reduce the number of features i.e.
        # number of columns, to fewer than 10; if so, do not perform svd
        open(INSVD,$svdinput) || die "Error($0): Error(code=$!) in opening Matrix file <$svdinput>\n";

        # line1 in Matrix file should either show the
        # <keyfile> tag or #rows #cols #nnz
        $line1=<INSVD>;
        if($line1=~/keyfile/)
        {
            $line1=<INSVD>;
        }
        if($line1=~/^\s*(\d+)\s+(\d+)\s+(\d+)\s*$/)
        {
            $rows=$1;
            $cols=$2;
            $nnz1=$3;
        }
        else
        {
            print STDERR "ERROR($0): Line $line1 in Matrix file <$svdinput> should show #rows #cols #nnz\n";
            exit 1;
        }
        close INSVD;

        $maxprs=$opt_k > ($cols/$opt_rf) ? int($cols/$opt_rf) : $opt_k;
        if($maxprs >= 10)
        {
            if(defined $opt_verbose)
            {
                print STDERR "Performing SVD ...\n";
            }
            $svd_flag = 1;

            # calling svd function
            svd($svdinput,$vectors);
        }
        else
        {
            print STDERR "WARNING($0): SVD could not be performed on SVDINPUT <$svdinput> because svd with reduction factor = $opt_k and scaling factor = $opt_rf would reduce the resultant number of features to = $maxprs, computed via (min($opt_k, $cols/$opt_rf)). The minimum number of features required for representing the contexts is 10\n";
            $status=system("mv $svdinput $vectors");
            die "Error while creating file <$vectors>\n" unless $status==0;
        }
    }
    else
    {
        $status=system("mv $svdinput $vectors");
        die "Error while creating file <$vectors>\n" unless $status==0;
    }
}

##############
# Clustering
##############

# cluster stopping param string
$cluststop_str = "";

# params common to both vcluster and scluster
$cluster_str ="--rlabelfile $rlabel ";

if(defined $opt_clmethod)
{
    $cluster_str .="--clmethod $opt_clmethod ";

    if($opt_clmethod =~ /^(rb|rbr|direct|agglo|bagglo)$/i)
    {
        $cluststop_str .="--clmethod $opt_clmethod ";
    }
    else
    {
        $cluststop_str .="--clmethod rb ";
    }
}

if(defined $opt_crfun)
{
    $cluster_str .="--crfun $opt_crfun ";

    if($opt_crfun =~ /^(i1|i2|h1|h2|e1)$/i)
    {
        $cluststop_str .="--crfun $opt_crfun ";
    }
    else
    {
        $cluststop_str .="--crfun i2 ";
    }
}

# cluster in vector space
if(!defined $opt_space || $opt_space =~/^vector$/)
{
    if(defined $opt_verbose)
    {
        print STDERR "Clustering in 
Vector Space ...\n"; } # build the string of params for vcluster $vclus_str = $cluster_str; if(defined $opt_sim) { $vclus_str .= "--sim $opt_sim "; if($opt_sim =~ /^(cos|corr)$/i) { $cluststop_str .= "--sim $opt_sim "; } else { $cluststop_str .= "--sim cos "; } if($opt_sim =~ /^co/) { $vclus_str .="--showfeatures "; } } $clabel_str = ""; if (-f $clabel) { $clabel_str = "--clabel $clabel"; } $vclus_str .="--nfeatures 10 $clabel_str "; # row scaling option if(defined $opt_rowmodel) { $vclus_str .= "--rowmodel $opt_rowmodel "; $cluststop_str .= "--rowmodel $opt_rowmodel "; } else { $vclus_str .= "--rowmodel none "; $cluststop_str .= "--rowmodel none "; } # column scaling option if(defined $opt_colmodel) { $vclus_str .= "--colmodel $opt_colmodel "; $cluststop_str .= "--colmodel $opt_colmodel "; } else { $vclus_str .= "--colmodel none "; $cluststop_str .= "--colmodel none "; } # cluster stopping if(defined $opt_cluststop) { $cluststop = $opt_cluststop; if(defined $opt_verbose) { print STDERR "Finding Number of Clusters with Cluster Stopping...\n"; } if(defined $opt_threspk1) { $cluststop_str .= "--threspk1 $opt_threspk1 "; } if(defined $opt_delta) { $cluststop_str .= "--delta $opt_delta "; } if(defined $opt_B) { $cluststop_str .= "--B $opt_B "; } if(defined $opt_typeref) { $cluststop_str .= "--typeref $opt_typeref "; } if(defined $opt_percentage) { $cluststop_str .= "--percentage $opt_percentage "; } if(defined $opt_seed) { $cluststop_str .= "--seed $opt_seed "; } $cluststop_str .= "--space vector --measure $opt_cluststop --precision $prec "; $status = system("clusterstopping.pl --prefix $prefix $cluststop_str $vectors >& $prefix.predictions"); # error handling for clusterstopping.pl if ($status != 0) { my $tmp = uc $opt_cluststop; # if predictions file not created fall-back to using the default #clusters if(!-e "$prefix.predictions") { print STDERR "WARNING($0): Could not locate the PREDICTIONS <$prefix.predictions> file which indicates that the cluster-stopping 
measure $tmp failed to predict the optimal number of clusters for the VECTORS <$vectors> file. Proceeding with the default number of clusters of $default_clusters\n\n";

                # default number of clusters
                $opt_clusters = $default_clusters;
            }
            else
            {
                # if predictions file exists then print out the error message present in the file
                # and fall-back to using the default #clusters
                open (TFP,"$prefix.predictions");
                $errstr = "";
                while(<TFP>)
                {
                    $errstr .= $_;
                }
                close TFP;

                print STDERR "WARNING($0): $errstr The cluster-stopping measure $tmp failed to predict the optimal number of clusters for <$vectors>. Proceeding with the default number of clusters of $default_clusters\n\n";

                # default #clusters
                $opt_clusters = $default_clusters;
            }

            # undefine cluster-stopping option to indicate that the #clusters being used is not
            # predicted by the measures but is set manually to the default value.
            $opt_cluststop = undef;

            # proceed with the default #clusters
            $num_k = 0;
            $predict[$num_k] = $opt_clusters;

            $cluster_solution ="$prefix.cluster_solution";
            $cluster_output ="$prefix.cluster_output";
            $vclus_str .="--clustfile $cluster_solution ";

            # running vcluster
            # use the -showtree option only if the #clusters is greater than 1
            if($opt_clusters > 1)
            {
                my $tmp_fig_str = "--showtree --plotclusters $prefix.dendogram.ps --plotformat ps ";
                system("vcluster $vclus_str $rclass_string $tmp_fig_str $vectors $opt_clusters > $cluster_output");
            }
            else
            {
                system("vcluster $vclus_str $rclass_string $vectors $opt_clusters > $cluster_output");
            }
        }
        else # If clusterstopping.pl ran successfully.
        {
            open (TFP,"$prefix.predictions") || die "Error($0): Error(code=$!) 
in opening <$prefix.predictions> file.\n";

            $num_k = 0;
            while(<TFP>)
            {
                chomp;
                $predict[$num_k++] = $_;
            }
            $num_k--;
            close TFP;

            $i = 0;
            while($i <= $num_k)
            {
                $opt_clusters = $predict[$i];

                if($cluststop ne "all" && $cluststop ne "pk")
                {
                    $cluster_solution ="$prefix.cluster_solution.$cluststop";
                    $cluster_output ="$prefix.cluster_output.$cluststop";
                    $dendo_file = "$prefix.$cluststop.dendogram.ps";
                }
                else
                {
                    if($i == 0)
                    {
                        $cluster_solution ="$prefix.cluster_solution.pk1";
                        $cluster_output ="$prefix.cluster_output.pk1";
                        $dendo_file = "$prefix.pk1.dendogram.ps";
                    }
                    elsif($i == 1)
                    {
                        $cluster_solution ="$prefix.cluster_solution.pk2";
                        $cluster_output ="$prefix.cluster_output.pk2";
                        $dendo_file = "$prefix.pk2.dendogram.ps";
                    }
                    elsif($i == 2)
                    {
                        $cluster_solution ="$prefix.cluster_solution.pk3";
                        $cluster_output ="$prefix.cluster_output.pk3";
                        $dendo_file = "$prefix.pk3.dendogram.ps";
                    }
                    elsif($i == 3)
                    {
                        $cluster_solution ="$prefix.cluster_solution.gap";
                        $cluster_output ="$prefix.cluster_output.gap";
                        $dendo_file = "$prefix.gap.dendogram.ps";
                    }
                }

                $update_str ="--clustfile $cluster_solution ";

                # running vcluster
                # use the -showtree option only if the #clusters is greater than 1
                if($opt_clusters > 1)
                {
                    my $tmp_fig_str = "--showtree --plotclusters $dendo_file --plotformat ps ";
                    system("vcluster $vclus_str $update_str $rclass_string $tmp_fig_str $vectors $opt_clusters > $cluster_output");
                }
                else
                {
                    system("vcluster $vclus_str $update_str $rclass_string $vectors $opt_clusters > $cluster_output");
                }

                $i++;
            }
        }
    }
    else # if not using cluster stopping measures
    {
        $num_k = 0;
        $predict[$num_k] = $opt_clusters;

        $cluster_solution ="$prefix.cluster_solution";
        $cluster_output ="$prefix.cluster_output";
        $vclus_str .="--clustfile $cluster_solution ";

        # running vcluster
        # use the -showtree option only if the #clusters is greater than 1
        if($opt_clusters > 1)
        {
            my $tmp_fig_str = "--showtree --plotclusters $prefix.dendogram.ps --plotformat ps ";
            system("vcluster $vclus_str $rclass_string $tmp_fig_str $vectors 
$opt_clusters > $cluster_output"); } else { system("vcluster $vclus_str $rclass_string $vectors $opt_clusters > $cluster_output"); } } } else # cluster in similarity space { if(defined $opt_verbose) { print STDERR "Building Similarity Matrix ...\n"; } # creating similarity matrix $simat="$prefix.simat"; my $simat_string = " "; if(defined $opt_svd && $svd_flag == 1) { $simat_string ="--dense "; } if(defined $opt_binary) { if(defined $opt_sim) { $simat_string .="--measure $opt_sim "; } $sim_program ="bitsimat.pl"; } else { $sim_program ="simat.pl"; } $status=system("$sim_program --format $format $simat_string $vectors > $simat"); die "Error while running $sim_program\n" unless $status==0; if(defined $opt_verbose) { print STDERR "Clustering in Similarity Space ...\n"; } # build the string of params for scluster $sclus_str = $cluster_str; # cluster stopping if(defined $opt_cluststop) { $cluststop = $opt_cluststop; if(defined $opt_verbose) { print STDERR "Finding Number of Clusters with Cluster Stopping...\n"; } if(defined $opt_threspk1) { $cluststop_str .= "--threspk1 $opt_threspk1 "; } if(defined $opt_delta) { $cluststop_str .= "--delta $opt_delta "; } if(defined $opt_B) { $cluststop_str .= "--B $opt_B "; } if(defined $opt_typeref) { $cluststop_str .= "--typeref $opt_typeref "; } if(defined $opt_percentage) { $cluststop_str .= "--percentage $opt_percentage "; } if(defined $opt_seed) { $cluststop_str .= "--seed $opt_seed "; } $cluststop_str .= "--space similarity --measure $opt_cluststop --precision $prec "; $status = system("clusterstopping.pl --prefix $prefix $cluststop_str $simat >& $prefix.predictions"); # error handling for clusterstopping.pl # If clusterstopping.pl returned an error code if ($status != 0) { my $tmp = uc $opt_cluststop; # if predictions file not created fall-back to using the default #clusters if(!-e "$prefix.predictions") { print STDERR "WARNING($0): Could not locate the PREDICTIONS <$prefix.predictions> file which indicates that the 
cluster-stopping measure $tmp failed to predict the optimal number of clusters for the VECTORS <$vectors> file. Proceeding with the default number of clusters of $default_clusters\n\n";

                # default #clusters
                $opt_clusters = $default_clusters;
            }
            else
            {
                # if predictions file exists then print out the error message present in the file
                # and fall-back to using the default #clusters
                open (TFP,"$prefix.predictions");
                $errstr = "";
                while(<TFP>)
                {
                    $errstr .= $_;
                }
                close TFP;

                print STDERR "WARNING($0): $errstr The cluster-stopping measure $tmp failed to predict the optimal number of clusters for the given data. Proceeding with the default number of clusters of $default_clusters\n\n";

                # default #clusters
                $opt_clusters = $default_clusters;
            }

            # undefine cluster-stopping option to indicate that the #clusters being used is not
            # predicted by the measures but is set manually to the default value.
            $opt_cluststop = undef;

            # proceed with the default #clusters
            $num_k = 0;
            $predict[$num_k] = $opt_clusters;

            $cluster_solution ="$prefix.cluster_solution";
            $cluster_output ="$prefix.cluster_output";
            $sclus_str .="--clustfile $cluster_solution ";

            # running scluster
            # use the -showtree option only if the #clusters is greater than 1
            if($opt_clusters > 1)
            {
                my $tmp_fig_str = "--showtree --plotsclusters $prefix.dendogram.ps --plotformat ps ";
                system("scluster $sclus_str $rclass_string $tmp_fig_str $simat $opt_clusters > $cluster_output");
            }
            else
            {
                system("scluster $sclus_str $rclass_string $simat $opt_clusters > $cluster_output");
            }
        }
        else # If clusterstopping.pl ran successfully.
        {
            open (TFP,"$prefix.predictions") || die "Error($0): Error(code=$!) 
in opening <$prefix.predictions> file.\n";

            $num_k = 0;
            while(<TFP>)
            {
                chomp;
                $predict[$num_k++] = $_;
            }
            $num_k--;
            close TFP;

            $i = 0;
            while($i <= $num_k)
            {
                $opt_clusters = $predict[$i];

                if($cluststop ne "all" && $cluststop ne "pk")
                {
                    $cluster_solution ="$prefix.cluster_solution.$cluststop";
                    $cluster_output ="$prefix.cluster_output.$cluststop";
                    $dendo_file = "$prefix.$cluststop.dendogram.ps";
                }
                else
                {
                    if($i == 0)
                    {
                        $cluster_solution ="$prefix.cluster_solution.pk1";
                        $cluster_output ="$prefix.cluster_output.pk1";
                        $dendo_file = "$prefix.pk1.dendogram.ps";
                    }
                    elsif($i == 1)
                    {
                        $cluster_solution ="$prefix.cluster_solution.pk2";
                        $cluster_output ="$prefix.cluster_output.pk2";
                        $dendo_file = "$prefix.pk2.dendogram.ps";
                    }
                    elsif($i == 2)
                    {
                        $cluster_solution ="$prefix.cluster_solution.pk3";
                        $cluster_output ="$prefix.cluster_output.pk3";
                        $dendo_file = "$prefix.pk3.dendogram.ps";
                    }
                    elsif($i == 3)
                    {
                        $cluster_solution ="$prefix.cluster_solution.gap";
                        $cluster_output ="$prefix.cluster_output.gap";
                        $dendo_file = "$prefix.gap.dendogram.ps";
                    }
                }

                $update_str ="--clustfile $cluster_solution ";

                # running scluster
                # use the -showtree option only if the #clusters is greater than 1
                if($opt_clusters > 1)
                {
                    my $tmp_fig_str = "--showtree --plotsclusters $dendo_file --plotformat ps ";
                    system("scluster $sclus_str $update_str $rclass_string $tmp_fig_str $simat $opt_clusters > $cluster_output");
                }
                else
                {
                    system("scluster $sclus_str $update_str $rclass_string $simat $opt_clusters > $cluster_output");
                }

                $i++;
            }
        }
    }
    else # if not using cluster stopping measures
    {
        $num_k = 0;
        $predict[$num_k] = $opt_clusters;

        $cluster_solution ="$prefix.cluster_solution";
        $cluster_output ="$prefix.cluster_output";
        $sclus_str .="--clustfile $cluster_solution ";

        # running scluster
        # use the -showtree option only if the #clusters is greater than 1
        if($opt_clusters > 1)
        {
            my $tmp_fig_str = "--showtree --plotsclusters $prefix.dendogram.ps --plotformat ps ";
            system("scluster $sclus_str $rclass_string $tmp_fig_str $simat 
$opt_clusters > $cluster_output"); } else { system("scluster $sclus_str $rclass_string $simat $opt_clusters > $cluster_output"); } } } #********************* # formatting clustering solution, show instances in each cluster $i = 0; while($i <= $num_k) { if(defined $opt_cluststop) { if($cluststop ne "all" && $cluststop ne "pk") { $clusters="$prefix.clusters.$cluststop"; $cluster_solution = "$prefix.cluster_solution.$cluststop"; $clusters_context = "$prefix.clusters_context.$cluststop"; } else { if($i == 0) { $clusters="$prefix.clusters.pk1"; $cluster_solution = "$prefix.cluster_solution.pk1"; $clusters_context = "$prefix.clusters_context.pk1"; } elsif($i == 1) { $clusters="$prefix.clusters.pk2"; $cluster_solution = "$prefix.cluster_solution.pk2"; $clusters_context = "$prefix.clusters_context.pk2"; } elsif($i == 2) { $clusters="$prefix.clusters.pk3"; $cluster_solution = "$prefix.cluster_solution.pk3"; $clusters_context = "$prefix.clusters_context.pk3"; } elsif($i == 3) { $clusters="$prefix.clusters.gap"; $cluster_solution = "$prefix.cluster_solution.gap"; $clusters_context = "$prefix.clusters_context.gap"; } } } else # No. of Clusters: Set Manually { $clusters="$prefix.clusters"; $cluster_solution = "$prefix.cluster_solution"; $clusters_context = "$prefix.clusters_context"; } if(defined $opt_wordclust) { $status=system("format_clusters.pl $cluster_solution $rlabel > $clusters"); die "Error while formatting clusters.\n" unless $status==0; } else { $status=system("format_clusters.pl $cluster_solution $rlabel --senseval2 $testfile > $clusters"); die "Error while formatting clusters.\n" unless $status==0; # execute the format_clusters.pl with --context option and use this file to label the clusters. 
        $status=system("format_clusters.pl $cluster_solution $rlabel --context $testfile > $clusters_context");
        die "Error while running format_clusters.pl $cluster_solution $rlabel --context $testfile > $clusters_context\n" unless $status==0;
    }
    $i++;
}

if(!defined $opt_wordclust)
{
    # create the parameter string for clusterlabeling.pl
    if(defined $opt_verbose)
    {
        print STDERR "Creating Cluster Labels ...\n";
    }

    $cluslabel_str = " --token $token ";

    if(defined $opt_label_window)
    {
        $cluslabel_str .= " --window $opt_label_window ";
    }

    if(defined $opt_label_stop)
    {
        $cluslabel_str .= " --stop $opt_label_stop ";
    }

    if(defined $opt_label_remove)
    {
        $cluslabel_str .= " --remove $opt_label_remove ";
    }

    if(defined $opt_label_stat)
    {
        $cluslabel_str .= " --stat $opt_label_stat ";
    }

    if(defined $opt_label_rank)
    {
        $cluslabel_str .= " --rank $opt_label_rank ";
    }

    $i = 0;
    while($i <= $num_k)
    {
        if(defined $opt_cluststop)
        {
            if($cluststop ne "all" && $cluststop ne "pk")
            {
                $clusters_context = "$prefix.clusters_context.$cluststop";
                $cluster_labels = "$prefix.cluster_labels.$cluststop";
                $param_str = $cluslabel_str . "--prefix $prefix.$cluststop ";
            }
            else
            {
                if($i == 0)
                {
                    $clusters_context = "$prefix.clusters_context.pk1";
                    $cluster_labels = "$prefix.cluster_labels.pk1";
                    $param_str = $cluslabel_str . "--prefix $prefix.pk1 ";
                }
                elsif($i == 1)
                {
                    $clusters_context = "$prefix.clusters_context.pk2";
                    $cluster_labels = "$prefix.cluster_labels.pk2";
                    $param_str = $cluslabel_str . "--prefix $prefix.pk2 ";
                }
                elsif($i == 2)
                {
                    $clusters_context = "$prefix.clusters_context.pk3";
                    $cluster_labels = "$prefix.cluster_labels.pk3";
                    $param_str = $cluslabel_str . "--prefix $prefix.pk3 ";
                }
                elsif($i == 3)
                {
                    $clusters_context = "$prefix.clusters_context.gap";
                    $cluster_labels = "$prefix.cluster_labels.gap";
                    $param_str = $cluslabel_str . "--prefix $prefix.gap ";
                }
            }
        }
        else # No. of Clusters: Set Manually
        {
            $clusters_context = "$prefix.clusters_context";
            $cluster_labels = "$prefix.cluster_labels";
            $param_str = $cluslabel_str . "--prefix $prefix ";
        }

        # execute the cluster labeling program
        $status=system("clusterlabeling.pl $param_str $clusters_context > $cluster_labels");
        die "Error while running clusterlabeling.pl $param_str $clusters_context > $cluster_labels\n" unless $status==0;

        $i++;
    }
}

################
#  Evaluation
################

if(defined $opt_eval)
{
    if(defined $opt_verbose)
    {
        print STDERR "Evaluating ...\n";
    }

    $i = 0;
    while($i <= $num_k)
    {
        if(defined $opt_cluststop)
        {
            if($cluststop ne "all" && $cluststop ne "pk")
            {
                $prelabel="$prefix.prelabel.$cluststop";
                $label="$prefix.label.$cluststop";
                $report="$prefix.report.$cluststop";
                $cluster_solution ="$prefix.cluster_solution.$cluststop";
            }
            else
            {
                if($i == 0)
                {
                    $prelabel="$prefix.prelabel.pk1";
                    $label="$prefix.label.pk1";
                    $report="$prefix.report.pk1";
                    $cluster_solution ="$prefix.cluster_solution.pk1";
                }
                elsif($i == 1)
                {
                    $prelabel="$prefix.prelabel.pk2";
                    $label="$prefix.label.pk2";
                    $report="$prefix.report.pk2";
                    $cluster_solution ="$prefix.cluster_solution.pk2";
                }
                elsif($i == 2)
                {
                    $prelabel="$prefix.prelabel.pk3";
                    $label="$prefix.label.pk3";
                    $report="$prefix.report.pk3";
                    $cluster_solution ="$prefix.cluster_solution.pk3";
                }
                elsif($i == 3)
                {
                    $prelabel="$prefix.prelabel.gap";
                    $label="$prefix.label.gap";
                    $report="$prefix.report.gap";
                    $cluster_solution ="$prefix.cluster_solution.gap";
                }
            }
        }
        else # No. of Clusters: Set Manually
        {
            $prelabel="$prefix.prelabel";
            $label="$prefix.label";
            $report="$prefix.report";
            $cluster_solution ="$prefix.cluster_solution";
        }

        $status=system("cluto2label.pl $cluster_solution keyfile*.key > $prelabel");
        die "Error while running cluto2label.pl\n" unless $status==0;

        $status=system("label.pl $prelabel > $label");
        die "Error while running label.pl\n" unless $status==0;

        $status=system("report.pl $label $prelabel > $report");
        die "Error while running report.pl\n" unless $status==0;

        $i++;
    }

    $status=system("mv keyfile*.key $prefix.key");
    die "Error while creating the KEY file.\n" unless $status==0;
}

##################
# Printing Output
##################

if(defined $opt_cluststop)
{
    if($opt_cluststop eq "all")
    {
        $predict_measure[0] = "PK1 measure";
        $predict_measure[1] = "PK2 measure";
        $predict_measure[2] = "PK3 measure";
        $predict_measure[3] = "Adapted Gap Statistic";
    }
    elsif($opt_cluststop eq "pk")
    {
        $predict_measure[0] = "PK1 measure";
        $predict_measure[1] = "PK2 measure";
        $predict_measure[2] = "PK3 measure";
    }
    else
    {
        $predict_measure[0] = uc $opt_cluststop;
        $predict_measure[0] .= " measure";
    }
}
else
{
    $predict_measure[0] = "Set manually";
}

$i = 0;
while($i <= $num_k)
{
    print "\n=================================================================\n";
    print "Output when #clusters = $predict[$i] ($predict_measure[$i])\n";
    print "=================================================================\n";

    if(defined $opt_cluststop)
    {
        if($cluststop ne "all" && $cluststop ne "pk")
        {
            $cluster_output ="$prefix.cluster_output.$cluststop";
            $status=system("cat $cluster_output");
            die "Error while displaying the cluster results.\n" unless $status==0;

            if(defined $opt_eval)
            {
                $report = "$prefix.report.$cluststop";
                $status=system("cat $report");
                die "Error while displaying the report file.\n" unless $status==0;
            }

            $clusters="$prefix.clusters.$cluststop";
            print "\nClusters of given contexts can be found in file: <$clusters>\n\n";
        }
        else
        {
            if($i == 0)
            {
                $cluster_output ="$prefix.cluster_output.pk1";
                $status=system("cat $cluster_output");
                die "Error while displaying the cluster results.\n" unless $status==0;

                if(defined $opt_eval)
                {
                    $report = "$prefix.report.pk1";
                    $status=system("cat $report");
                    die "Error while displaying the report file.\n" unless $status==0;
                }

                $clusters="$prefix.clusters.pk1";
                print "\nClusters of given contexts can be found in file: $clusters\n\n";
            }
            elsif($i == 1)
            {
                $cluster_output ="$prefix.cluster_output.pk2";
                $status=system("cat $cluster_output");
                die "Error while displaying the cluster results.\n" unless $status==0;

                if(defined $opt_eval)
                {
                    $report = "$prefix.report.pk2";
                    $status=system("cat $report");
                    die "Error while displaying the report file.\n" unless $status==0;
                }

                $clusters="$prefix.clusters.pk2";
                print "\nClusters of given contexts can be found in file: $clusters\n\n";
            }
            elsif($i == 2)
            {
                $cluster_output ="$prefix.cluster_output.pk3";
                $status=system("cat $cluster_output");
                die "Error while displaying the cluster results.\n" unless $status==0;

                if(defined $opt_eval)
                {
                    $report = "$prefix.report.pk3";
                    $status=system("cat $report");
                    die "Error while displaying the report file.\n" unless $status==0;
                }

                $clusters="$prefix.clusters.pk3";
                print "\nClusters of given contexts can be found in file: $clusters\n\n";
            }
            elsif($i == 3)
            {
                $cluster_output ="$prefix.cluster_output.gap";
                $status=system("cat $cluster_output");
                die "Error while displaying the cluster results.\n" unless $status==0;

                if(defined $opt_eval)
                {
                    $report = "$prefix.report.gap";
                    $status=system("cat $report");
                    die "Error while displaying the report file.\n" unless $status==0;
                }

                $clusters="$prefix.clusters.gap";
                print "\nClusters of given contexts can be found in file: $clusters\n\n";
            }
        }
    }
    else # No. of Clusters: Set Manually
    {
        $cluster_output ="$prefix.cluster_output";
        $status=system("cat $cluster_output");
        die "Error while displaying the cluster results.\n" unless $status==0;

        if(defined $opt_eval)
        {
            $report = "$prefix.report";
            $status=system("cat $report");
            die "Error while displaying the report file.\n" unless $status==0;
        }

        $clusters="$prefix.clusters";
        print "\nClusters of given contexts can be found in file: $clusters\n\n";
    }

    $i++;
}

##############################################################################

#                      ==========================
#                          SUBROUTINE SECTION
#                      ==========================

sub svd
{
    ($svdin,$svdout)=@_;

    # converting input to harwell-boeing format
    $svd_string="";
    if(defined $opt_k)
    {
        $svd_string="--k $opt_k ";
    }
    if(defined $opt_rf)
    {
        $svd_string.="--rf $opt_rf ";
    }
    if(defined $opt_iter)
    {
        $svd_string.="--iter $opt_iter ";
    }

    $numform = "5$format";   ## numform is 5f16.XX

    $status=system("mat2harbo.pl --numform $numform --param $svd_string $svdin > matrix");
    die "Error while running mat2harbo.pl on <$svdin>\n" unless $status==0;

    system("las2");

    $harbomat="$prefix.harbomat";
    $status=system("mv matrix $harbomat");
    die "Error in creating <$harbomat>\n" unless $status==0;

    # reconstruction
    $status=system("svdpackout.pl --rowonly --format $format lav2 lao2 > $svdout");
    die "Error while running svdpackout.pl\n" unless $status==0;
}

#-----------------------------------------------------------------------------
#show minimal usage message
sub showminimal()
{
    print "Usage: discriminate.pl [OPTIONS] TEST";
    print "\nTYPE discriminate.pl --help for help\n";
}

#-----------------------------------------------------------------------------
#show help
sub showhelp()
{
    print "Usage:  discriminate.pl [OPTIONS] TEST

Wrapper program for SenseClusters' Toolkit. Discriminates among the given text
instances based on their contextual similarities.

TEST
        Senseval-2 formatted TEST instance file containing the instances to be
        clustered.
OPTIONS:

--training TRAIN
        Specify the training file in plain text format. Instances from this
        file are used for selecting features. If --training is not specified,
        features are selected from the same TEST file.

--split N
        Splits the given TEST file into two portions, N% for use as the TRAIN
        data and (100-N)% as the TEST data. The value for N is a percentage
        and should be an integer between 1 and 99 (inclusive). The instances
        from the original TEST file are not picked or split in any particular
        order but are randomly split into the two portions of TRAIN and TEST
        data while maintaining the ratio of N/(100-N).

        Note: This option cannot be used when --training option is also used.

--token TOKEN
        Specify a file containing Perl regex/s that define the tokenization
        scheme in TRAIN and TEST files. By default, token.regex is searched
        in the current directory.

--target TARGET
        Specify a file containing Perl regex/s that identify the target
        word/s whose senses are to be discriminated. If --target is not
        specified, target.regex file is searched in the current directory.
        If this file doesn't exist, target.regex is automatically created by
        searching the tags in the TEST data. If no tags are found in TEST,
        TEST is assumed to be global.

        Note: --target cannot be specified with headless input data i.e.
        test file without head/target word(s).

--prefix PRE
        Specify the prefix to be used for output filenames.

--format f16.XX
        The default format for floating point numbers is f16.06. This means
        that there is room for 6 digits to the right of the decimal, and 9
        to the left. You may change XX to any value between 0 and 15,
        however, the format must remain 16 spaces long due to formatting
        requirements of SVDPACKC.

--wordclust
        Discriminates and clusters each word based upon its direct and
        indirect co-occurrence with other words (when used without the --lsa
        switch) or clusters words or features based upon their occurrences
        in different contexts (when used with the --lsa switch).

        Note:
        1. Separate (--training) TRAIN data should not be used with word
           clustering.
        2. Starting with Version 0.93, word clustering is no longer
           restricted to using only headless data. However, options specific
           to headed data such as --scope_test and target co-occurrence
           features (see below) cannot be used.

--lsa
        Uses Latent Semantic Analysis (LSA) style representation for
        clustering features or contexts. LSA representation is the transpose
        of the context-by-feature matrix created using the native
        SenseClusters order1 context representation.

        This option can be used only in the following two combinations of
        the --context and the --wordclust options:

        1. --context o1 --wordclust --lsa
           Performs feature clustering by grouping together features based
           on the contexts that they occur in. Features can be unigrams,
           bigrams or co-occurrences. Feature vectors are the rows of the
           transposed context-by-feature representation created by
           order1vec.pl.

        2. --context o2 --lsa
           Performs context clustering by creating context vectors by
           averaging the feature vectors from the transposed
           context-by-feature representation of order1vec.pl.

Feature Options :

--feature TYPE
        Specify the feature type to be used for representing contexts.

        Possible options for feature type with first order context
        representation:
                bi  - bigrams [default]
                tco - target co-occurrences
                co  - co-occurrences
                uni - unigrams

        Possible options for feature type with second order context
        representation:
                bi  - bigrams [default]
                co  - co-occurrences
                tco - target co-occurrences

        Note: The tco feature type (target co-occurrences) cannot be used
        with headless data i.e. test/train file without head/target word(s).

--scope_train S1
        Context in TRAIN instances is limited to include only S1 words on
        the left and right of the TARGET word. Use --scope_train only if
        every training instance contains the TARGET word.

        Note: --scope_train cannot be used with headless data i.e. train
        file without head/target word(s).
--scope_test S2
        Context in TEST instances is limited to include only S2 words on
        the left and right of the TARGET word. Use --scope_test only if
        every test instance contains the TARGET word.

        Note: --scope_test cannot be used with headless data i.e. test file
        without head/target word(s).

--remove F
        Features occurring less than F number of times are removed from the
        feature set.

--window W
        Sets the window size for bigram and co-occurrence features. Words
        occurring within W positions from each other (i.e. at most W-2
        intervening words) form bigrams/co-occurrences.

--stop STOPFILE
        Specify a file of Perl regex/s that define a stop list of words to
        be excluded from the features.

--stat STAT
        Performs the specified statistical test of association on bigrams/
        co-occurrences. The test scores can be used to filter insignificant
        pairs or in the feature vector representations.

        The possible values of STAT are -
                dice        - Dice Coefficient
                ll          - Log Likelihood Ratio
                odds        - Odds Ratio
                phi         - Phi Coefficient
                pmi         - Point-Wise Mutual Information
                tmi         - True Mutual Information
                x2          - Chi-Squared Test
                tscore      - T-Score
                leftFisher  - Left Fisher's Test
                rightFisher - Right Fisher's Test

--stat_rank R
        Word pairs ranking below R when arranged in descending order of
        their test scores are ignored. --stat_rank will be ignored unless
        --stat option is specified.

--stat_score S
        Specify the score cutoff value to select pairs with test scores
        greater than S. --stat_score will be ignored unless option --stat is
        specified.

Vector Options :

--context ORD
        Specify the context representation to be used to represent the TEST
        instances. Set ORD to 'o1' to use 1st order context vectors and to
        'o2' to use 2nd order context vectors. Default context
        representation is o2.

--binary
        Creates binary feature and context vectors. By default, the
        frequency scores are retained by these vectors.

SVD Options :

--svd
        Performs Singular Value Decomposition to reduce the feature space
        dimensions.
--k K
        Reduces dimensions of the feature space to K. Default is 300.

--rf RF
        Specifies the reduction factor such that feature space with N
        dimensions is reduced down to N/RF (RF >= 1). Default RF=10.

--iter I
        Specifies the number of SVD iterations. Recommended value is (3 x K).

Cluster-Stopping Options:

--cluststop CS
        Specify the cluster stopping measure to be used to predict the
        number of clusters.

        The possible option values:
          pk1 - Use PK1 measure
                [PK1[m] = (crfun[m] - mean(crfun[1...deltaM]))/std(crfun[1...deltaM])]
          pk2 - Use PK2 measure
                [PK2[m] = (crfun[m]/crfun[m-1])]
          pk3 - Use PK3 measure
                [PK3[m] = ((2 * crfun[m])/(crfun[m-1] + crfun[m+1]))]
          gap - Use Adapted Gap Statistic.
          pk  - Use all the PK measures.
          all - Use all the four cluster stopping measures.

        More about these measures can be found in the documentation of
        Toolkit/clusterstop/clusterstopping.pl

        NOTE: Options --clusters and --cluststop cannot be used together.

--delta INT
        NOTE: Delta value can only be a non-negative integer value.

        Specify 0 to stop the iterating clustering process when two
        consecutive crfun values are exactly equal. This is the default
        setting when the crfun values are integer/whole numbers.

        Specify non-zero positive integer to stop the iterating clustering
        process when the difference between two consecutive crfun values is
        less than or equal to this value. However, note that the integer
        value specified is internally shifted to capture the difference in
        the least significant digit of the crfun values when these crfun
        values are fractional.
         For example:
                For crfun = 1.23e-02, delta = 1 will be transformed to 0.0001
                For crfun = 2.45e-01, delta = 5 will be transformed to 0.005
        The default delta value when the crfun values are fractional is 1.

        However if the crfun values are integer/whole numbers (exponent >= 2)
        then the specified delta value is internally shifted only until the
        least significant digit in the scientific notation.
         For example:
                For crfun = 1.23e+04, delta = 2 will be transformed to 200
                For crfun = 2.45e+02, delta = 5 will be transformed to 5
                For crfun = 1.44e+03, delta = 1 will be transformed to 10

--threspk1 NUM
        The threshold value that should be used by the PK1 measure to
        predict the k value.
        Default = -0.7

        NOTE: This option should be used only when --cluststop option is
        also used with option value of \"all\" or \"pk1\".

Cluster-Stopping: Adapted Gap Statistic Options:

--B NUM
        The number of replicates/references to be generated.
        Default: 1

--typeref TYP
        Specifies whether to generate B replicates from a reference or to
        generate B references.

        The possible option values:
                rep - replicates [Default]
                ref - references

--percentage NUM
        Specifies the percentage confidence to be reported in the log file.
        Since Gap Statistic uses parametric bootstrap method for reference
        distribution generation, it is critical to understand the interval
        around the sample mean that could contain the population (\"true\")
        mean and with what certainty.
        Default: 90

--seed NUM
        The seed to be used with the random number generator.
        Default: No seed is set.

Clustering Options :

--clusters C
        Specify the number of clusters to be created. Default is 2.

--space SPACE
        Specifies whether clustering is to be performed in vector or
        similarity space. Set SPACE to 'vector' to cluster context vectors
        directly in vector space OR to 'similarity' to compose a similarity
        matrix and cluster instances in similarity space. Default SPACE is
        vector.

--clmethod CL
        Specifies the clustering method.

        Possible option values are :
                rb     - Repeated Bisections [Default]
                rbr    - Repeated Bisections followed by k-way refinement
                direct - Direct k-way clustering
                agglo  - Agglomerative clustering
                graph  - Graph partitioning-based clustering
                bagglo - Partitional biased Agglomerative clustering

--crfun CR
        Selects the criteria function for Clustering. The meanings of these
        criteria functions are explained in Cluto's manual.
        The possible values are :
                i1     - I1 Criterion function
                i2     - I2 Criterion function [default for partitional]
                e1     - E1 Criterion function
                g1     - G1 Criterion function
                g1p    - G1' Criterion function
                h1     - H1 Criterion function
                h2     - H2 Criterion function
                slink  - Single link merging scheme
                wslink - Single link merging scheme weighted w.r.t.
                         cluster sim
                clink  - Complete link merging scheme
                wclink - Complete link merging scheme weighted w.r.t.
                         cluster sim
                upgma  - Group average merging scheme
                         [default for agglomerative]

        Note that for cluster stopping, only the i1, i2, e1, h1 and h2
        criterion functions can be used. If a crfun other than these is
        selected then cluster stopping uses the default crfun (i2) while the
        final clustering of contexts is performed using the crfun specified.

--sim SIM
        Specifies the similarity measure to be used during clustering.

        When --space is vector, possible option values of SIM are :
                cos  - Cosine Coefficient [default]
                corr - Correlation Coefficient
                dist - Euclidean distance
                jacc - Extended Jaccard Coefficient

        When --space is similarity and --binary is ON, possible values of
        SIM are :
                cos - Cosine Coefficient [default]
                mat - Match Coefficient
                jac - Jaccard Coefficient
                ovr - Overlap Coefficient
                dic - Dice Coefficient

        Otherwise, only cosine coefficient is available and is default.

--rowmodel RMOD
        The option is used to specify the model to be used to scale every
        column of each row. (For further details please refer to the Cluto
        manual)

        The possible values for RMOD -
                none  - no scaling is performed (default setting)
                maxtf - post scaling the values are between 0.5 and 1.0
                sqrt  - square-root of actual values
                log   - log of actual values

--colmodel CMOD
        The option is used to specify the model to be used to (globally)
        scale each column across all rows.
        (For further details please refer to the Cluto manual)

        The possible values for CMOD -
                none - no scaling is performed (default setting)
                idf  - scaling according to inverse-document-frequency

Labeling Options :

Note: Labeling options cannot be used with word-clustering (--wordclust).

--label_stop LABEL_STOPFILE
        A file of Perl regexes that define the stop list of words to be
        excluded from the labels.

--label_remove LABEL_N
        Removes bigrams that occur less than LABEL_N times.

--label_window LABEL_W
        Specifies the window size for bigrams. Pairs of words that co-occur
        within the specified window from each other (window LABEL_W allows
        at most LABEL_W-2 intervening words) will form the bigram features.
        Default window size is 2 which allows only consecutive word pairs.

--label_stat LABEL_STAT
        Specifies the statistical scores of association.

        Available tests of association are :
                dice        - Dice Coefficient
                ll          - Log Likelihood Ratio
                odds        - Odds Ratio
                phi         - Phi Coefficient
                pmi         - Point-Wise Mutual Information
                tmi         - True Mutual Information
                x2          - Chi-Squared Test
                tscore      - T-Score
                leftFisher  - Left Fisher's Test
                rightFisher - Right Fisher's Test

--label_rank LABEL_R
        Word pairs ranking below LABEL_R when arranged in descending order
        of their test scores are ignored.

Other Options :

--eval
        Evaluates clustering performance by comparing results against
        correct answer keys.

        Note: This option can be used only if the answer tags are provided
        in the TEST file.

--rank_filter R
        Removes low frequency senses during evaluation. This will remove
        the senses that rank below R when senses in TEST are arranged in the
        descending order of their frequencies. In other words, it selects
        the top R most frequent senses. An instance will be removed if it
        has all sense tags below rank R.

--percent_filter P
        Removes low frequency senses based on their percentage frequencies.
        This will remove senses whose frequency is below P% in the TEST data.
--showargs
        Displays to STDOUT values of compulsory and optional arguments.
        [NOT SUPPORTED IN THIS VERSION]

--verbose
        Displays to STDERR the current program status.

--help
        Displays this message.

--version
        Displays the version information.

Type 'perldoc discriminate.pl' to view the detailed documentation of
discriminate.\n";
}

#------------------------------------------------------------------------------
#version information
sub showversion()
{
    print '$Id: discriminate.pl,v 1.104 2008/03/24 04:06:47 tpederse Exp $';
    # print "Wrapper program for using SenseClusters' Toolkit.\n";
    print "\nCopyright (c) 2002-2006, Ted Pedersen, Amruta Purandare, Anagha Kulkarni, & Mahesh Joshi\n";
    ## print "Date of Last Update: 07/30/2006\n";
}

#############################################################################