#!/usr/local/bin/perl -w

=head1 NAME

prepare_sval2.pl - Makes sure Senseval-2 data is cleaned and has sense tags prior to invocation of SenseClusters

=head1 SYNOPSIS

 prepare_sval2.pl [Options] SOURCE

Here is a Senseval-2 file that is untagged:

 cat notags.txt

Output =>

 he played on the offensive line in college
 i think the phone line is down

Here is a key file that contains sense tags for these instances:

 cat key.txt

Output =>

Now we can apply the tags in the key file to the previously untagged instances:

 prepare_sval2.pl notags.txt --key key.txt

Output =>

 he played on the offensive line in college
 i think the phone line is down

Type C<prepare_sval2.pl --help> for a quick summary of options.

=head1 DESCRIPTION

This program prepares Senseval-2 data for SenseClusters experiments by making
sure that all instances have sense tags. Sense tags can be applied from a
separate key file, and if any instances do not have tags, then a NOTAG is
inserted.

This program also deals with P tags that may exist in some Senseval data.
The P tag indicates that the target word is a proper noun. In many cases
P tagged instances are omitted from experiments since they are a different
kind of sense. For example, if "bush" were the target word, some instances
might refer to "George Bush", which may not be one of the senses we wish to
evaluate.

Finally, this program can also deal with satellite tags that exist in some
Senseval data. When the target word is a verb, in some cases it may have a
satellite (particle) that we may or may not want to consider as a part of
the target word. The satellite tags have identifiers in them that may cause
parsing trouble, so they are often removed.

=head1 INPUT

=head2 Required Arguments:

=head4 SOURCE

A Senseval-2 formatted data file that is to be prepared for the
SenseClusters experiments.

=head2 Optional Arguments:

=head4 --key KEY

Sense tagging mechanism in prepare_sval2.pl -

prepare_sval2.pl makes sure that all SOURCE instances are tagged with some
answer tags (or NOTAGs at least).
If the sense tags are found in the same SOURCE file, these will be retained.
However, if the SOURCE instances are not tagged, each instance will be given
either a "NOTAG" or the sense tags listed for it in the separate KEY file.

A KEY file that has the true answer keys of the SOURCE instances can be
provided via the --key option. If the SOURCE instances are not sense tagged,
they will be tagged with the sense tags given in the KEY file.

The KEY file should be in SenseClusters format, with one instance per line:
an instance id followed by one or more of its true sense ids.

prepare_sval2 takes into account the following anomalies in SOURCE/KEY -

=over 4

=item 1.

If the 1st SOURCE instance is sense tagged, it assumes that SOURCE is
sense tagged and will disable the KEY file option. If some of the SOURCE
instances are not tagged, regardless of whether they have keys in the KEY
file or not, these are given "NOTAG"s.

=item 2.

If the 1st SOURCE instance is not sense tagged, it assumes that SOURCE is
untagged and will give an error if any SOURCE instance is found sense tagged
in the SOURCE file.

=item 3.

If the 1st SOURCE instance is not sense tagged and has an entry in the KEY
file, it will enable the KEY file and will attach to the instances their
answer keys as given in the KEY file. Any instance that doesn't have an
answer key in the KEY file is given a "NOTAG".

=item 4.

If the 1st SOURCE instance is not sense tagged and doesn't have an entry in
the KEY file, the KEY file will be disabled and no instance will be given a
tag from the KEY file. All instances are given "NOTAG"s.

=back

=head4 --attachP

P tag handling mechanism in prepare_sval2.pl -

prepare_sval2.pl by default removes the sense tags that have the value P.
According to the Senseval-2 standard, these are not true sense tags but
indicate that the target word is a proper noun.

The --attachP option will attach a P tag to an immediately following sense
tag for the same instance. e.g.
If --attachP is selected, a P tag followed by a sense tag with id S is
rewritten as a single sense tag with id P_S; if --attachP is not selected,
the P tag is removed by default and the following sense tag is left
unchanged.

=head4 --modifysat

This switch, if selected, will remove the satellite tag ids from
C<< <head> >> and C<< <sat> >> tags, retaining the basic C<< <head> >> and
C<< <sat> >> tag information.

e.g. by selecting --modifysat,

 Perhaps he 'd have called for a decentralized political and economic system

will be transformed to

 perhaps he 'd have called for a decentralized political and economic system

By not selecting --modifysat, the satellite ids would be retained.

=head4 --nolc

prepare_sval2 converts everything to lowercase by default. Select this switch
to not do any case conversion.

=head4 --help

Displays this message.

=head4 --version

Displays the version information.

=head1 OUTPUT

Output will be a Senseval-2 file displayed to stdout.

=head1 AUTHORS

 Amruta Purandare, University of Pittsburgh

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu

=head1 COPYRIGHT

Copyright (c) 2002-2008, Amruta Purandare and Ted Pedersen

This program is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.

=cut

#############################################################################

#                           THE CODE STARTS HERE

use utf8;

###############################################################################

#                       ================================
#                        COMMAND LINE OPTIONS AND USAGE
#                       ================================

# show minimal usage message if no arguments
if($#ARGV<0)
{
	&showminimal();
	exit;
}

# command line options
use Getopt::Long;
GetOptions ("help","version","attachP","modifysat","key=s","nolc");

# show help option
if(defined $opt_help)
{
	$opt_help=1;
	&showhelp();
	exit;
}

# show version information
if(defined $opt_version)
{
	$opt_version=1;
	&showversion();
	exit;
}

#############################################################################

#                       ================================
#                          INITIALIZATION AND INPUT
#                       ================================

#$0 contains the program name along with
#the complete path. Extract just the program
#name and use in error messages
$0=~s/.*\/(.+)/$1/;

if(!defined $ARGV[0])
{
	print STDERR "ERROR($0): Please specify the Senseval-2 Data file name...\n";
	exit;
}

#accept the input file name
$infile=$ARGV[0];
if(!-e $infile)
{
	print STDERR "ERROR($0): Source file <$infile> doesn't exist...\n";
	exit;
}
open(IN,$infile) || die "Error($0): Error(code=$!) in opening <$infile> file.\n";

##############################################################################

#                       ===========================
#                            KEY file handling
#                       ===========================

# if the sense tags of the instances in Source file
# are provided in KEY file, we attach them to source
# instances
if(defined $opt_key)
{
	$keyfile=$opt_key;
	if(!-e $keyfile)
	{
		print STDERR "ERROR($0): KEY File <$keyfile> doesn't exist.\n";
		exit;
	}
	open(KEY,$keyfile) || die "Error($0): Error(code=$!)
in opening file <$keyfile>.\n";
	$line_num=0;
	while(<KEY>)
	{
		$line_num++;
		chomp;
		# trimming extra spaces from beginning and end
		s/^\s+//g;
		s/\s+$//g;
		s/\s+/ /g;
		# handling blank lines
		if(/^\s*$/)
		{
			next;
		}
		#get the instance id from the key file
		if(//)
		{
			$instance=$1;
			$_=$';
			if(defined $instance_hash{$instance})
			{
				print STDERR "ERROR($0): Instance-Id <$instance> is repeated in the KEY file <$keyfile>.\n";
				exit;
			}
			$instance_hash{$instance}++;
		}
		else
		{
			print STDERR "ERROR($0): Line <$line_num> in the KEY file <$keyfile> doesn't contain any tag.\n";
			exit;
		}
		# get sense ids now
		while(//)
		{
			$sense=$1;
			$_=$';
			if(defined $key_tab{$instance}{$sense})
			{
				print STDERR "ERROR($0): The Instance-Id Sense-Tag pair <$instance $sense> is repeated in the KEY file <$keyfile>.\n";
				exit;
			}
			# making an entry for the instance in the keytab
			$key_tab{$instance}{$sense}=1;
		}
		# checking if this instance has at least one sense tag
		if(!defined $key_tab{$instance})
		{
			print STDERR "ERROR($0): No Sense Id found at line <$line_num> in KEY file <$keyfile>.\n";
			exit;
		}
	}
}

##############################################################################

#---------------------
#creating a TEMP file
#---------------------

#we hold the output in tempfile till the program terminates
#without an error. In case of error, the tempfile would be
#retained and will hold partial output of the program.
#use the system_defined date for unique name for tempfile
#$date_time=scalar localtime;
#@time_elements=split(/\s+/,$date_time);
#$tempfile=join "_",@time_elements;
$tempfile="temp".time().".prepare_sval2";
open(TEMP,">$tempfile")||die"ERROR($0): Internal System Error(code=$!).\n";

##############################################################################

# tag_flag=0 if data is untagged
#         =1 if tagged
undef $tag_flag;
undef $data_start;
$line_num=0;
# if tag_flag=1, sense tags must be found for all instances
$tag_found=0;
while(<IN>)
{
	$line_num++;
	# KEY handling
	if(/instance id=\"([^\"]+)\"/)
	{
		$instance=$1;
		# we access key table only if data is untagged
		# otherwise key entries are ignored
		if(!defined $tag_flag || $tag_flag==0)
		{
			if(defined $key_tab{$instance})
			{
				# attach_key = 1 only if the first instance
				#                has tags in KEY
				#            = 0 otherwise
				if(!defined $attach_key)
				{
					$attach_key=1;
				}
				foreach $sense (keys %{$key_tab{$instance}})
				{
					$instance_sense{$instance}{$sense}=1;
				}
			}
			else
			{
				if(!defined $attach_key)
				{
					$attach_key=0;
				}
			}
		}
	}
	if(/sense\s*id=\"([^\"]+)\"/)
	{
		if(!defined $tag_flag)
		{
			$tag_flag=1;
		}
		# error if sense id is not expected
		elsif($tag_flag==0)
		{
			print STDERR "ERROR($0): No Sense Id is expected in Source file <$infile> for instance <$instance> as all earlier instances are untagged.\n";
			exit;
		}
		if($1 ne "P")
		{
			$tag_found=1;
		}
	}
	if(defined $data_start && !defined $opt_nolc)
	{
		tr/A-Z/a-z/;
	}
	if(/<context>/)
	{
		$data_start=1;
		if(!defined $tag_flag)
		{
			$tag_flag=0;
		}
		# putting no tag if some instances aren't tagged
		elsif($tag_flag==1 && $tag_found==0)
		{
			print TEMP "\n";
		}
		$tag_found=0;
	}
	if(/<\/context>/)
	{
		undef $data_start;
		undef $ptag;
	}
	if(defined $ptag && ($_ !~ /senseid=\"[^\"]+\"/))
	{
		print STDERR "ERROR($0): P tag is not followed by any Sense tag at line <$line_num> in Senseval-2 file <$infile>.\n";
		exit;
	}
	# by default remove P tag
	if((!defined $opt_attachP) && /senseid=\"P\"/)
	{
		next;
	}
	# if --attachP defined attach P tag
	if(defined $opt_attachP && /senseid=\"P\"/)
	{
		$ptag=1;
		next;
	}
	if(defined $ptag && /senseid=\"([^\"]+)\"/)
	{
		$sense="P_".$1;
		s/sense\s*id=\"$1\"/senseid=\"$sense\"/;
		undef $ptag;
	}
	# if --modifysat used, remove sat ids from sat and head tags
	if(defined $opt_modifysat && //)
	{
		s///g;
	}
	if(defined $opt_modifysat && //)
	{
		s///g;
	}
	print TEMP $_;
}

undef $opt_attachP;
undef $opt_modifysat;
undef $opt_nolc;

#now display to STDOUT
close TEMP;
open(TEMP,$tempfile) || die "ERROR($0): Internal System Error(code=$!).\n";

# read temp file and display with extra information
while(<TEMP>)
{
	if(//)
	{
		if($tag_flag==0)
		{
			print "\n";
		}
		elsif($tag_flag==1)
		{
			print "\n";
		}
		else
		{
			print STDERR "ERROR($0): Error in Processing Data <$infile>.\n";
			exit;
		}
	}
	elsif(/instance id=\"([^\"]+)\"/)
	{
		print;
		$instance=$1;
		# data untagged - either attach tag from KEY or put NOTAG
		if($tag_flag==0)
		{
			# get tag from the KEY file
			if(defined $attach_key && $attach_key==1)
			{
				if(defined $instance_sense{$instance})
				{
					foreach $sense (keys %{$instance_sense{$instance}})
					{
						if($sense ne "P")
						{
							print "\n";
						}
					}
				}
				else
				{
					print "\n";
				}
			}
			# put tag as NOTAG
			else
			{
				print "\n";
			}
		}
	}
	else
	{
		print;
	}
}

#remove the tempfile
unlink "$tempfile";

##############################################################################

#                      ==========================
#                          SUBROUTINE SECTION
#                      ==========================

#-----------------------------------------------------------------------------
#show minimal usage message
sub showminimal()
{
	print "Usage: prepare_sval2.pl [OPTIONS] SOURCE";
	print "\nTYPE prepare_sval2.pl --help for help\n";
}

#-----------------------------------------------------------------------------
#show help
sub showhelp()
{
	print "Usage:  prepare_sval2.pl [OPTIONS] SOURCE

Prepares Senseval-2 Data by converting context data to lower case and some
other preprocessing tasks like attaching sense tags, handling P tags and
Sat tags. The modified file is displayed to stdout.

Required Parameters -

SOURCE
	Specify Senseval-2 Data file.

Optional Parameters:

--key KEY
	Tags SOURCE instances with their correct answer tags if these are
	provided in a KEY file. Each line of a KEY file should give an
	Instance-Id followed by its true sense tag/s.

--attachP
	Attaches P tags to the Sense Tags immediately following them.
	By default, P tags are removed since they indicate proper nouns.

	Note: attachP doesn't work when answer tags are provided in KEY file.
	But an option --attachP is provided in keyconvert.pl program that
	attaches P tags while converting format of KEY file to SenseClusters
	format.

--modifysat
	Modifies satellite and head tags that contain satellite ids by
	removing the ids and keeping only the plain <head> and <sat> markers.

--nolc
	prepare_sval2.pl converts all characters to lowercase by default.
	Select --nolc switch not to do any case conversion.

--help
	To display this message.

--version
	To display the version information.\n";
}

#------------------------------------------------------------------------------
#version information
sub showversion()
{
#	print "prepare_sval2.pl      -       Version 0.19\n";
	print '$id$';
	print "\nEnsure Senseval-2 data is sense tagged and cleaned\n";
#	print "\nCopyright (c) 2002-2005, Amruta Purandare, Ted Pedersen.\n";
#	print "Date of Last Update:     07/18/2003\n";
}

#############################################################################