package Algorithm::SVMLight;

use strict;
use DynaLoader ();

use vars qw($VERSION @ISA);
$VERSION = '0.09';
@ISA = qw(DynaLoader);
__PACKAGE__->bootstrap( $VERSION );

sub new {
  my $package = shift;
  my $self = bless {
    @_,
    features => {},
    rfeatures => [undef],
  }, $package;
  $self->_xs_init;
  $self->_param_init(@_);
  return $self;
}

my %params = map {$_,1} qw(
  type
  svm_c
  eps
  svm_costratio
  transduction_posratio
  biased_hyperplane
  sharedslack
  svm_maxqpsize
  svm_newvarsinqp
  kernel_cache_size
  epsilon_crit
  epsilon_shrink
  svm_iter_to_shrink
  maxiter
  remove_inconsistent
  skip_final_opt_check
  compute_loo
  rho
  xa_depth
  predfile
  alphafile

  kernel_type
  poly_degree
  rbf_gamma
  coef_lin
  coef_const
  custom
);

sub _param_init {
  my ($self, %args) = @_;
  while (my ($k, $v) = each %args) {
    if (exists $params{$k}) {
      my $method = "set_$k";
      $self->$method($v);
    } else {
      die "Unknown parameter '$k'\n";
    }
  }
}

sub is_trained {
  my $self = shift;
  return exists $self->{_model};
}

sub feature_names {
  my $self = shift;
  return keys %{ $self->{features} };
}

sub predict {
  my ($self, %params) = @_;
  for ('attributes') {
    die "Missing required '$_' parameter" unless exists $params{$_};
  }

  my (@values, @indices);
  while (my ($key) = each %{ $params{attributes} }) {
    push @indices, $self->{features}{$key} if exists $self->{features}{$key};
  }
  @indices = sort {$a <=> $b} @indices;
  foreach my $i (@indices) {
    push @values, $params{attributes}{ $self->{rfeatures}[$i] };
  }

  # warn "Predicting: (@indices), (@values)\n";
  $self->predict_i(\@indices, \@values);
}

sub add_instance {
  my ($self, %params) = @_;
  for ('attributes', 'label') {
    die "Missing required '$_' parameter" unless exists $params{$_};
  }
  for ($params{label}) {
    die "Label must be a real number, not '$_'" unless /^-?\d+(\.\d+)?$/;
  }

  my @values;
  my @indices;
  while (my ($key, $val) = each %{ $params{attributes} }) {
    unless ( exists $self->{features}{$key} ) {
      $self->{features}{$key} = 1 + keys %{ $self->{features} };
      push @{ $self->{rfeatures} }, $key;
    }
    push @indices,
      $self->{features}{$key};
  }
  @indices = sort { $a <=> $b} @indices;
  foreach my $i (@indices) {
    push @values, $params{attributes}{ $self->{rfeatures}[$i] };
  }

  #warn "Adding document: (@indices), (@values) => $params{label}\n";

  my $id = exists $params{query_id} ? $params{query_id} : 0;
  my $slack = exists $params{slack_id} ? $params{slack_id} : 1;
  my $cost = exists $params{cost_factor} ? $params{cost_factor} : 1.0;
  $self->add_instance_i($params{label}, "", \@indices, \@values, $id, $slack, $cost);
}

sub write_model {
  my ($self, $file) = @_;
  $self->_write_model($file);

  # Write a footer line
  if ( my $numf = keys %{ $self->{features} } ) {
    open my($fh), '>>', $file or die "Can't write footer to $file: $!";
    print $fh ('#rfeatures: [undef, ' , join( ', ', map _escape($self->{rfeatures}[$_]), 1..$numf ), "]\n");
  }
}

sub read_model {
  my ($self, $file) = @_;
  $self->_read_model($file);

  # Read the footer line
  open my($fh), '<', $file or die "Can't read $file: $!";
  local $_;
  while (<$fh>) {
    next unless /^#rfeatures: (\[.*\])$/;
    my $rf = $self->{rfeatures} = eval $1;
    die $@ if $@;
    $self->{features} = { map {$rf->[$_], $_} 1..$#$rf };
  }
}

sub _escape {
  local $_ = shift;
  s/([\\'])/\\$1/g;
  s/\n/\\n/g;
  s/\r/\\r/g;
  return "'$_'";
}

1;
__END__

=head1 NAME

Algorithm::SVMLight - Perl interface to SVMLight Machine-Learning Package

=head1 SYNOPSIS

  use Algorithm::SVMLight;
  my $s = Algorithm::SVMLight->new;

  $s->add_instance
    (attributes => {foo => 1, bar => 1, baz => 3},
     label => 1);

  $s->add_instance
    (attributes => {foo => 2, blurp => 1},
     label => -1);

  # ...
  # repeat for several more instances, then:
  $s->train;

  # Find results for unseen instances
  my $result = $s->predict
    (attributes => {bar => 3, blurp => 2});

=head1 DESCRIPTION

This module implements a Perl interface to Thorsten Joachims' SVMLight
package:

=over 4

SVMLight is an implementation of Vapnik's Support Vector Machine
[Vapnik, 1995] for the problem of pattern recognition, for the problem
of regression, and for the problem of learning a ranking function. The
optimization algorithms used in SVMlight are described in [Joachims,
2002a]. [Joachims, 1999a]. The algorithm has scalable memory
requirements and can handle problems with many thousands of support
vectors efficiently.

-- http://svmlight.joachims.org/

=back

Support Vector Machines in general, and SVMLight specifically,
represent some of the best-performing Machine Learning approaches in
domains such as text categorization, image recognition, bioinformatics
string processing, and others.

For efficiency reasons, the underlying SVMLight engine indexes features
by integers, not strings. Since features are commonly thought of by
name (e.g. the words in a document, or mnemonic representations of
engineered features), we provide in C<Algorithm::SVMLight> a simple
mechanism for mapping back and forth between feature names (strings)
and feature indices (integers). If you want to use this mechanism, use
the C<add_instance()> and C<predict()> methods. If not, use the
C<add_instance_i()> (or C<read_instances()>) and C<predict_i()>
methods.

=head1 INSTALLATION

For installation instructions, please see the F<README> file included
with this distribution.

=head1 METHODS

=over 4

=item new(...)

Creates a new C<Algorithm::SVMLight> object and returns it. Any named
arguments that correspond to SVM parameters will cause their
corresponding C<set_I<***>()> method to be invoked:

  $s = Algorithm::SVMLight->new(
         type => 2,              # Regression model
         biased_hyperplane => 0, # Nonbiased
         kernel_type => 3,       # Sigmoid
  );

See the C<set_I<***>(...)> method for a list of such parameters.

=item set_I<***>(...)


The following parameters can be set by using methods with their
corresponding names - for instance, the C<maxiter> parameter can be
set by using C<set_maxiter($x)>, where C<$x> is the new desired value.

  Learning parameters:
     type
     svm_c
     eps
     svm_costratio
     transduction_posratio
     biased_hyperplane
     sharedslack
     svm_maxqpsize
     svm_newvarsinqp
     kernel_cache_size
     epsilon_crit
     epsilon_shrink
     svm_iter_to_shrink
     maxiter
     remove_inconsistent
     skip_final_opt_check
     compute_loo
     rho
     xa_depth
     predfile
     alphafile

  Kernel parameters:
     kernel_type
     poly_degree
     rbf_gamma
     coef_lin
     coef_const
     custom

For an explanation of these parameters, you may be interested in
looking at the F<svm_common.h> file in the SVMLight distribution.

It is a good idea to set these parameters only via arguments to
C<new()> (see above) or right after calling C<new()>, since I don't
think the underlying C code expects them to change in the middle of a
process.

=item add_instance(label => $x, attributes => \%y)

Adds a training instance to the set of instances which will be used to
train the model. An C<attributes> parameter specifies a hash of
attribute-value pairs for the instance, and a C