. Called in array context, also returns the I<p>-value itself. Same options and conditions as above. Other options are B (for the z_value) and B<tails> (for the p_value).
=cut
sub z_value {
my $self = shift;
my $args = ref $_[0] ? shift : {@_};
$args->{'tails'} ||= 1;
my $pval = $self->p_value($args);
require Statistics::Zed;
my $zed = Statistics::Zed->new();
my $zval = $zed->p2z(value => $pval, %$args);
return wantarray ? ($zval, $pval) : $zval;
}
*vzs = \&z_value;
*vnomes_zscore = \&z_value;
*zscore = \&z_value;
=head2 p_value, test, vnomes_test, vnt
$p = $vnomes->p_value(); # using loaded data and default args
$p = $vnomes->p_value(data => [1, 0, 1, 1, 0], exact => 1); # using given data (by-passing load and read)
$p = $vnomes->p_value(trials => 20, observed => 10); # without using data
Returns the probability of obtaining the psisq value for data already loaded, or directly keyed as B<data>. The I<p>-value is read off the complemented chi-square distribution (incomplete gamma integral) using L<Math::Cephes|Math::Cephes> C<igamc>.
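
For even degrees of freedom, the complemented chi-square probability has a simple closed form, which can serve as a rough cross-check of the value read from C<igamc>. The following is only a sketch under that assumption: C<chisq_upper_even_df> is a hypothetical helper, not part of this module, and odd degrees of freedom still need the incomplete gamma integral.

```perl
use strict;
use warnings;

# Upper-tail chi-square probability for EVEN df only, using the closed form
#   exp(-x/2) * sum_{k = 0}^{df/2 - 1} (x/2)^k / k!
# (chisq_upper_even_df is a hypothetical helper, not part of this module).
sub chisq_upper_even_df {
    my ($x, $df) = @_;
    die "df must be a positive even integer\n" if $df < 2 or $df % 2;
    my $half = $x / 2;
    my ($term, $sum) = (1, 1);      # the k = 0 term
    for my $k (1 .. $df / 2 - 1) {
        $term *= $half / $k;        # (x/2)^k / k!
        $sum  += $term;
    }
    return exp(-$half) * $sum;
}

# 9.49 is close to the conventional .05 critical value for 4 df:
my $p = chisq_upper_even_df(9.49, 4);
printf "p = %.4f\n", $p;
```

Note that C<p_value> additionally halves this probability when B<tails> is 1.
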
=cut
sub p_value {
my $self = shift;
my $args = ref $_[0] ? shift : {@_};
my ($psisq, $df, $p_value) = $self->psisq($args);
$p_value = Math::Cephes::igamc($df/2, $psisq/2);
$args->{'tails'} ||= 1;
$p_value /= 2 if $args->{'tails'} == 1;
return $p_value;
}
*test = \&p_value;
*vnomes_test = \&p_value;
*vnt = \&p_value;
=head2 dump
$vnomes->dump(length => 3, values => {psisq => 1, p_value => 1}, format => 'table|labline|csv', flag => 1, precision_s => 3, precision_p => 7, verbose => 1, tails => 1);
Print Vnome-test results to STDOUT. See L<dump|Statistics::Sequences/dump> in the Statistics::Sequences manpage for details. If B<verbose> => 1, then you get (1) the actual test-statistic depending on the value of B<delta> tested (I<delta^2-psi^2> for the second difference measure (default), I<delta-psi^2> for the first difference measure, and I<psi^2> for the raw measure), followed by degrees-of-freedom in parentheses; and (2) a warning, if relevant, that your B<length> value might be too large with respect to the sample size (see NIST reference, above, in discussing B<length>). If B => 1, you just get the average observed and expected frequencies for each v-nome, the I-value, and its associated I<p>-value.
=cut
sub dump {
my $self = shift;
my $args = ref $_[0] ? $_[0] : {@_};
$args->{'stat'} = 'vnomes';
$args->{'stat'} .= " ($args->{'length'})" if defined $args->{'length'};
$self->SUPER::dump($args);
if ($args->{'verbose'} && $args->{'format'} eq 'table') {
if ($args->{'values'}->{'psisq'}) {
my $delta = defined $args->{'delta'} ? $args->{'delta'} : 2;
my ($psisq, $df) = $self->psisq($args);
my $df_str = 'degree';
$df_str .= 's' if $df != 1;
if ($delta == 0) { # Raw psisq:
print "psisq is the Kendall-Babington Smith statistic, calculated without backward differencing, and has $df $df_str of freedom.\n";
}
elsif ($delta == 1) { # First backward difference:
print "psisq is Good's delta-psi^2, calculated with first backward differences, and has $df $df_str of freedom.\n";
}
else {
print "psisq is Good's delta^2-psi^2, calculated with second backward differences, and has $df $df_str of freedom.\n";
}
}
if ($args->{'values'}->{'p_value'}) {
print 'p-value is ', ($args->{'tails'} || 1), "-tailed.\n"; # tails defaults to 1, as in p_value()
}
}
return $self;
}
=head2 nnomes
$r = $vnomes->nnomes(length => int, states => \@ari); # minimal option required; the "v" value itself, with number of states
$r = $vnomes->nnomes(length => int); # minimal option required, assuming data are loaded so can read its number of states
$r = $vnomes->nnomes(length => int, data=> \@data); # minimal option required; the "v" value itself, with these data
Returns the number of possible subsequences of the given length (I<v>) for the given number of states (I<t>). This is the quantity denoted as I<r> in Good's (1953, 1957) papers; i.e.,

=for html <p>&nbsp;&nbsp;<i>r</i> = <i>t</i><sup><i>v</i></sup></p>

The routine needs to know two things: the "v" value itself, i.e., the length of the possible subsequences to test (I<mono>nomes, I<di>nomes, I<tri>nomes, etc.), and the number of states (events, letters, etc.) that the process generating the data could take (from 1 to whatever). The former must always be specified by the named argument B<length>. The latter (the number of states) can be taken directly from the named argument B<nstates>, indirectly from the size of the array referenced by the named argument B<states>, from whatever states exist in the array referenced by the named argument B<data>, or from whatever states exist in data already loaded.
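
The order in which the number of states is looked up can be sketched in stand-alone form as follows. This is only an illustration: C<count_states> is a hypothetical helper, and the sketch leaves out the fall-back to data already loaded.

```perl
use strict;
use warnings;

# Hypothetical stand-alone version of the look-up order described above:
# an explicit count, then a list of states, then the data themselves.
sub count_states {
    my %args = @_;
    return $args{nstates} if $args{nstates};
    return scalar @{ $args{states} } if ref($args{states}) eq 'ARRAY';
    my %seen = map { $_ => 1 } @{ $args{data} };
    return scalar keys %seen;
}

my @data = qw/E O E E O E E E O E E E O E O E/;
my $t = count_states(data => \@data);    # 2 states: E and O
my $v = 3;                               # trinomes
my $r = $t ** $v;                        # 8 possible trinomes
print "t = $t, v = $v, r = $r\n";
```
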
=cut
sub nnomes {
my $self = shift;
my $args = ref $_[0] ? $_[0] : {@_};
my $t = _get_t($self, $args);
my $v = $args->{'length'} || croak __PACKAGE__, '::nnomes Must define argument \'length\' with a value greater than zero';
return $t**$v;
}
=head2 prob_r
$Pr = $vnomes->prob_r(length => $v); # length is 1 (mononomes) by default
Returns the probability of the occurrence of any of the individual elements ("digits") in the sequence (I<v> = 1), or of the given B<length>, assuming they are equally likely and independent.

=for html <p>&nbsp;&nbsp;<i>P</i><sub><i>r</i></sub> = <i>t</i><sup>&ndash;<i>v</i></sup></p>

=cut
sub prob_r {
my $self = shift;
my $args = ref $_[0] ? $_[0] : {@_};
my $t = _get_t($self, $args);
my $v = $args->{'length'} || 1;
return $t**(-1 * $v);
}
=head1 OPTIONS
Options common to the above stats methods.
=head2 length
This is currently a I<required> "option", giving the length of the v-nome of interest, i.e., the value of I<v>: an integer greater than or equal to 1, and smaller than the sample-size.

What is a meaningful maximal value of B<length>? As a I<chi>-square test, it is conventionally required that the expected frequency is at least 5 for each v-nome (Knuth, 1998). This can be judged to be too conservative (Delucchi, 1993). The NIST documentation on the serial test (Rukhin et al., 2010) recommends that B<length> should be less than the floored value of log2 of the sample-size, minus 2. No tests of these recommendations are made here.
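
As these recommendations are not checked by the module, they can be checked before calling a test. The following is a minimal sketch of both rules of thumb; the function names are hypothetical, and the expected frequency is taken to be the sample-size divided by the number of possible v-nomes (the circularized case).

```perl
use strict;
use warnings;

# floor(log2(n)) by repeated doubling, avoiding floating-point log rounding.
sub floor_log2 {
    my $n = shift;
    die "n must be at least 1\n" if $n < 1;
    my $k = 0;
    $k++ while 2 ** ($k + 1) <= $n;
    return $k;
}

# NIST-style bound: length should be LESS THAN floor(log2(n)) - 2.
sub nist_length_bound {
    my $n = shift;
    return floor_log2($n) - 2;
}

# Largest length whose expected v-nome frequency n / t**v is at least $min.
sub max_length_by_expected {
    my ($n, $t, $min) = @_;
    die "need at least two states\n" if $t < 2;
    my $v = 0;
    $v++ while $n / $t ** ($v + 1) >= $min;
    return $v;
}

my $n = 1000;
printf "n = %d: length < %d (NIST), length <= %d (expected frequency >= 5, binary data)\n",
    $n, nist_length_bound($n), max_length_by_expected($n, 2, 5);
```
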
=head2 circularize
By default, L</observed> and L</expected> counts, and the value of L</psisq>, are made by treating the sequence as a cyclic one, where the first element of the sequence follows the last one. This affects (and simplifies) the calculation of the expected frequency of each v-nome, and so the value of each psi-square. Also, circularizing ensures that the expected frequencies are accurate; otherwise, they might only be approximate. As Good and Gover (1967) state, "It is convenient to circularize in order to get exact checks of the arithmetic and also in order to simplify some of the theoretical formulae" (p. 103). These methods, however, can also treat the sequence non-cyclically by calling them with B<circularize> => 0.
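
The effect of circularizing on the counts and on the expected frequency can be sketched as follows. This is an illustration only, not the module's own counting routine: C<count_vnomes> is a hypothetical helper, and states are assumed to be single characters.

```perl
use strict;
use warnings;

# Count overlapping v-nomes in a sequence. If $circ is true, the first v - 1
# elements are appended so that windows wrap around the end of the sequence.
# (count_vnomes is a hypothetical helper; states are assumed to be single
# characters, so that joining them gives unambiguous keys.)
sub count_vnomes {
    my ($data, $v, $circ) = @_;
    my @seq = @$data;
    push @seq, @{$data}[0 .. $v - 2] if $circ;
    my %freq;
    $freq{ join '', @seq[$_ .. $_ + $v - 1] }++ for 0 .. $#seq - $v + 1;
    return \%freq;
}

my @seating = qw/E O E E O E E E O E E E O E O E/;
my ($n, $t, $v) = (scalar @seating, 2, 3);

my $circ   = count_vnomes(\@seating, $v, 1);    # N windows
my $uncirc = count_vnomes(\@seating, $v, 0);    # N - v + 1 windows

my $exp_circ   = $n / $t ** $v;                 # 16 / 8 = 2
my $exp_uncirc = ($n - $v + 1) / $t ** $v;      # 14 / 8 = 1.75

printf "%s: circ = %d, uncirc = %d\n", $_, $circ->{$_}, $uncirc->{$_} || 0
    for sort keys %$circ;
```
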
=head2 states
Optionally send a referenced array listing the unique states (or 'events', 'letters') in the population from which the sequence was sampled, e.g., B<states> => [qw/A C G T/]. This is useful if the sequence itself might not include all the possible states. If this is not specified, the states are identified from the sequence itself. If a list of states is given, each test checks that the sequence contains I<only> those elements in the list.
=cut
# PRIVATE METHODS
sub _count_lim { # N - v + 1: the number of overlapping v-nome windows in the data; e.g., if N = 5 and v = 3, a v-nome can start only at each of the first 3 elements of the data
return $_[0] - $_[1] + 1;
}
sub _get_n { # the size of the sequence
my ($self, $args) = @_;
my $data = ref $args->{'data'} ? $args->{'data'} : $self->read($args);
return scalar @$data;
}
sub _get_t { # the number of unique states; e.g., 2 for a binary sequence; 10 for a typical random digit sequence (0 .. 9)
my ($self, $args) = @_;
my $t;
if ($args->{'nstates'}) {
$t = $args->{'nstates'};
}
elsif (ref $args->{'states'} and ref $args->{'states'} eq 'ARRAY') {
$t = scalar(@{$args->{'states'}});
}
else {
my $data = ref $args->{'data'} ? $args->{'data'} : $self->read($args);
my %hash = map { $_, 1 } @{$data};
my $states = [keys %hash];
$t = scalar(@{$states});
}
return $t;
}
sub _get_stateslist {
my ($data, $states) = @_;
if (! ref $states) { # Get states from the data themselves:
my %hash = map { $_, 1 } @{$data};
$states = [keys %hash];
}
else { # Ensure that the data only contain states in the given list:
my ($g, $h) = ();
DATA:
foreach $g(@{$data}) {
foreach $h(@{$states}) {
next DATA if $h eq $g;
}
croak __PACKAGE__, "::test The element $g in the data is not represented in the given states";
}
}
#croak __PACKAGE__, '::test At least two different values must be in the sequence to test its sub-sequences' if $t <= 1;
return $states;
}
sub _frequencies {
my ($data_i, $v_i, $states_aref) = @_;
# Get a list of all possible combinations of states at the current length ($v_i):
my @variations = variations_with_repetition($states_aref, $v_i);
# Count up the frequency of each variation in the data:
my $num = scalar(@{$data_i});
my ($i, $probe_str, $test_str, %r_freq) = ();
foreach (@variations) {
$probe_str = join'', @{$_};
$r_freq{$probe_str} = 0;
for ($i = 0; $i < _count_lim($num, $v_i); $i++) {
$test_str = join'', @{$data_i}[$i .. ($i + $v_i - 1)];
$r_freq{$probe_str}++ if $probe_str eq $test_str;
}
}# print "FREQ:\n"; while (my($key, $val) = each %freq) { print "\t$key = $val\n"; } print "\n";
return \%r_freq;
}
sub _sel_psisq_and_df {# get psisq and its df
my ($v, $t, $psisq_v, $delta) = @_;
my ($psisq, $df) = ();
if ($v == 1) { # psisq is asymptotically distributed chisq, can use psisq for chisq distribution:
$psisq = $psisq_v->{1};
$df = $t - 1;
}
else {
if ($delta == 0) { # Raw psisq:
$psisq = $psisq_v->{$v};
$df = $t**$v - 1; # Good (1957, Eq. 6) - if circularized or not
}
elsif ($delta == 1) { # First backward difference:
$psisq = $psisq_v->{$v} - ($v - 1 <= 0 ? 0 : $psisq_v->{$v - 1});
$df = $t**$v - $t**($v - 1);
}
else { # $delta == 2 # Second backward difference (default):
$psisq = $psisq_v->{$v} - ( 2 * ($v - 1 <= 0 ? 0 : $psisq_v->{$v - 1}) ) + ($v - 2 <= 0 ? 0 : $psisq_v->{$v - 2});
$df = ( $t**($v - 2) ) * ( $t - 1)**2;
}
}
return ($psisq, $df);
}
sub _get_v_ari {
my $v = shift; # List the sequence lengths v, v-1, and v-2 ( = $v_i) for which psi-square values are needed, dropping any length less than 1:
my @ari = ();
foreach (0 .. 2) {
$v > $_ ? push @ari, $v - $_ : last;
} # print "v ari = ", join(' ', @ari), "\n"; #push @ari, $v - 1 if $v >= 2; #push @ari, $v - 2 if $v >= 3;
return \@ari;
}
sub _psisq_uncirc {# Compute psi^2 for uncircularized sequence from freq of each variation of length $v for given states(Good, 1953, Eq. 1):
my ($n, $t, $v, $r_freq) = @_;
my $k = ($n - $v + 1) / $t**$v;
my $sum = sum( map{ ($_ - $k)**2 } values %{$r_freq});#foreach (keys %{$r_freq}) {$psisq += ($r_freq->{$_} - $k)**2;}
return $sum / $k;
}
sub _psisq_circ {
my ($n, $t, $v, $r_freq) = @_;# Compute psi^2 for circularized sequence from freq of each variation of length $v for given states(Good, 1953, Eq. 2):
my $k = $n * $t**(-1 * $v);
my $sum = sum( map{ ($_ - $k)**2 } values %{$r_freq}); # sum of squared deviations from the expected frequency
return $sum / $k;
}
__END__
=head1 EXAMPLE
=head2 Seating at the diner
This is the data from Swed and Eisenhart (1943) also given as an example for the L and L. It lists the occupied (O) and empty (E) seats in a row at a lunch counter.
Have people taken up their seats on a random basis - or do they show some social phobia, or are they trying to pick up? What does the test of Vnomes reveal?
use Statistics::Sequences::Vnomes;
my $vnomes = Statistics::Sequences::Vnomes->new();
my @seating = (qw/E O E E O E E E O E E E O E O E/);
$vnomes->load(\@seating);
$vnomes->dump(length => 3, values => {z_value => 1, p_value => 1}, format => 'labline', flag => 1, precision_s => 3, precision_p => 3, tails => 1);
This prints:
z_value = 2.015, p_value = 0.022*
That is, the observed frequency of each possible trio of seating arrangements (the trinomes OOO, OOE, OEE, EEE, etc.) differed significantly from that expected. Look up the observed frequencies for each possible trinome to see if this is because there are more empty or occupied neighbouring seats ("phobia" or "philia"):
$vnomes->dump(length => 3, values => {observed => 1}, format => 'labline');
This prints:
observed = ('OEE' = 4,'EEO' = 4,'EEE' = 2,'OEO' = 1,'EOE' = 5,'OOO' = 0,'OOE' = 0,'EOO' = 0)
As the chance-expected frequency is 2 (16 observations over 2^3 = 8 possible trinomes; see the L</expected> method), trinomes made up mostly of empty seats (EOE, OEE, EEO) clearly occur more often than expected, while trinomes with adjacent occupied seats never occur, suggesting that a non-random factor like social phobia (or body odour?) is at work in sequencing people's seating here. Noting that the sequencing isn't significant for dinomes (with B<length> => 2) might also tell us something about what's going on. What happens for v-nomes of 4 or more in length? Maybe the L or L test might be a better summary of what's going on.
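
The reported I<p>-value can be retraced by hand. The following sketch, which uses only core Perl, assumes the defaults described above (circularized counts, second backward differences, one tail); C<psisq_circ> is a hypothetical helper, and the closed form used for the chi-square tail holds only because there are 2 degrees of freedom here.

```perl
use strict;
use warnings;
use List::Util qw(sum);

my @seating = qw/E O E E O E E E O E E E O E O E/;
my $n = scalar @seating;    # 16 observations
my $t = 2;                  # two states: E and O

# Circularized psi-square for v-nomes of length $v (hypothetical helper).
sub psisq_circ {
    my ($v) = @_;
    my @seq = (@seating, @seating[0 .. $v - 2]);    # wrap the first v - 1 elements
    my %freq;
    $freq{ join '', @seq[$_ .. $_ + $v - 1] }++ for 0 .. $n - 1;
    my $k      = $n / $t ** $v;                     # expected frequency of each v-nome
    my $unseen = $t ** $v - scalar(keys %freq);     # v-nomes that never occur
    return (sum(map { ($_ - $k) ** 2 } values %freq) + $unseen * $k ** 2) / $k;
}

my ($psi1, $psi2, $psi3) = map { psisq_circ($_) } 1 .. 3;    # 2.25, 5.5, 15
my $delta2 = $psi3 - 2 * $psi2 + $psi1;                      # second backward difference
my $df     = $t ** (3 - 2) * ($t - 1) ** 2;                  # t^(v-2) * (t-1)^2
my $p      = exp(-$delta2 / 2) / 2;    # chi-square upper tail for df = 2, halved for one tail

printf "delta^2-psi^2 = %.2f, df = %d, p = %.3f\n", $delta2, $df, $p;
```

This prints C<delta^2-psi^2 = 6.25, df = 2, p = 0.022>, in agreement with the I<p>-value shown above.
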
=head1 REFERENCES
Davis, J. W., & Akers, C. (1974). Randomization and tests for randomness. I<Journal of Parapsychology>, I<38>, 393-407.
Delucchi, K. L. (1993). The use and misuse of chi-square: Lewis and Burke revisited. I<Psychological Bulletin>, I<94>, 166-176.
Gatlin, L. L. (1979). A new measure of bias in finite sequences with applications to ESP data. I<Journal of the American Society for Psychical Research>, I<73>, 29-43. (Used for one of the reference tests in the CPAN distribution.)
Good, I. J. (1953). The serial test for sampling numbers and other tests for randomness. I<Proceedings of the Cambridge Philosophical Society>, I<49>, 276-284.
Good, I. J. (1957). On the serial test for random sequences. I<Annals of Mathematical Statistics>, I<28>, 262-264.
Good, I. J., & Gover, T. N. (1967). The generalized serial test and the binary expansion of [square-root]2. I<Journal of the Royal Statistical Society, Series A>, I<130>, 102-107.
Kendall, M. G., & Babington Smith, B. (1938). Randomness and random sampling numbers. I<Journal of the Royal Statistical Society>, I<101>, 147-166.
Knuth, D. E. (1998). I<The art of computer programming> (3rd ed., Vol. 2 Seminumerical algorithms). Reading, MA, US: Addison-Wesley.
Rukhin, A., Soto, J., Nechvatal, J., Smid, M., Barker, E., Leigh, S., et al. (2010). A statistical test suite for random and pseudorandom number generators for cryptographic applications. Retrieved September 4 2010, from L, and July 17, 2013, from L (revised).
=head1 SEE ALSO
L<Statistics::Sequences|Statistics::Sequences> sub-modules for other tests of sequences, and for sharing data between these tests.
=head1 TO DO/BUGS
Handle non-overlapping v-nomes.
=head1 AUTHOR/LICENSE
=over 4
=item Copyright (c) 2006-2013 Roderick Garton
rgarton AT cpan DOT org
This program is free software. It may be used, redistributed and/or modified under the same terms as Perl-5.6.1 (or later) (see L).
=back
=head1 DISCLAIMER
To the maximum extent permitted by applicable law, the author of this module disclaims all warranties, either express or implied, including but not limited to implied warranties of merchantability and fitness for a particular purpose, with regard to the software and the accompanying documentation.
=head1 END
This ends documentation of the Perl implementation of the I<psi>-square statistic, Kendall-Babington Smith test, and Good's Generalized Serial Test, for randomness in a sequence.
=cut