The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.
#!/usr/bin/env perl
use warnings;
use strict;
use Getopt::Long;
use Regexp::Common;
use WWW::Mechanize;
use Test::WWW::Simple;

use constant SHEBANG => '#!/usr/bin/env perl';
my($generate, $run, $warn);

GetOptions('generate' => \$generate,
           'run'      => \$run,
           'warn'     => \$warn);

# Assume run if no flags at all.
$run++ unless $run or $generate;

my @tests;
my @countries;
my $agent = "Windows IE 6";
my $number_of_tests;
my @lines = <>;
while(@lines) {
  $_ = shift @lines;
  chomp;
  # Discard comments.
  /^#/ and next;

  # Discard blank lines.
  /^\s*$/ and next;

  # First, look for any of our pragmas.
  # These all start with %% and a pragma name.

  # %%cache
  # This pragma tells simple_scan to cache all URLs.
  /^%%cache/ and do {
    push @tests, "cache();\n";
    next;
  };

  # %%nocache
  # This pragma tells simple_scan to not cache URLs.
  /^%%nocache/ and do {
    push @tests, "no_cache();\n";
    next;
  };

  # %xx: es de au ..
  # This pragma lets us define a list of country IDs
  # to be substituted into the URLs following it. We look for ">xx<"
  # in the URLs to find the characters we'll replace.
  /^%%xx:/ and do {
    if (/^%%xx: (([a-z1-9]{2,}(\s+|$))+)/) {
      my $countries = $1;
      @countries = split(/\s+/, $countries);
    }
    else {
      die "Invalid '%%xx:' pragma: must be '%%xx: ' followed by two (or more) character country IDs.\n";
    }
    next;
  };

  # %agent: agent_alias
  # This pragma tells us to switch the user agent to the one specified.
  # if it's not a valid user agent, we die.
  /%%agent: (.*)/ and do {
    $agent = $1;
    my @aliases = WWW::Mechanize::known_agent_aliases(); 
    unless (grep {/$agent/} @aliases) {
      die "$agent is an invalid user agent alias/";
    }
    push @tests, qq!Test::WWW::Simple::user_agent("$agent");\n!;
    next;
  };

  # if an xx: pragma is in effect, substitute the country for ">xx<" 
  # everywhere in the input record (so we can include the country in the
  # comment!).
  if (@countries) {
    my @localized;
    for my $country (@countries) {
      my $localized;
      # don't create localized tests if the test is not localizable.
      unless (/>xx</) {
        emit_a_test($_); 
        last;
      }
      # Localize.
      ($localized = $_) =~ s/>xx</$country/g;
      push @localized, $localized;
    }
    emit_a_test(@localized);

    # Discard the unlocalized line.
    next;
  }

  # No localization in effect. Just make a test.
  else {
    emit_a_test($_);
  }

}

if (defined $number_of_tests) {
  unshift @tests, 
    "@{[SHEBANG]}\nuse Test::WWW::Simple tests=>$number_of_tests;\n";
  print @tests if $generate;
  eval(join '',@tests) if $run;
  $@ and warn $@,"\n";
} 
else {
  warn "# No tests were found in your input file.\n"
    if $warn;
}

sub emit_a_test {
  my (@input) = @_;
  # All these extra undefs are the result of using Regexp::Common to 
  # capture the first two fields. Less confusing and error-prone than
  # actually coding the regexes oneself.
  local $_;
  for (@input) {
    s/>agent</$agent/;
    my($url,   undef, undef, undef, undef, undef, undef, undef,
       undef, undef, $regex, undef, 
       undef, $switches,
       $which, 
       @comment) =
      m[$RE{URI}{HTTP}{-keep}\s+              # a URL
        $RE{delimited}{-delim=>'/'}{-keep}    # a regex, in slashes
        (([sixogim]+)\s+|\s+)                 # possibly with switches
        ([yY]|[nN])\s+                              # should/shoudn't match
        (.*)$]x;                              # test comment

    # Clean things up a bit.

    # Make sure the "which" is uppercase for later tests.
    $which = uc($which) if defined $which;

    # Add url and which way the pattern should match to the comment.
    push @comment, "[$url]" if defined $url;

    # avoid a warning if the test spec had a syntax error.
    push @comment, "[/$regex/ " . ($which eq 'Y' ? "should" : "shouldn't") . " match]"
     if defined $regex and defined $which;

    # Avoid undef warnings if no regex switches were used.
    $switches = "" unless defined $switches;

    # Warn about possible errors.  
    unless (defined $url and defined $regex and defined $which) {
      push @tests, "# $_\n","# Possible syntax error in this test spec\n";
      warn "$_: syntax error" if $warn;
      next;
    }

    push @tests, qq!page_@{[$which eq 'Y' ? "" : "un"]}! .
                 qq!like("$url", qr/$regex/$switches, "@{[join " ",@comment]}");\n!;
    $number_of_tests++;
  }
}
__END__

=head1 NAME

simple_scan - scan a set of Web pages for strings present/absent

=head1 SYNOPSIS

  simple_scan [--generate] [--run] {file file file ...}

=head1 USAGE

  # Run the tests in the files supplied on the command line.
  # --run (or -run; we're flexible) is assumed if you give no switches.
  % simple_scan file1 file2 file3

  # Generate a set of tests and save them, then run them.
  % <complex pipe> | simple_scan --generate > pipe_scan.t

  # Run one simple test
  % echo "http://yahoo.com yahoo Y Look for yahoo.com"  | simple_scan -run

=head1 DESCRIPTION

C<simple_scan> reads either files supplied on the command line, or standard
input. It creates and runs, or prints, or even both, a L<Test::WWW::Simple>
test for the criteria supplied to it.

C<simple_scan>'s input should be in the following format:

  <URL> <pattern> <Y|N> <comment>

The I<URL> is any URL; I<pattern> is a Perl regular expression, delimited by
slashes; I<Y|N> is C<Y> if the pattern should match, or C<N> if the pattern 
should B<not> match; and I<comment> is any arbitrary text you like (as long as it's all on the same line as everything else).

=head1 COMMAND-LINE SWITCHES

We use L<Getopt::Long> to get the command-line options, so we're really very
flexible as to how they're entered. You can use either one dash (as in
C<-foo>) or two (as in C<--bar>). You only need to enter the minimum number
or characters to match a given switch.

=over 4

=item C<--run>

C<--run> tells C<simple_scan> to immediately run the tests it's created. Can
be abbreviated to C<-r>.

This option is mosst useful for one-shot tests that you're not planning to
run repeatedly.

=item C<--generate>

C<--generate> tells C<simple_scan> to print the test it's generated on the
standard output.

This option is useful to build up a test suite to be reused later.

=item C<--warn>

C<--warn> turns on the reporting of minor problems (at present, only
"there are no tests in your input file"). 

Useful if you're autogenerating tests and want to make sure that
in fact you did actually generate some.

=back

Both C<-r> and C<-g> can be specified at the same time to run a test and print 
it simultaneously; this is useful when you want to save a test to be run later 
as well as right now without having to regenerate the test.

=head1 PRAGMAS

Pragmas are ways to influence what C<simple_scan> does when generating tests.
They don't output anything themselves.

Pragmas are specified with C<%%> in column 1 and the pragma name immediately
following. Any arguments are supplied after a colon, like this:

   %%foo: bar baz

This invokes the C<foo> pragma with the argument C<bar baz>.

=head2 cache

This turns on C<Test::WWW::Simple>'s internal URL caching. If you run more than
one test against a specific URL, C<Test::WWW::Simple> will reuse its 
internally-saved version of the page instead of refetching it. Useful for speed
if you know your URLs shouldn't change on refetch, or if you want to "freeze"
the page for a while to run several different tests on it (for example, to
check for internal consistency).

=head2 nocache

This is the default; if you reaccess a page, C<Test::WWW::Simple> goes out and
gets it again. Slower, but if you really expect your page to change when it's
fetched again and you want this to happen, specify C<%%nocache>.

=head2 xx

The C<xx> pragma allows for very simple-minded internationalization. It assumes
that you want to substitute each of a list of two-character country codes into
a string (most likely somewhere in the URL, but possibly in the comment too). 
C<simple_scan> will do this for you, creating a test for each country code
you specify. For instance:

   %%xx: es au my jp
   http://>xx<.mysite.com/     /blargh/  Y  look for blargh (>xx<)

This would generate 4 tests, for C<es.mysite.com>, C<au.mysite.com>, 
c<my.mysite.com>, and C<jp.mysite.com>, all looking to match C<blargh> 
somewhere on the page.

=head2 agent

The C<agent> pragma allows you to switch user agents during the test. 
C<Test::WWW::Simple>'s default is C<Windows IE 6>, but you can switch it
to any of the other user agents supported by C<WWW::Mechanize>.

   http://gemal.dk/browserspy/basic.html /Explorer/ Y Should be Explorer
   %%agent: Mac Safari
   http://gemal.dk/browserspy/basic.html /Safari/ Y Should be Safari

=head1 AUTHOR

Joe McMahon E<lt>mcmahon@yahoo-inc.comE<gt>

=head1 COPYRIGHT AND LICENSE

Copyright (c) 2005 by Yahoo!

This script is free software; you can redistribute it or modify it under the
same terms as Perl itself, either Perl version 5.6.1 or, at your option, any
later version of Perl 5 you may have available.