NAME
    WWW::RoboCop - Police your URLs!

VERSION
    version 0.000002

SYNOPSIS
        use feature qw( state );

        use WWW::RoboCop;

        my $robocop = WWW::RoboCop->new(
            is_url_whitelisted => sub {
                state $count = 0;
                return $count++ < 5; # just crawl 5 URLs
            },
        );

        $robocop->crawl( 'http://host.myhost.com/start' );

        my %history = $robocop->get_report;

        # %history = (
        #    'http://myhost.com/one' => { status => 200, ... },
        #    'http://myhost.com/two' => { status => 404, ... },
        #    ...
        # )

DESCRIPTION
    BETA BETA BETA!

    "WWW::RoboCop" is a dead simple, somewhat opinionated robot. Given a
    starting page, this module will crawl only URLs which have been
    whitelisted by the "is_url_whitelisted" callback. It then creates a
    report of all visited pages, keyed on URL. You are encouraged to provide
    your own report creation callback so that you can collect all of the
    information which you require for each URL.

CONSTRUCTOR AND STARTUP
  new()
    Creates and returns a new "WWW::RoboCop" object.

    Below are the arguments which you may pass to "new" when creating an
    object.

   is_url_whitelisted
    This argument is required. You must provide an anonymous subroutine
    which will return true or false based on some arbitrary criteria which
    you provide. The two arguments to this anonymous subroutine will be a
    WWW::Mechanize::Link object as well as the referring URL, in the form of
    a URI object.

    Your sub might look something like this:

        use feature qw( state );

        use URI;
        use WWW::RoboCop;

        my $robocop = WWW::RoboCop->new(
            is_url_whitelisted => sub {
                my $link          = shift;
                my $referring_url = shift;

                my $upper_limit = 100;
                my $host = 'some.host.com';

                state $limit = 0;

                return 0 if $limit > $upper_limit;
                my $uri = URI->new( $link->url_abs );

                # if the referring_url matches the host then this is a 1st degree
                # outbound web link

                if ( $uri->host eq $host || $referring_url->host eq $host ) {
                    ++$limit;
                    return 1;
                }
                return 0;
            }
        );

   report_for_url
    This argument is not required, but is highly recommended. The arguments
    to this anonymous subroutine will be an HTTP::Response object as well as
    the referring URL in the form of a URI object. Your sub might look
    something like this:

        my $reporter = sub {
            my $response      = shift;    # HTTP::Response object
            my $referring_url = shift;    # URI object
            return {
                redirects => [
                    map { +{ status => $_->code, uri => $_->base->as_string } }
                        $response->redirects
                ],
                referrer => $referring_url,
                status   => $response->code,
            };
        };

        my $robocop = WWW::RoboCop->new(
            is_url_whitelisted => sub { ... },
            report_for_url     => $reporter,
        );

    That would give you a HashRef with the status code for each link visited
    (200, 404, 500, etc.), the referring URL (the page on which the link was
    found), and a list of any redirects which were followed in order to get
    to this URL.

    The default "report_for_url" sub already provides something like the
    above, but you should treat it only as a stub while you get up and
    running. Since it is only meant to be an example, the format of the
    default report could change at some future date without notice, so do
    not rely on it remaining consistent. If you are going to rely on this
    module, you should provide your own reporting logic.

   ua( WWW::Mechanize )
    You can provide your own UserAgent object to this class. It should be of
    the WWW::Mechanize family. If you're looking for a significant speed
    boost while under development, consider providing a
    WWW::Mechanize::Cached object. This can give you enough of a speedup to
    save you from getting distracted and going off to read Hacker News while
    you wait.

        use CHI;
        use WWW::Mechanize::Cached;
        use WWW::RoboCop;

        my $cache = CHI->new(
            driver   => 'File',
            root_dir => '/tmp/cache-example',
        );

        my $robocop = WWW::RoboCop->new(
            is_url_whitelisted => sub { ... },
            ua => WWW::Mechanize::Cached->new( cache => $cache ),
        );

    If you're not using a Cached agent, be sure to disable autocheck.

        my $robocop = WWW::RoboCop->new(
            is_url_whitelisted => sub { ... },
            ua => WWW::Mechanize->new( autocheck => 0 ),
        );

  crawl( $url )
    This method sets the "WWW::RoboCop" in motion. The robot will only come
    to a halt once it has exhausted all of the whitelisted URLs it can find.
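
    As a sketch (the host name below is just a placeholder), a whitelist
    which only follows links on the starting host guarantees that the crawl
    will eventually halt:

        use URI;
        use WWW::RoboCop;

        my $robocop = WWW::RoboCop->new(
            # only follow links that stay on the starting host, so the
            # crawl stops once that host has been exhausted
            is_url_whitelisted => sub {
                my $link = shift;
                return URI->new( $link->url_abs )->host eq 'host.myhost.com';
            },
        );

        $robocop->crawl( 'http://host.myhost.com/start' );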

  get_report
    This method returns a hash of crawling results, keyed on the URLs
    visited. By default, the entry for each URL is very simple, containing
    only its status code. You are encouraged to provide your own callback so
    that you can get a more detailed report returned to you. You can do this
    by providing a "report_for_url" callback when instantiating the object.

    The default report looks something like this:

        # %history = (
        #    'http://myhost.com/one' => { status => 200, ... },
        #    'http://myhost.com/two' => { status => 404, ... },
        # )

    See examples/crawl-host.pl, which is included with this distribution,
    for a dump of the default report.
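
    As a quick sketch (assuming the default report format shown above), you
    could walk the report and print the status of each visited URL like
    this:

        my %history = $robocop->get_report;

        for my $url ( sort keys %history ) {
            printf "%s %s\n", $history{$url}{status}, $url;
        }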

AUTHOR
    Olaf Alders <olaf@wundercounter.com>

COPYRIGHT AND LICENSE
    This software is Copyright (c) 2015 by MaxMind, Inc.

    This is free software, licensed under:

      The Artistic License 2.0 (GPL Compatible)