Parse-MediaWikiDump
Parse::MediaWikiDump is a collection of classes for processing various
MediaWiki dump files such as those at
http://download.wikimedia.org/wikipedia/en/; the package requires XML::Parser.
This software makes it straightforward to read the information in
supported dump files.
Currently the following dump files are supported:
* Current page dumps for all languages
* Current links dumps for all languages
INSTALLATION
To install this module, run the following commands:
    perl Makefile.PL
    make
    make test
    make install
EXAMPLE
Extract the text for a given article from the given dump file:
    #!/usr/bin/perl

    use strict;
    use warnings;
    use Parse::MediaWikiDump;

    my $file = shift(@ARGV) or die "must specify a MediaWiki dump of the current pages";
    my $title = shift(@ARGV) or die "must specify an article title";
    my $dump = Parse::MediaWikiDump::Pages->new($file);

    binmode(STDOUT, ':utf8');
    binmode(STDERR, ':utf8');

    #this is the only currently known value but there could be more in the future
    if ($dump->case ne 'first-letter') {
        die "unable to handle any case setting besides 'first-letter'";
    }

    #enforce the MediaWiki case rules
    $title = case_fixer($title);

    #iterate over the entire dump file, article by article
    while(my $page = $dump->next) {
        if ($page->title eq $title) {
            print STDERR "Located text for $title\n";
            #text returns a reference to the article text
            my $text = $page->text;
            print $$text;
            exit 0;
        }
    }

    print STDERR "Unable to find article text for $title\n";
    exit 1;

    #removes any case sensitivity from the very first letter of the title
    #but not from the optional namespace name
    sub case_fixer {
        my $title = shift;

        #check for namespace
        if ($title =~ /^(.+?):(.+)/) {
            $title = $1 . ':' . ucfirst($2);
        } else {
            $title = ucfirst($title);
        }

        return $title;
    }
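The case rule applied by case_fixer can be tried on its own, without a
dump file. The sketch below repeats the same logic from the example and
runs it on a few sample titles; the titles are hypothetical and chosen
only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same rule as case_fixer in the example: uppercase the first letter
# of the title, leaving any namespace prefix untouched.
sub case_fixer {
    my $title = shift;

    # everything up to the first colon is treated as the namespace
    if ($title =~ /^(.+?):(.+)/) {
        return $1 . ':' . ucfirst($2);
    }

    return ucfirst($title);
}

# Hypothetical sample titles, for illustration only.
for my $title ('perl', 'talk:perl', 'Category:wikis') {
    printf "%-16s => %s\n", $title, case_fixer($title);
}
```

Only the part after the first colon is changed, so "talk:perl" becomes
"talk:Perl" while the namespace name itself is left as typed.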
COPYRIGHT & LICENSE
Copyright 2005 Tyler Riddle, all rights reserved.
This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.