INSTALL - install help from scratch
DESCRIPTION
This document helps install PDF::OCR2 and like modules.
Perl is not the fastest medium, but it is convenient. As of this
writing, I work in a small accounting firm. We have a small IT
department that maintains a number of projects that make life easier, if
not just bearable, for human beings. I am able to maintain some fairly
decent and useful projects. Most are relatively complex, and without
perl I would not be able to do so.
One of the things I've worked on is a set of perl modules to
facilitate interaction with ocr engines. PDF::OCR2 allows you to get
text out of a pdf document, and if there is no text, it calls an ocr
engine to do the work for us. This has proven invaluable at my office. We
can look inside many thousands of documents without altering them in
any way.
I have received various emails asking for help installing PDF::OCR2,
Image::OCR::Tesseract, etc. The inquiries vary from 'will it work on
windows?' to 'why won't it install?'
This document contains the most common answers to these questions.
MAKE SURE YOU ARE READY TO INSTALL
You will need a decent computer and operating system. You will need root
access, access to cpan via the command line, and possibly a package
management system such as apt-get, yum, etc. You will be compiling a thing
or two from source.
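Before going further, it can save time to see which of the command line
tools named in this document are already on the system. The following is
only a sketch and is not part of PDF::OCR2. The function name check_tools
is made up for illustration, and the tool list is an assumption based on
the sections of this document (convert is the imagemagick command line
program).

```shell
#!/bin/sh
# check_tools: hypothetical helper, not part of any package.
# Prints one line per tool and returns non-zero if any is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tools this document relies on; adjust the list to taste.
check_tools perl cpan gcc g++ automake convert tesseract \
    || echo "Install the missing tools before continuing."
```

Anything reported as missing is covered by one of the install sections
in this document.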
posix operating system
You need a posix operating system. These are unixes and linuxes.
I've had excellent results installing on Fedora, Debian, and Ubuntu
servers.
hardware
I can't stress enough how much imaging procedures will abuse hardware.
Memory is not very important. The cpu, however, is very important.
I would not suggest a production server of anything less than a 1.2GHz
machine. Overall, I get better results on 64-bit architecture than on
32-bit.
Ideally speaking, I would have access to an IBM mainframe, but I don't.
The best I have gotten my hands on recently are dual core pentium IVs, and
they're really not bad. If your company or organization is willing to
devote a beefy server to manage ocr and imaging tasks, great. Otherwise,
the aforementioned machines will do well.
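If you are not sure what a given box has, a few standard commands will
tell you. This is a rough sketch that assumes a linux-like system;
/proc/cpuinfo in particular is linux specific, and not every platform
lists a clock speed there.

```shell
#!/bin/sh
# Report the machine architecture and the number of online cpus.
arch=$(uname -m)    # for example x86_64 on 64-bit Intel/AMD machines
cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo unknown)
echo "architecture: $arch"
echo "cpus online:  $cpus"

# On linux, /proc/cpuinfo usually lists the clock speed of each cpu.
if [ -r /proc/cpuinfo ]; then
    grep -i 'cpu mhz' /proc/cpuinfo \
        || echo "no clock speed listed in /proc/cpuinfo"
fi
```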
cpan, up to date
A lot of the requirements here are perl modules. You will be using cpan
via the command line. Having command line cpan access seems pretty
standard; I've never seen a unix box without it.
Old cpan commands worked as:
cpan install Module::Name
New cpan commands look like:
cpan Module::Name
You can update cpan by saying:
cpan install CPAN
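To see which version of CPAN.pm you have before and after updating, you
can ask perl directly. A small sketch; the function name cpan_version is
made up for illustration.

```shell
#!/bin/sh
# cpan_version: hypothetical helper; prints the version of CPAN.pm that
# perl can load, or a note if perl or CPAN.pm is not available.
cpan_version() {
    perl -MCPAN -e 'print "$CPAN::VERSION\n"' 2>/dev/null \
        || echo "CPAN.pm not available"
}

cpan_version
```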
root access
The installation procedures in this document assume you are logged in as
root.
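A quick way to confirm you really are root before starting is to look at
the effective user id, which is 0 for root. A minimal sketch; the
function name is_root is made up for illustration.

```shell
#!/bin/sh
# is_root: hypothetical helper; succeeds only when the effective user id is 0.
is_root() {
    [ "$(id -u)" -eq 0 ]
}

if is_root; then
    echo "running as root"
else
    echo "not root; switch user with 'su -' before installing"
fi
```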
ubuntu, enabling root access
By default, on ubuntu, the root account is disabled. It is suggested you
enable root access. I understand the reasoning, but personally I don't
like sudo. It seems they've disabled the account by simply not providing a
password for it. How 'clever'.
As of Ubuntu version 8.04, you need to enable root access by providing a
root password for the account.
$ sudo passwd root
It will prompt you for the new root password.
INSTALL NON PERL DEPENDENCIES
gcc-c++ and automake
fedora
$ yum install -y gcc-c++
$ yum install -y automake
ubuntu
$ apt-get install g++
$ apt-get install automake
imagemagick
fedora
$ yum install -y imagemagick
ubuntu
$ apt-get install imagemagick
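If you are not sure which of the two families a machine belongs to, you
can let the shell find out which package manager is present. This is only
a sketch: the function name pkg_manager is made up for illustration, and
the script prints the install command instead of running it, so you can
look it over first. Note that on debian and ubuntu the C++ compiler
package is named g++ (gcc-c++ is the fedora name).

```shell
#!/bin/sh
# pkg_manager: hypothetical helper that prints which package manager is on PATH.
pkg_manager() {
    if command -v yum >/dev/null 2>&1; then
        echo yum
    elif command -v apt-get >/dev/null 2>&1; then
        echo apt-get
    else
        echo none
    fi
}

# Print (rather than run) the install command for this system.
case "$(pkg_manager)" in
    yum)     echo "yum install -y gcc-c++ automake imagemagick" ;;
    apt-get) echo "apt-get install g++ automake imagemagick" ;;
    *)       echo "no yum or apt-get found; install the packages by hand" ;;
esac
```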
tesseract
Installing tesseract can be tricky. I don't know of an rpm or debian
package for this one. You'll very likely have to install this from
source. Make sure you have gcc-c++ and automake installed on your
system. If you don't know, you can proceed, but if you suffer any errors,
simply go back, install gcc-c++ and automake, and try again.
You may be able to simply install the SVN version of tesseract this way:
$ cd /tmp
$ svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr
$ cd tesseract-ocr
$ ./runautoconf
$ mkdir build-directory
$ cd build-directory
$ ../configure
$ make
$ make install
For more info, see the google project on ocr; they use tesseract.
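Once the build has finished, a quick smoke test shows whether imagemagick
and tesseract work together. The sketch below is not part of either
package; the function name ocr_smoke_test is made up for illustration. It
assumes that convert can render text (a default font is installed) and
that your tesseract build reads tif files. If all is well it prints
whatever tesseract recognized in the image.

```shell
#!/bin/sh
# ocr_smoke_test: hypothetical helper, not part of tesseract or imagemagick.
# Renders a word into an image with convert, then asks tesseract to read it.
ocr_smoke_test() {
    for tool in convert tesseract; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            return 1
        fi
    done
    dir=$(mktemp -d) || return 1
    status=0
    # Black text on a white background, saved as tif.
    # tesseract writes its result to <output base>.txt
    convert -background white -fill black -pointsize 48 \
            label:testing "$dir/sample.tif" \
        && tesseract "$dir/sample.tif" "$dir/sample" >/dev/null 2>&1 \
        && cat "$dir/sample.txt" \
        || status=1
    rm -rf "$dir"
    return "$status"
}

ocr_smoke_test || echo "smoke test did not complete"
```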
INSTALL PERL MODULES
Ideally, you could simply say:
cpan PDF::OCR2
And, voila, done. And potentially, this might work. If not, I suggest
installing the perl modules in the following order.
perl modules install order
$ cpan PDF::API2
$ cpan CAM::PDF
$ cpan PDF::Burst
$ cpan PDF::GetImages
$ cpan Image::OCR::Tesseract
$ cpan PDF::OCR2
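After working through the list, you can check which of the modules perl
is actually able to load. A minimal sketch; the function name
module_status is made up for illustration. It simply asks perl to load
each module and reports the result.

```shell
#!/bin/sh
# module_status: hypothetical helper that reports whether perl can load a module.
module_status() {
    if perl -M"$1" -e 1 >/dev/null 2>&1; then
        echo "ok:      $1"
    else
        echo "missing: $1"
    fi
}

# Same order as the install list.
for module in PDF::API2 CAM::PDF PDF::Burst PDF::GetImages \
              Image::OCR::Tesseract PDF::OCR2; do
    module_status "$module"
done
```

The first module reported as missing is the one to retry with cpan.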
Image::OCR::Tesseract
If the command 'cpan Image::OCR::Tesseract' fails, you will need to
download the package and install it manually from the distribution.
$ cd /tmp
$ wget http://search.cpan.org/src/LEOCHARRE/Image-OCR-Tesseract-1.22/
$ tar -xvf Image-OCR-T(tab completion)
$ cd Image-OCR-T(tab completion)
$ perl Makefile.PL # or you can do perl t/00(tab completion)
$ make
$ make test
$ make install
The perl Makefile.PL step will check for image libraries and the ocr
engine. You will need to have already installed imagemagick and
tesseract, as mentioned in this document.
Make sure you are getting the latest version of Image::OCR::Tesseract;
the above example is for version 1.22. I update frequently, so make
sure. You can find the latest version by going to
http://search.cpan.org and searching for 'Image::OCR::Tesseract'.
There are INSTALL.* readme files in the package Image::OCR::Tesseract
that you may want to look through.
Image::Magick
Should already be available (via previously installing imagemagick).
BUGS
I am very open to corrections, suggestions, hints, tips, and criticism. I
am not a know-it-all; I have been able to do and share some useful things
because of what I learn every day from my peers. Please contact the
AUTHOR.
AUTHOR
Leo Charre leocharre at cpan dot org