7 System components

All executables take a mandatory switch --jobname, which identifies both the crawl job to operate on and the job-specific configuration directory.
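For example, assuming a job named myjob (the name is hypothetical), each tool is invoked as

  combineExport --jobname myjob

and reads its configuration from the corresponding job directory under /etc/combine/.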

Briefly: combineINIT initializes the SQL database and the job-specific configuration directory; combineCtrl controls a Combine crawling job (start, stop, etc.) and prints some statistics; combineExport exports records in various XML formats; and combineUtil provides various utility operations on the Combine database.

Detailed dependency information can be found in the ‘Gory details’ section (section 10).

All man pages are collected in appendix A.5.

7.1 combineINIT

Creates and initializes a MySQL database and its tables. If the database already exists, it is dropped and recreated. A job-specific configuration directory is created in /etc/combine/ and populated with a default configuration file.

If a topic definition filename is given, focused crawling using this topic definition is enabled by default. Otherwise focused crawling is disabled, and Combine works as a general crawler.
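As a sketch, again assuming the hypothetical job name myjob, initialization with and without a topic definition might look as follows (the --topic switch name and the file path are assumptions to be checked against the man page in appendix A.5):

  combineINIT --jobname myjob
  combineINIT --jobname myjob --topic /etc/combine/myjob/topicdefinition.txt

The second form enables focused crawling with the given topic definition; the first creates a general crawler job.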

7.2 combineCtrl

Implements various control functions for administering a crawling job, such as starting and stopping crawlers, injecting URLs into the crawl queue, scheduling newly found links for crawling, and controlling scheduling. This is the preferred way of controlling a crawl job; a sketch of typical invocations follows.
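The action names below are inferred from the description above and should be verified against the man page in appendix A.5; the job name is hypothetical:

  combineCtrl load --jobname myjob < seeds.txt   # inject seed URLs into the crawl queue
  combineCtrl start --jobname myjob              # start crawler processes
  combineCtrl recyclelinks --jobname myjob       # schedule newly found links for crawling
  combineCtrl stat --jobname myjob               # print crawl statistics
  combineCtrl kill --jobname myjob               # stop the crawl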

7.3 combineUtil

Implements a number of utilities, both for extracting information from the crawl database and for database maintenance; the full list of operations is given in the man page in appendix A.5.
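For illustration only, invocations follow the same pattern as the other tools; the action names below are hypothetical placeholders for the actual operations:

  combineUtil stats --jobname myjob    # hypothetical action: print database statistics
  combineUtil sanity --jobname myjob   # hypothetical action: check and repair database consistency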

7.4 combineExport

Export of structured records is done according to one of three profiles: alvis, dc, or combine. alvis and combine are very similar XML formats, where combine is more compact with less redundancy and alvis contains some additional information. dc is XML-encoded Dublin Core data.

The alvis profile format is defined by the Alvis Enriched Document XML Schema.

For flexibility, the switch --xsltscript makes it possible to filter the output through an XSLT script. The script is fed a record according to the combine profile, and the result is exported.

The switches --pipehost and --pipeport make combineExport send its output directly to an Alvis pipeline reader instead of printing it on stdout. Together with the switch --incremental, which exports only the changes since the last invocation, this provides an easy way of keeping an external system such as Alvis or a Zebra database up to date.
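A sketch of export invocations using the switches described above; the --profile switch name, host name, and port number are assumptions, while the remaining switches are those named in the text:

  combineExport --jobname myjob --profile alvis > records.xml
  combineExport --jobname myjob --xsltscript filter.xsl > filtered.xml
  combineExport --jobname myjob --incremental --pipehost alvis.example.org --pipeport 3333

The first form writes alvis-profile records to stdout, the second filters combine-profile records through an (assumed) XSLT script filter.xsl, and the third sends only changed records to an Alvis pipeline reader.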

7.5 Internal executables and Library modules

combine is the main crawling engine in the Combine system, and combineRun starts, monitors, and restarts combine crawling processes.

7.5.1 Library

The main crawler-specific library components are collected in the Combine:: Perl namespace.
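Assuming a standard installation, the embedded documentation of an individual module can be read with perldoc; the module name below is a hypothetical example and may differ from the actual module set:

  perldoc Combine::Config    # hypothetical module name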