3 Configuration

Configuration files use a simple format consisting of either name/value pairs or complex variables in sections. Name/value pairs are encoded as single lines formatted like ’name = value’. Complex variables are encoded as multiple lines in named sections delimited as in XML, using ’<name> ... </name>’. Sections may be nested for related configuration variables. Empty lines and lines starting with ’#’ (comments) are ignored.
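The format described above can be illustrated with a small reader. This is a minimal sketch in Python, not Combine's own parser: the function name parse_config and the section and variable names outer, inner and depth are hypothetical, and it assumes that sections contain only name/value pairs or further sections and that leading whitespace is insignificant.

```python
import re

# Hypothetical sample: only Operator-Email is a real Combine variable name;
# the sections <outer>/<inner> and the variable depth are invented.
SAMPLE = """\
# lines starting with '#' and empty lines are ignored

Operator-Email = you@example.org
<outer>
  depth = 1
  <inner>
    depth = 2
  </inner>
</outer>
"""

OPEN = re.compile(r"^<([^/<>][^<>]*)>$")   # <name>
CLOSE = re.compile(r"^</([^<>]+)>$")       # </name>


def parse_config(text):
    """Parse name/value pairs and nested sections into nested dicts."""
    root = {}
    stack = [("", root)]  # (section name, dict being filled)
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # empty line or comment
        closing = CLOSE.match(line)
        if closing:
            name = closing.group(1).strip()
            if len(stack) == 1 or stack[-1][0] != name:
                raise ValueError(f"line {lineno}: unexpected </{name}>")
            stack.pop()
            continue
        opening = OPEN.match(line)
        if opening:
            name = opening.group(1).strip()
            child = stack[-1][1].setdefault(name, {})
            stack.append((name, child))
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"line {lineno}: expected 'name = value'")
        stack[-1][1][name.strip()] = value.strip()
    if len(stack) != 1:
        raise ValueError(f"unclosed section <{stack[-1][0]}>")
    return root


config = parse_config(SAMPLE)
```

All values are kept as strings; a nested variable is reached as config["outer"]["inner"]["depth"].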

The most important configuration variables are the complex variables <url><allow> (allows certain URLs to be harvested) and <url><exclude> (excludes certain URLs from harvesting). Together they limit your crawl to a section of the WWW, based on the URL. When URLs are loaded into the system for crawling, each URL is first checked against the Perl regular expressions of <url><allow>. If it matches, it is then checked against those of <url><exclude>: a URL that matches is discarded, otherwise it is scheduled for crawling (see section 4.3 ’URL filtering’).
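The allow-then-exclude check can be sketched as follows. This is an illustration only, written with Python's re module rather than Perl regular expressions; the function name is_scheduled and the example patterns are hypothetical, and it assumes that a URL matching no allow pattern is not scheduled.

```python
import re

# Hypothetical patterns, for illustration only.
ALLOW = [r"^https?://www\.example\.org/"]
EXCLUDE = [r"\.pdf$"]


def is_scheduled(url, allow_patterns, exclude_patterns):
    """Return True if the URL passes the allow list and then the exclude list."""
    # Step 1: the URL must match at least one allow pattern.
    if not any(re.search(p, url) for p in allow_patterns):
        return False  # assumption: unmatched URLs are not scheduled
    # Step 2: an allowed URL is discarded if any exclude pattern matches.
    return not any(re.search(p, url) for p in exclude_patterns)


in_scope = is_scheduled("http://www.example.org/docs/index.html", ALLOW, EXCLUDE)
excluded = is_scheduled("http://www.example.org/docs/paper.pdf", ALLOW, EXCLUDE)
off_site = is_scheduled("http://other.example.com/", ALLOW, EXCLUDE)
```

With these patterns only the first URL is scheduled: the second is allowed but then excluded, and the third never matches the allow list.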

3.1 Configuration files

All configuration files are stored in the /etc/combine/ directory tree. All configuration variables have reasonable defaults (section 9).

3.1.1 Templates

The following template files are provided:

job_default.cfg
contains job-specific defaults. It is copied by combineINIT to a subdirectory named after the job.
SQLstruct.sql
contains the structure of the internal SQL database, which is used both for administration and for holding data records. Details are in section A.4.
Topic_*
contains various contributed topic definitions.

3.1.2 Global configuration files

These files hold global parameters for all crawler jobs.

default.cfg
contains the global defaults. It is loaded first. Consult section 9 and appendix A.3 for details. Values can be overridden in the job-specific configuration file combine.cfg.
tidy.cfg
contains the configuration for Tidy cleaning of HTML code.
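The way values from combine.cfg override the global defaults from default.cfg can be pictured as a dictionary merge. This is a simplified sketch, not Combine's loading code: the function name effective_config, the variable example-setting and all values are hypothetical, and it assumes flat name/value pairs (how nested sections are merged is not covered here).

```python
def effective_config(global_defaults, job_config):
    """Return the global defaults with job-specific values laid on top."""
    merged = dict(global_defaults)  # loaded first
    merged.update(job_config)       # loaded second, overrides on name clash
    return merged


# Hypothetical contents standing in for default.cfg and combine.cfg.
GLOBAL_DEFAULTS = {
    "Operator-Email": "placeholder@example.org",
    "example-setting": "global value",
}
JOB_CONFIG = {
    "Operator-Email": "you@example.org",
}

merged_config = effective_config(GLOBAL_DEFAULTS, JOB_CONFIG)
```

Here Operator-Email takes the job-specific value, while example-setting keeps its global default because the job configuration does not mention it.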

3.1.3 Job specific configuration files

The program combineINIT creates a job-specific subdirectory in /etc/combine and populates it with some files, including combine.cfg, which is initialized as a copy of job_default.cfg. You should always change the value of the variable Operator-Email in this file to something reasonable, since Combine uses it to identify you to the crawled Web servers.

The job name has to be given to all programs at startup using the --jobname switch.

combine.cfg
contains the job-specific configuration. It is loaded second and overrides the global defaults. Consult section 9 and appendix A.3 for details.
topicdefinition.txt
contains the topic definition for focused crawl if the --topic switch is given to combineINIT. The format of this file is described in section 4.5.1.
stopwords.txt
a file with words to be excluded from the automatic topic classification processing, one word per line. The file can be empty (the default) but must be present.
config_exclude
contains more exclude patterns. Optional, automatically included by combine.cfg. Updated by combineUtil.
config_serveralias
contains patterns for resolving Web server aliases. Optional, automatically included by combine.cfg. Updated by combineUtil.
sitesOK.txt
optionally used by the built-in automated classification algorithms (section 4.5) to bypass the topic filter for certain sites.
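The rule for stopwords.txt (one word per line, possibly empty, but the file must exist) can be sketched as a small loader. This is an illustration, not Combine's own code: the function name load_stopwords is hypothetical, and UTF-8 encoding and the skipping of blank lines are assumptions.

```python
from pathlib import Path


def load_stopwords(path):
    """Read one stopword per line and return them as a set.

    An empty file yields an empty set. A missing file is an error:
    Path.read_text raises FileNotFoundError in that case.
    """
    text = Path(path).read_text(encoding="utf-8")  # assumption: UTF-8
    return {line.strip() for line in text.splitlines() if line.strip()}
```

Letting the missing-file error propagate mirrors the requirement that the file be present even when no stopwords are defined.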

3.1.4 Details and default values

Further details are found in section 9 ’Configuration variables’, which lists all variables and their default values.