3 Configuration

Configuration files use a simple format consisting of either name/value pairs or complex variables in sections. Name/value pairs are encoded as single lines formatted like ’name = value’. Complex variables are encoded as multiple lines in named sections delimited as in XML, using ’<name> ... </name>’. Sections may be nested to group related configuration variables. Empty lines and lines starting with ’#’ (comments) are ignored.
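As an illustration of the format described above, here is a minimal Python sketch of a reader for it. It is not Combine's own parser. The function name parse_config, the "_lines" key and the sample patterns are hypothetical, and the sketch assumes that any line inside a section that is not a name/value pair is one raw entry of a complex variable.

```python
import re

SECTION_OPEN = re.compile(r"<([\w-]+)>")
SECTION_CLOSE = re.compile(r"</([\w-]+)>")
NAME_VALUE = re.compile(r"([\w-]+)\s*=\s*(.*)")


def parse_config(text):
    """Parse name/value pairs and nested <name> ... </name> sections into dicts."""
    root = {}
    stack = [(None, root)]  # (section name, dict currently being filled)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # empty lines and comments are ignored
        closing = SECTION_CLOSE.fullmatch(line)
        if closing:
            name, _ = stack.pop()
            if name != closing.group(1):
                raise ValueError(f"unexpected </{closing.group(1)}>")
            continue
        opening = SECTION_OPEN.fullmatch(line)
        if opening:
            child = stack[-1][1].setdefault(opening.group(1), {})
            stack.append((opening.group(1), child))
            continue
        current = stack[-1][1]
        pair = NAME_VALUE.fullmatch(line)
        if pair:
            current[pair.group(1)] = pair.group(2)
        else:
            # Assumption: any other line is one raw entry of a complex variable.
            current.setdefault("_lines", []).append(line)
    if len(stack) != 1:
        raise ValueError(f"section <{stack[-1][0]}> is not closed")
    return root


# Hypothetical sample file; the address and patterns are placeholders.
SAMPLE = r"""
# Hypothetical job configuration
Operator-Email = you@example.org

<url>
<allow>
^http://www\.example\.org/
</allow>
<exclude>
\.(jpg|gif)$
</exclude>
</url>
"""

config = parse_config(SAMPLE)
```

In this sketch nested sections become nested dictionaries, so the entries of <url><allow> end up under config["url"]["allow"]["_lines"].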

The most important configuration variables are the complex variables <url><allow> (allows certain URLs to be harvested) and <url><exclude> (excludes certain URLs from harvesting). Together they limit your crawl to just a section of the WWW, based on the URL. When URLs to be crawled are loaded into the system, each URL is first checked against the Perl regular expressions of <url><allow>. If it matches, it is then checked against <url><exclude>: a URL that matches there is discarded, otherwise it is scheduled for crawling (see section 4.3 ’URL filtering’).
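The allow-then-exclude check can be sketched in a few lines of Python. This is only an illustration of the order of the two tests. It uses Python's re module rather than Perl regular expressions, the function name should_schedule and the patterns are hypothetical, and it assumes that a URL matching no allow pattern is not scheduled.

```python
import re


def should_schedule(url, allow_patterns, exclude_patterns):
    """Return True if url matches some allow pattern and no exclude pattern."""
    if not any(re.search(p, url) for p in allow_patterns):
        return False  # not allowed: the exclude check is never reached
    if any(re.search(p, url) for p in exclude_patterns):
        return False  # allowed but excluded: discarded
    return True  # allowed and not excluded: scheduled for crawling


# Hypothetical patterns standing in for <url><allow> and <url><exclude>.
allow = [r"^http://www\.example\.org/"]
exclude = [r"\.(jpg|gif)$"]

ok = should_schedule("http://www.example.org/index.html", allow, exclude)
image = should_schedule("http://www.example.org/logo.gif", allow, exclude)
foreign = should_schedule("http://other.example.com/index.html", allow, exclude)
```

Here only the first URL passes both tests. The second is allowed but then excluded, and the third never matches an allow pattern.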

3.1 Configuration files

All configuration files are stored in the /etc/combine/ directory tree. All configuration variables have reasonable defaults (section 9).

3.1.1 Templates

The values in

job_default.cfg contains job specific defaults. It is copied to a subdirectory named after the job by combineINIT.
contains structure of the internal SQL database used both for administration and for holding data records. Details in section A.4.
contains various contributed topic definitions.

3.1.2 Global configuration files

Files used for global parameters for all crawler jobs.

is the global defaults. It is loaded first. Consult section 9 and appendix A.3 for details. Values can be overridden from the job-specific configuration file combine.cfg.
configuration for Tidy cleaning of HTML code.

3.1.3 Job specific configuration files

The program combineINIT creates a job specific sub-directory in /etc/combine and populates it with some files, including combine.cfg, which is initialized with a copy of job_default.cfg. You should always change the value of the variable Operator-Email in this file and set it to something reasonable. It is used by Combine to identify you to the crawled Web servers.

The job name has to be given to all programs when they are started, using the --jobname switch.

combine.cfg is the job specific configuration. It is loaded second and overrides the global defaults. Consult section 9 and appendix A.3 for details.
contains the topic definition for focused crawl if the --topic switch is given to combineINIT. The format of this file is described in section 4.5.1.
a file with words to be excluded from the automatic topic classification processing. One word per line. Can be empty (default) but must be present.
contains more exclude patterns. Optional, automatically included by combine.cfg. Updated by combineUtil.
contains patterns for resolving Web server aliases. Optional, automatically included by combine.cfg. Updated by combineUtil.
optionally used by the built-in automated classification algorithms (section 4.5) to bypass the topic filter for certain sites.
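The load order described above (global defaults first, then the job specific combine.cfg, which overrides them) can be sketched as a simple dictionary merge. This is an assumption-laden illustration, not Combine's implementation. The function name effective_config, the variable Example-Setting and all values are hypothetical, and the sketch replaces a variable as a whole when the job specific file sets it.

```python
def effective_config(global_defaults, job_config):
    """Combine two parsed configurations; job specific values win."""
    merged = dict(global_defaults)  # loaded first
    merged.update(job_config)       # loaded second, overrides the defaults
    return merged


# Hypothetical values for illustration only.
global_defaults = {
    "Operator-Email": "unset@example.org",
    "Example-Setting": "default",
}
job_config = {
    "Operator-Email": "you@example.org",
}

settings = effective_config(global_defaults, job_config)
```

A variable that the job specific file does not mention keeps its global default, which is why only the settings that differ need to be changed in combine.cfg.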

3.1.4 Details and default values

Further details are found in section 9 ’Configuration variables’ which lists all variables and their default values.