9 Configuration variables

9.1 Name/value configuration variables

9.1.1 analysePlugin

Used by:
utilPlugIn.pm

9.1.2 AutoRecycleLinks

Default value
= 1
Description:
Enable(1)/disable(0) automatic recycling of new links
Used by:
SD_SQL.pm

9.1.3 baseConfigDir

Default value
= /etc/combine
Description:
Base directory for configuration files; initialized by Config.pm
Used by:
FromHTML.pm; combineExport
Set by:
Config.pm

9.1.4 classifyPlugIn

Default value
= Combine::Check_record
Description:
Topic classification plug-in module to use.
Combine::Check_record and Combine::PosCheck_record are included by default.
New SVM classifier: Combine::classifySVM.
See classifyPlugInTemplate.pm and the documentation to write your own.
Used by:
combine; combineReClassify

9.1.5 configDir

Default value
= NoDefaultValue
Description:
Directory for job-specific configuration files; taken from ’jobname’
Used by:
combineUtil; classifySVM.pm; PosCheck_record.pm; Check_record.pm; combineCountry.pl; utilPlugIn.pm
Set by:
Config.pm

9.1.6 doAnalyse

Default value
= 1
Description:
Enable(1)/disable(0) analysis of genre and language
Used by:
combine

9.1.7 doCheckRecord

Description:
Enable(1)/disable(0) topic classification (focused crawling)
Generated by combineINIT based on the --topic parameter
Used by:
combine; combineReClassify

9.1.8 doOAI

Default value
= 1
Description:
Use(1)/do not use(0) OAI record status keeping in SQL database
Used by:
MySQLhdb.pm

9.1.9 extractLinksFromText

Default value
= 1
Description:
Extract(1)/do not extract(0) links from plain text
Used by:
combine

9.1.10 HarvesterMaxMissions

Default value
= 500
Description:
Number of pages to process before restarting the harvester
Used by:
combine

9.1.11 HarvestRetries

Default value
= 5
Used by:
combine

9.1.12 httpProxy

Default value
= NoDefaultValue
Description:
Use a proxy server if this is defined (default no proxy)
Used by:
UA.pm

9.1.13 LogHandle

Used by:
FromHTML.pm; classifySVM.pm; PosCheck_record.pm; Check_record.pm
Set by:
combine; combineReClassify

9.1.14 Loglev

Description:
Logging level, from 0 (least) to 10 (most)
Used by:
combine

9.1.15 maxUrlLength

Default value
= 250
Description:
Maximum length of a URL; longer URLs are silently discarded
Used by:
selurl.pm

9.1.16 MySQLdatabase

Default value
= NoDefaultValue
Description:
Identifies MySQL database name, user and host
Used by:
Config.pm

9.1.17 MySQLfulltext

Description:
Enable(1)/disable(0) fulltext-index in MySQL table search
Used by:
MySQLhdb.pm

9.1.18 MySQLhandle

Used by:
combineExport; combineCountry.pl; combineUtil; LogSQL.pm; combine; classifySVM.pm; RobotRules.pm; SD_SQL.pm; combineRank; XWI2XML.pm; combineSVM; MySQLhdb.pm; combineReClassify
Set by:
Config.pm

9.1.19 Operator-Email

Default value
= "YourEmailAdress@YourDomain"
Description:
Please change
Used by:
RobotRules.pm; UA.pm

9.1.20 Password

Default value
= "XxXxyYzZ"
Description:
Password not used yet. (Please change)

9.1.21 PattiSpecial

Used by:
combine

9.1.22 relTextPlugin

Used by:
FromHTML.pm

9.1.23 saveHTML

Default value
= 1
Description:
Store(1)/do not store(0) the raw HTML in the database
Used by:
MySQLhdb.pm

9.1.24 SchedulingAlgorithm

Default value
= default
Description:
URL scheduling algorithm

9.1.25 SdqRetries

Default value
= 5

9.1.26 SolrHost

Default value
= NoDefaultValue
Description:
Direct connection to Solr indexing
Used by:
combineExport; MySQLhdb.pm

9.1.27 SummaryLength

Description:
How long the summary should be. Use 0 to disable the summarization code
Used by:
FromHTML.pm

9.1.28 SVMmodel

Default value
= NoDefaultValue
Description:
Filename for the SVM model
Used by:
classifySVM.pm

9.1.29 UAtimeout

Default value
= 30
Description:
Time in seconds to wait for a server to respond
Used by:
UA.pm

9.1.30 UserAgentFollowRedirects

Description:
User agent handles redirects (1) or treats redirects as new links (0)
Used by:
UA.pm

9.1.31 UserAgentGetIfModifiedSince

Default value
= 1
Description:
If we have seen this page before, use Get-If-Modified-Since (1) or not (0)
Used by:
UA.pm

9.1.32 useTidy

Description:
Use(1)/do not use(0) Tidy to clean the HTML before parsing it
Used by:
FromHTML.pm

9.1.33 WaitIntervalExpirationGuaranteed

Default value
= 315360000
Used by:
UA.pm

9.1.34 WaitIntervalHarvesterLockNotFound

Default value
= 2592000
Used by:
combine

9.1.35 WaitIntervalHarvesterLockNotModified

Default value
= 2592000
Used by:
combine

9.1.36 WaitIntervalHarvesterLockRobotRules

Default value
= 2592000
Used by:
combine

9.1.37 WaitIntervalHarvesterLockSuccess

Default value
= 1000000
Description:
Time in seconds after a successful download before a page may be downloaded again (about 11.6 days)
Used by:
combine

9.1.38 WaitIntervalHarvesterLockUnavailable

Default value
= 86400
Used by:
combine

9.1.39 WaitIntervalHost

Default value
= 60
Description:
Minimum time between accesses to the same host. Must be positive
Used by:
SD_SQL.pm
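
The per-host minimum interval can be pictured with a small Python sketch. This is not the code in SD_SQL.pm; the class name HostThrottle, its methods, and the assumption that the value is given in seconds are all illustrative.

```python
import time


class HostThrottle:
    """Illustrative per-host throttle; not Combine's actual scheduler."""

    def __init__(self, wait_interval_host=60, clock=time.monotonic):
        # The description above requires a positive value.
        if wait_interval_host <= 0:
            raise ValueError("WaitIntervalHost must be positive")
        self.interval = wait_interval_host  # assumed to be seconds
        self.clock = clock
        self.last_access = {}  # host -> time of the last granted access

    def try_acquire(self, host):
        """Return True and record the access if the host may be visited now."""
        now = self.clock()
        last = self.last_access.get(host)
        if last is not None and now - last < self.interval:
            return False  # too soon since the previous access to this host
        self.last_access[host] = now
        return True
```

A crawler loop would call try_acquire(host) before each fetch and put the URL back in its queue when the call returns False.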

9.1.40 WaitIntervalRrdLockDefault

Default value
= 86400
Used by:
RobotRules.pm

9.1.41 WaitIntervalRrdLockNotFound

Default value
= 345600
Used by:
RobotRules.pm

9.1.42 WaitIntervalRrdLockSuccess

Default value
= 345600
Used by:
RobotRules.pm

9.1.43 WaitIntervalSchedulerGetJcf

Default value
= 20
Description:
Time in seconds to wait before a new reschedule if a reschedule results in an empty ready queue
Used by:
combine
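
Several WaitInterval* defaults above are large numbers with no description. The Python sketch below converts them to readable durations. It assumes every value is in seconds, which the text states only for WaitIntervalHarvesterLockSuccess and WaitIntervalSchedulerGetJcf; the helper name as_duration is illustrative.

```python
from datetime import timedelta

# Default values copied from the WaitInterval* entries above.
# Assumption: all of them are expressed in seconds.
WAIT_INTERVAL_DEFAULTS = {
    "WaitIntervalExpirationGuaranteed": 315360000,
    "WaitIntervalHarvesterLockNotFound": 2592000,
    "WaitIntervalHarvesterLockNotModified": 2592000,
    "WaitIntervalHarvesterLockRobotRules": 2592000,
    "WaitIntervalHarvesterLockSuccess": 1000000,
    "WaitIntervalHarvesterLockUnavailable": 86400,
    "WaitIntervalHost": 60,
    "WaitIntervalRrdLockDefault": 86400,
    "WaitIntervalRrdLockNotFound": 345600,
    "WaitIntervalRrdLockSuccess": 345600,
    "WaitIntervalSchedulerGetJcf": 20,
}


def as_duration(seconds):
    """Convert a number of seconds to a timedelta for readable printing."""
    return timedelta(seconds=seconds)


if __name__ == "__main__":
    for name, seconds in WAIT_INTERVAL_DEFAULTS.items():
        print(f"{name:40s} {seconds:>10d} s = {as_duration(seconds)}")
```

Under that assumption, 315360000 is 3650 days (roughly ten years), 2592000 is 30 days, 345600 is 4 days, and 86400 is 1 day.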

9.1.44 ZebraHost

Default value
= NoDefaultValue
Description:
Direct connection to Zebra indexing - for SearchEngine-in-a-box (default no connection)
Used by:
combineExport; MySQLhdb.pm

9.2 Complex configuration variables

9.2.1 allow

Description:
Use either URL or HOST: (note the ’:’) to match regular expressions against
either the full URL or the host part of a URL.
Allow crawl of URLs or hostnames that match these regular expressions
Used by:
selurl.pm
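
The URL / HOST: distinction can be sketched in Python as follows. This is an illustration, not the logic in selurl.pm. The sample rules, the function names, and the choice to treat a URL as allowed when any rule matches are assumptions.

```python
import re
from urllib.parse import urlsplit

# Hypothetical rules in the "URL <regex>" / "HOST: <regex>" form described above.
ALLOW_RULES = [
    r"HOST: \.example\.org$",
    r"URL ^https?://docs\.example\.com/manual/",
]


def parse_rule(line):
    """Split a rule into (target, compiled regex); target is 'HOST' or 'URL'."""
    keyword, pattern = line.split(None, 1)
    target = keyword.rstrip(":").upper()
    if target not in ("HOST", "URL"):
        raise ValueError(f"unknown rule type: {keyword!r}")
    return target, re.compile(pattern.strip())


def is_allowed(url, rules):
    """Return True if any rule matches the full URL or its host part."""
    host = urlsplit(url).hostname or ""
    for target, regex in map(parse_rule, rules):
        subject = host if target == "HOST" else url
        if regex.search(subject):
            return True
    return False
```

With these sample rules, http://www.example.org/a is allowed through the HOST: rule, while http://other.example.net/ matches neither rule.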

9.2.2 binext

Description:
Extensions of binary files
Used by:
UA.pm

9.2.3 converters

Description:
Configure which converters can be used to produce an XWI object
Format:
One line per entry.
Each entry consists of 3 ’;’-separated fields:
mime-type ; External converter command ; Internal converter
Entries are processed in order and the first match is executed.
An external converter must be found via PATH and be executable to be considered a match.
The external converter command should take a filename as parameter and convert that file;
the result should be written to STDOUT.
Used by:
UA.pm; combine
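
The first-match rule for converter entries can be sketched in Python as below. This is a reading of the format description, not the code in UA.pm or combine. The function names are illustrative, and treating an entry with an empty external command as an immediate match is an assumption.

```python
import shlex
import shutil


def parse_converters(text):
    """Parse 'mime-type ; external command ; internal converter' lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = [field.strip() for field in line.split(";")]
        if len(fields) != 3:
            raise ValueError(f"expected 3 ';'-separated fields: {line!r}")
        entries.append(tuple(fields))
    return entries


def select_converter(mime_type, entries, which=shutil.which):
    """Return the first entry for mime_type whose external command is usable."""
    for mime, external, internal in entries:
        if mime != mime_type:
            continue
        if external:
            program = shlex.split(external)[0]
            if which(program) is None:
                continue  # not on PATH or not executable: try the next entry
        # Assumption: an empty external field means no external tool is needed.
        return mime, external, internal
    return None
```

shutil.which returns None when a program is missing from PATH or is not executable, which corresponds to the two conditions named in the format description.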

9.2.4 exclude

Description:
Exclude URLs or hostnames that match these regular expressions
default: CGI and maps
default: binary files
default: Unparsable documents
default: images
default: other binary formats
more excludes in the file config_exclude (automatically updated by other programs)
Used by:
selurl.pm

9.2.5 serveralias

Description:
The list of server names that are aliases is in the file ./config_serveralias
(automatically updated by other programs).
Use one line per server: the preferred name first, followed by its alias.
Example:
www.100topwetland.com www.100wetland.com
means that www.100wetland.com is replaced by www.100topwetland.com during URL normalization
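
The alias replacement in the example can be sketched in Python as follows. This is an illustration, not Combine's URL normalization code. The function names are hypothetical, and accepting more than one alias per line is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def parse_aliases(text):
    """Map every alias host to the first (preferred) host on its line."""
    canonical = {}
    for line in text.splitlines():
        names = line.split()
        if len(names) < 2 or names[0].startswith("#"):
            continue  # nothing to map on this line
        for alias in names[1:]:
            canonical[alias.lower()] = names[0].lower()
    return canonical


def normalize_host(url, canonical):
    """Replace an alias host in url with its preferred server name."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in canonical:
        return url
    netloc = canonical[host]
    if parts.port is not None:
        netloc += f":{parts.port}"  # keep an explicit port; userinfo is dropped
    return urlunsplit(parts._replace(netloc=netloc))
```

For the example line above, normalize_host("http://www.100wetland.com/page?x=1", aliases) returns "http://www.100topwetland.com/page?x=1".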

9.2.6 sessionids

Description:
Patterns to recognize and remove session IDs in URLs
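
Pattern-based session ID removal can be sketched in Python as below. The two patterns are hypothetical examples, not the patterns shipped with Combine, and strip_session_ids is an illustrative name.

```python
import re

# Hypothetical patterns; the real ones come from the sessionids configuration.
SESSION_ID_PATTERNS = [
    r"(?i);jsessionid=[0-9a-z]+",
    r"(?i)(?<=[?&])(?:phpsessid|sid)=[0-9a-f]+&?",
]


def strip_session_ids(url, patterns=SESSION_ID_PATTERNS):
    """Remove every match of the session ID patterns from url."""
    for pattern in patterns:
        url = re.sub(pattern, "", url)
    # Drop a '?' or '&' left dangling at the end after a removal.
    return url.rstrip("?&")
```

Removing session IDs before URLs are compared means the same page reached through different sessions is treated as one URL.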

9.2.7 url

Description:
url is just a container for all URL-related configuration patterns
Used by:
Config.pm; selurl.pm