Contents

I  Overview
1 Introduction
2 Open source distribution, installation
 2.1 Installation
  2.1.1 Installation from source for the impatient
  2.1.2 Porting to unsupported operating systems – dependencies
  2.1.3 Automated Debian/Ubuntu installation
  2.1.4 Manual installation
  2.1.5 Out-of-the-box installation test
 2.2 Getting started
 2.3 Online documentation
 2.4 Use scenarios
  2.4.1 General crawling without restrictions
  2.4.2 Focused crawling – domain restrictions
  2.4.3 Focused crawling – topic specific
  2.4.4 Focused crawling in an Alvis system
  2.4.5 Crawl one entire site and its outlinks
3 Configuration
 3.1 Configuration files
  3.1.1 Templates
  3.1.2 Global configuration files
  3.1.3 Job specific configuration files
  3.1.4 Details and default values
4 Crawler internal operation
 4.1 URL selection criteria
 4.2 Document parsing and information extraction
 4.3 URL filtering
 4.4 Crawling strategy
 4.5 Built-in topic filter – automated subject classification using string matching
  4.5.1 Topic definition
  4.5.2 Topic definition (term triplets) BNF grammar
  4.5.3 Term triplet examples
  4.5.4 Algorithm 1: plain matching
  4.5.5 Algorithm 2: position weighted matching
 4.6 Built-in topic filter – automated subject classification using SVM
 4.7 Topic filter Plug-In API
 4.8 Analysis
 4.9 Duplicate detection
 4.10 URL recycling
 4.11 Database cleaning
 4.12 Complete application – SearchEngine in a Box
5 Evaluation of automated subject classification
 5.1 Approaches to automated classification
  5.1.1 Description of the used string-matching algorithm
 5.2 Evaluation methodology
  5.2.1 Evaluation challenge
  5.2.2 Evaluation measures used
  5.2.3 Data collection
 5.3 Results
  5.3.1 The role of different thesauri terms
  5.3.2 Enriching the term list using natural language processing
  5.3.3 Importance of HTML structural elements and metadata
  5.3.4 Challenges and recommendations for classification of Web pages
  5.3.5 Comparing and combining two approaches
6 Performance and scalability
 6.1 Speed
 6.2 Space
 6.3 Crawling strategy
7 System components
 7.1 combineINIT
 7.2 combineCtrl
 7.3 combineUtil
 7.4 combineExport
 7.5 Internal executables and Library modules
  7.5.1 Library
II  Gory details
8 Frequently asked questions
9 Configuration variables
 9.1 Name/value configuration variables
  9.1.1 analysePlugin
  9.1.2 AutoRecycleLinks
  9.1.3 baseConfigDir
  9.1.4 classifyPlugIn
  9.1.5 configDir
  9.1.6 doAnalyse
  9.1.7 doCheckRecord
  9.1.8 doOAI
  9.1.9 extractLinksFromText
  9.1.10 HarvesterMaxMissions
  9.1.11 HarvestRetries
  9.1.12 httpProxy
  9.1.13 LogHandle
  9.1.14 Loglev
  9.1.15 maxUrlLength
  9.1.16 MySQLdatabase
  9.1.17 MySQLfulltext
  9.1.18 MySQLhandle
  9.1.19 Operator-Email
  9.1.20 Password
  9.1.21 PattiSpecial
  9.1.22 relTextPlugin
  9.1.23 saveHTML
  9.1.24 SchedulingAlgorithm
  9.1.25 SdqRetries
  9.1.26 SolrHost
  9.1.27 SummaryLength
  9.1.28 SVMmodel
  9.1.29 UAtimeout
  9.1.30 UserAgentFollowRedirects
  9.1.31 UserAgentGetIfModifiedSince
  9.1.32 useTidy
  9.1.33 WaitIntervalExpirationGuaranteed
  9.1.34 WaitIntervalHarvesterLockNotFound
  9.1.35 WaitIntervalHarvesterLockNotModified
  9.1.36 WaitIntervalHarvesterLockRobotRules
  9.1.37 WaitIntervalHarvesterLockSuccess
  9.1.38 WaitIntervalHarvesterLockUnavailable
  9.1.39 WaitIntervalHost
  9.1.40 WaitIntervalRrdLockDefault
  9.1.41 WaitIntervalRrdLockNotFound
  9.1.42 WaitIntervalRrdLockSuccess
  9.1.43 WaitIntervalSchedulerGetJcf
  9.1.44 ZebraHost
 9.2 Complex configuration variables
  9.2.1 allow
  9.2.2 binext
  9.2.3 converters
  9.2.4 exclude
  9.2.5 serveralias
  9.2.6 sessionids
  9.2.7 url
10 Module dependencies
 10.1 Programs
  10.1.1 Check_record.pm.svn-base
  10.1.2 CleanXML2CanDoc.pm.svn-base
  10.1.3 Config.pm.svn-base
  10.1.4 DataBase.pm.svn-base
  10.1.5 FromHTML.pm.svn-base
  10.1.6 FromImage.pm.svn-base
  10.1.7 HTMLExtractor.pm.svn-base
  10.1.8 LoadTermList.pm.svn-base
  10.1.9 LogSQL.pm.svn-base
  10.1.10 Matcher.pm.svn-base
  10.1.11 MySQLhdb.pm.svn-base
  10.1.12 PosCheck_record.pm.svn-base
  10.1.13 PosMatcher.pm.svn-base
  10.1.14 RobotRules.pm.svn-base
  10.1.15 SD_SQL.pm.svn-base
  10.1.16 Solr.pm.svn-base
  10.1.17 UA.pm.svn-base
  10.1.18 XWI.pm.svn-base
  10.1.19 XWI2XML.pm.svn-base
  10.1.20 Zebra.pm.svn-base
  10.1.21 classifySVM.pm.svn-base
  10.1.22 combine
  10.1.23 combine.svn-base
  10.1.24 combineCountry.pl
  10.1.25 combineCountry.pl.svn-base
  10.1.26 combineCtrl
  10.1.27 combineCtrl.svn-base
  10.1.28 combineExport
  10.1.29 combineExport.svn-base
  10.1.30 combineINIT
  10.1.31 combineINIT.svn-base
  10.1.32 combineRank
  10.1.33 combineRank.svn-base
  10.1.34 combineReClassify
  10.1.35 combineReClassify.svn-base
  10.1.36 combineSVM
  10.1.37 combineSVM.svn-base
  10.1.38 combineUtil
  10.1.39 combineUtil.svn-base
  10.1.40 selurl.pm.svn-base
  10.1.41 utilPlugIn.pm.svn-base
 10.2 Library modules
  10.2.1 Check_record.pm
  10.2.2 CleanXML2CanDoc.pm
  10.2.3 Config.pm
  10.2.4 DataBase.pm
  10.2.5 FromHTML.pm
  10.2.6 FromImage.pm
  10.2.7 HTMLExtractor.pm
  10.2.8 LoadTermList.pm
  10.2.9 LogSQL.pm
  10.2.10 Matcher.pm
  10.2.11 MySQLhdb.pm
  10.2.12 PosCheck_record.pm
  10.2.13 PosMatcher.pm
  10.2.14 RobotRules.pm
  10.2.15 SD_SQL.pm
  10.2.16 Solr.pm
  10.2.17 UA.pm
  10.2.18 XWI.pm
  10.2.19 XWI2XML.pm
  10.2.20 Zebra.pm
  10.2.21 classifySVM.pm
  10.2.22 selurl.pm
  10.2.23 utilPlugIn.pm
 10.3 External modules
III  
A APPENDIX
 A.1 Simple installation test
  A.1.1 InstallationTest.pl
 A.2 Example topic filter plug-in
  A.2.1 classifyPlugInTemplate.pm
 A.3 Default configuration files
  A.3.1 Global
  A.3.2 Job specific
 A.4 SQL database
  A.4.1 Create database
  A.4.2 Creating MySQL tables
  A.4.3 Data tables
  A.4.4 Administrative tables
  A.4.5 Create user dbuser with required privileges
 A.5 Manual pages
  A.5.1 combineExport
  A.5.2 combineCtrl
  A.5.3 combineRun
  A.5.4 combineReClassify
  A.5.5 combineSVM
  A.5.6 combineRank
  A.5.7 combineUtil
  A.5.8 combine
  A.5.9 Combine::PosMatcher
  A.5.10 Combine::selurl
  A.5.11 Combine::XWI
  A.5.12 Combine::Matcher
  A.5.13 Combine::FromTeX
  A.5.14 Combine::utilPlugIn
  A.5.15 Combine::SD_SQL
  A.5.16 Combine::FromHTML
  A.5.17 Combine::RobotRules
  A.5.18 Combine::HTMLExtractor
  A.5.19 Combine::LoadTermList
  A.5.20 Combine::classifySVM