Documentation for the Combine (focused) crawling system

Anders Ardö, Koraljka Golub

June 16, 2009

Contents
I  Overview
1 Introduction
2 Open source distribution, installation
 2.1 Installation
 2.2 Getting started
 2.3 Online documentation
 2.4 Use scenarios
3 Configuration
 3.1 Configuration files
4 Crawler internal operation
 4.1 URL selection criteria
 4.2 Document parsing and information extraction
 4.3 URL filtering
 4.4 Crawling strategy
 4.5 Built-in topic filter – automated subject classification using string matching
 4.6 Built-in topic filter – automated subject classification using SVM
 4.7 Topic filter Plug-In API
 4.8 Analysis
 4.9 Duplicate detection
 4.10 URL recycling
 4.11 Database cleaning
 4.12 Complete application – SearchEngine in a Box
5 Evaluation of automated subject classification
 5.1 Approaches to automated classification
 5.2 Evaluation methodology
 5.3 Results
6 Performance and scalability
 6.1 Speed
 6.2 Space
 6.3 Crawling strategy
7 System components
 7.1 combineINIT
 7.2 combineCtrl
 7.3 combineUtil
 7.4 combineExport
 7.5 Internal executables and Library modules
References
II  Gory details
8 Frequently asked questions
9 Configuration variables
 9.1 Name/value configuration variables
 9.2 Complex configuration variables
10 Module dependences
 10.1 Programs
 10.2 Library modules
 10.3 External modules
III  
A APPENDIX
 A.1 Simple installation test
 A.2 Example topic filter plug-in
 A.3 Default configuration files
 A.4 SQL database
 A.5 Manual pages