1 Introduction

The Combine system is an open, free, and highly configurable system for focused crawling of Internet resources. It aims to provide a robust and efficient tool for building topic-specific, moderate-sized databases (up to a few million records). Crawling speed is around 200 URLs per minute, and a complete structured record takes up an average of 25 kilobytes of disk space.
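To get a feel for what these figures imply, the following back-of-the-envelope sketch estimates crawl time and disk usage for a database of a given size. It is not part of Combine. It assumes that 1 kilobyte is 1000 bytes and that every fetched URL yields one stored record; in a focused crawl many fetched pages are discarded, so the time estimate is best read as a lower bound. The function name `crawl_estimate` is made up for this illustration.

```python
# Rough sizing estimate based on the figures quoted above.
# Assumptions (not from Combine itself): 1 kB = 1000 bytes, and
# every fetched URL produces exactly one stored record.
URLS_PER_MINUTE = 200
BYTES_PER_RECORD = 25_000


def crawl_estimate(records):
    """Return (days of crawling, gigabytes of disk) for `records` records."""
    minutes = records / URLS_PER_MINUTE
    days = minutes / (60 * 24)
    gigabytes = records * BYTES_PER_RECORD / 1e9
    return days, gigabytes


if __name__ == "__main__":
    days, gigabytes = crawl_estimate(1_000_000)
    print(f"1 million records: about {days:.1f} days, {gigabytes:.0f} GB")
```

Under these assumptions, one million records correspond to roughly three and a half days of continuous crawling and about 25 GB of disk space.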


Figure 1: Overview of the Combine focused crawler.


Main features include:

Naturally, it obeys the Robots Exclusion Protocol and behaves politely toward Web servers. Besides focused crawls (generating topic-specific databases), Combine supports configurable rules on what is crawled, based on regular expressions applied to URLs (the URL focus filter). The crawler is designed to run continuously in order to keep crawled databases as up to date as possible. It can be stopped and restarted at any time without losing any status or information.
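The two URL-level checks described above can be illustrated with a minimal sketch. This is not Combine's implementation and does not use its configuration syntax: the patterns, the function names `url_in_focus` and `robots_allows`, and the `example.org` host are all assumptions chosen for illustration. The robots.txt check uses Python's standard `urllib.robotparser` module.

```python
import re
from urllib.robotparser import RobotFileParser

# Hypothetical URL focus rules; Combine's real rule syntax is not shown here.
ALLOW_PATTERNS = [re.compile(r"^https?://([^/]+\.)?example\.org/")]
EXCLUDE_PATTERNS = [re.compile(r"\.(jpg|png|zip)$", re.IGNORECASE)]


def url_in_focus(url):
    """Accept a URL only if no exclude rule matches and some allow rule does."""
    if any(p.search(url) for p in EXCLUDE_PATTERNS):
        return False
    return any(p.search(url) for p in ALLOW_PATTERNS)


def robots_allows(robots_txt, user_agent, url):
    """Check a URL against the text of an already downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    robots = "User-agent: *\nDisallow: /private/\n"
    for url in (
        "http://www.example.org/docs/page.html",
        "http://www.example.org/private/notes.html",
        "http://www.example.org/img/logo.PNG",
    ):
        ok = url_in_focus(url) and robots_allows(robots, "sketch-bot", url)
        print(url, "->", "crawl" if ok else "skip")
```

In this sketch a URL is fetched only when both checks pass, so the regular-expression rules narrow the crawl while robots.txt is always respected.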

The operation of Combine (overview in Figure 1) as a focused crawler is based on a combination of a general Web crawler and an automated subject classifier. The topic focus is provided by a focus filter using a topic definition implemented as a thesaurus, where each term is connected to a topic class.
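The idea of a topic definition in which each term is connected to a topic class can be sketched as a simple term lookup. The example below is an illustration under stated assumptions, not Combine's actual classifier: the terms, the class names, the plain hit counting, and the `min_hits` threshold are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical topic definition: each term maps to one topic class.
TOPIC_DEFINITION = {
    "carnivorous plant": "plants.carnivorous",
    "sundew": "plants.carnivorous.drosera",
    "drosera": "plants.carnivorous.drosera",
    "nepenthes": "plants.carnivorous.nepenthes",
}


def classify(text, topic_definition=TOPIC_DEFINITION):
    """Count whole-word term matches per topic class (case-insensitive)."""
    lowered = text.lower()
    scores = Counter()
    for term, topic_class in topic_definition.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = len(re.findall(pattern, lowered))
        if hits:
            scores[topic_class] += hits
    return scores


def in_topic_focus(text, min_hits=2):
    """Assumed acceptance rule: keep a page with at least `min_hits` term hits."""
    return sum(classify(text).values()) >= min_hits


if __name__ == "__main__":
    page = "The sundew (Drosera) is a carnivorous plant."
    print(classify(page))
    print(in_topic_focus(page))
```

A page that passes such a filter would be stored together with its matched classes, while its links are followed further; a page that fails would be dropped from the focused crawl.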

Crawled data are stored as structured records in a local relational database.

Section 2 outlines how to download, install, and test the Combine system, and includes use scenarios that are useful for getting a quick start with the system.

Section 3 discusses configuration structure and highlights a few important configuration variables.

Section 4 describes policies and methods used by the crawler.

Evaluation and performance are treated in Sections 5 and 6.

The system has a number of components (see Section 7). The main ones visible to the user are combineCtrl, which is used to start and stop crawling and to view crawler status, and combineExport, which extracts crawled data from the internal database and exports them as XML records.

Further details (lots and lots of them) can be found in part II ’Gory details’ and in Appendix A.