4 Crawler internal operation

The system is designed for continuous operation. The harvester processes a URL in several steps as detailed in Figure 2. As a start-up initialization the frontier has to be seeded with some relevant URLs. All URLs are normalized before they are entered in the database. Data can be exported in various formats including the ALVIS XML document format13 and Dublin Core14 records.


Figure 2: Architecture for the Combine focused crawler.

The steps taken during crawling (numbers refer to Figure 2):

  1. The next URL is fetched from the scheduler.
  2. Combine obeys the Robots Exclusion Protocol15. Rules are cached locally.
  3. The page is retrieved using a GET, GET-IF-MODIFIED, or HEAD HTTP request.
  4. The HTML code is cleaned and normalized.
  5. The character-set is detected and normalized to UTF-8.
    1. The page (in any of the formats PDF, PostScript, MSWord, MSExcel, MSPowerPoint, RTF and TeX/LaTeX) is converted to HTML or plain text by an external program.
    2. Internal parsers handles HTML, plain text and images. This step extracts structured information like metadata (title, keywords, description ...), HTML links, and text without markup.
  6. The document is sent to the topic filter (see section 4.5). If the Web-page is relevant with respect to the focus topic, processing continues with:
    1. Heuristics like score propagation.
    2. Further analysis, like genre and language identification.
    3. Updating the record database.
    4. Updating the frontier database with HTML links and URLs extracted from plain text.

Depending on several factors like configuration, hardware, network, workload, the crawler normally processes between 50 and 200 URLs per minute.

4.1 URL selection criteria

In order to successfully select and crawl one URL the following conditions (in this order) have to be met:

  1. The URL has to be selected by the scheduling algorithm (section 4.4).

    Relevant configuration variables: WaitIntervalHost (section 9.1.39), WaitIntervalHarvesterLockRobotRules (section 9.1.36), WaitIntervalHarvesterLockSuccess (section 9.1.37)

  2. The URL has to pass the allow test.

    Relevant configuration variables: allow (section 9.2.1)

  3. The URL is not be excluded by the exclude test (see section 4.3).

    Relevant configuration variables: exclude (section 9.2.4)

  4. The Robot Exclusion Protocol has to allow crawling of the URL.
  5. Optionally the document at the URL location has to pass the topic filter (section 4.5).

    Relevant configuration variables: classifyPlugIn (section 9.1.4), doCheckRecord (section 9.1.7).

4.2 Document parsing and information extraction

Each document is parsed and analyzed by the crawler in order to store structured document records in the internal MySQL database. The structure of the record includes the fields:

Optional some extra analysis can be done, see section 4.8.

The system selects a document parser based on the Mime-Type together with available parsers and converter programs.

  1. For some mime-types an external program is called in order to convert the document to a format handled internally (HTML or plain text).

    Relevant configuration variables: converters (section 9.2.3)

  2. Internal parsers handle HTML, plain text, TeX, and Image.

    Relevant configuration variables: converters (section 9.2.3)

Supporting a new document format is as easy as providing a program that can convert a document in this format to HTML or plain text. Configuration of the mapping between document format (Mime-Type) and converter program is done in the complex configuration variable ’converters’ (section 9.2.3).

Out of the box Combine handle the following document formats: plain text, HTML, PDF, PostScript, MSWord, MSPowerPoint, MSExcel, RTF, TeX, and images.

4.3 URL filtering

Before an URL is accepted for scheduling (either by manual loading or recycling) it is normalized and validated. This process comprises a number of steps:

4.4 Crawling strategy

The crawler is designed to run continuously in order to keep crawled databases as up-to-date as possible. Starting and halting crawling is done manually. The configuration variable AutoRecycleLinks (section 9.1.2) determines if the crawler should follow all valid new links or just take those that already are marked for crawling.

All links from a relevant document are extracted, normalized and stored in the structured record. Those links that pass the selection/validation criteria outlined below are marked for crawling.

To mark a URL for crawling requires:

At each scheduling point one URL from each available (unlocked) host is selected to generate a ready queue, which is then processed completely before a new scheduling is done. Each selected URL in the ready queue thus fulfills these requirements:

This implements a variant of BreathFirst crawling where a page is fetched if and only if a certain time threshold is exceeded since the last access to the server of that page.

4.5 Built-in topic filter – automated subject classification using string matching

The built-in topic filter is an approach to automated classification, that uses a topic definition with a pre-defined controlled vocabulary of topical terms, to determine relevance judgement. Thus it does not rely on a particular set of seed pages, or a collection of pre-classified example pages to learn from. It does require that some of the seed pages are relevant and contain links into the topical area. One simple way of creating a set of seed pages would be to use terms from the controlled vocabulary as queries for a general-purpose search engine and take the result as seed pages.

The system for automated topic classification (overview in Figure 3), that determines topical relevance in the topical filter, is based on matching subject terms from a controlled vocabulary in a topic definition with the text of the document to be classified [3]. The topic definition uses subject classes in a hierarchical classification system (corresponding to topics) and terms associated with each subject class. Terms can be single words, phrases, or Boolean AND-expressions connecting terms. Boolean OR-expressions are implicitly handled by having several different terms associated with the same subject class (see section 4.5.1).

The algorithm works by string-to-string matching of terms and text in documents. Each time a match is found the document is awarded points based on which term is matched and in which structural part of the document (location) the match is found [10]. The points are summed to make the final relevance score of the document. If the score is above a cut-off value the document is saved in the database together with a (list of) subject classification(s) and term(s).


Figure 3: Overview of the automated topic classification algorithm

By providing a list of known relevant sites in the configuration file sitesOK.txt (located in the job specific configuration directory) the above test can be bypassed. It works by checking the host part of the URL against the list of known relevant sites and if a match is found the page is validated and saved in the database regardless of the outcome of the algorithm.

4.5.1 Topic definition

Located in /etc/combine/<jobname>/topicdefinition.txt. Topic definitions use triplets (term, relevance weight, topic-classes) as its basic entities. Weights are signed integers and indicate the relevance of the term with respect to the topic-classes. Higher values indicate more relevant terms. A large negative value can be used to exclude documents containing that term. Terms should be all lowercase.

Terms can be:

A Boolean OR-expression has to be entered as separate term triplets. The Boolean expression “polymer AND (atactic OR syndiotactic)” thus has to be translated into two separate triplets, one containing the term “polymer @and atactic”, and another with “polymer @and syndiotactic”.

Terms can include (Perl) regular expressions like:

It is important to understand that each triplet in the topic definition is considered by itself without any context, so they must each be topic- or sub-class specific in order to be useful. Subject neutral terms like “use”, “test”, “history” should not be used. If really needed they have to be qualified so that they become topic specific (see examples below).

Simple guidelines for creating the triplets and assigning weights are:

4.5.2 Topic definition (term triplets) BNF grammar

TERM-LIST :== TERM-ROW ’<cr>||#<char>+ ’<cr>||<cr >
WEIGHT :== [’-’]<integer>
TERMS :== TERM [’ @and ’ TERMS]*
TERM :== WORD ’ ’ [WORD]*
WORD :== <char>+||<char>+<perl-reg-exp>
CLASSID :== <char>+

A line that starts with ’#’ is ignored and so are empty lines.

<perl-reg-exp> is only supported by the plain matching algorithm described in section 4.5.4.

“CLASSID” is a topic (sub-)class specifier, often from a hierarchical classification system like Engineering Index16.

4.5.3 Term triplet examples
50: optical glass=A.14.5, D.2.2  
30: glass @and fiberoptics=D.2.2.8  
50: glass @and technical @and history=D.2  
50: ceramic materials @and glass=D.2.1.7  
-10000: glass @and art=A

The first line says that a document containing the term “optical glass” should be awarded 50 points for each of the two classes A.14.5 and D.2.2.

glass” as a single term is probably too general, qualify it with more terms like: “glass @and fiberoptics” , or “glass @and technical @and history” or use a phrase like “glass fiber” or “optical glass”.

In order to exclude documents about artistic use of glass the term “glass @and art” can be used with a (high) negative score.

An example from the topic definition for ’Carnivorous Plants’ using regular expressions is given below:

#This is a comment  
75: d\.?\s*californica=CP.Drosophyllum  
10: pitcher[^\s]*=CP  
-10: pitcher[^\s]* @and baseball=CP

The term “d\.?\s*californica” will match D californica, D. californica, D.californica etc.

The last two lines assure that a document containing “pitcher” gets 10 points but if the document also contains “baseball” the points are removed.

4.5.4 Algorithm 1: plain matching

This algorithm is selected by setting the configuration parameter
    classifyPlugIn = Combine::Check_record

The algorithm produces a list of suggested topic-classes (subject classifications) and corresponding relevance scores using the algorithm:

Relevance_score =

           (                                                              )
    ∑          ∑
           (         (hits[locationj][term  i]* weight[termi ]*weight [locationj]))
all locations  all terms

term weight
(weight[termi]) is taken from the topic definition triplets.
location weight
(weight[locationj]) are defined ad hoc for locations like title, metadata, HTML headings, and plain text. However the exact values for these weights does not seem to play a large role in the precision of the algorithm [10].
(hits[locationj][termi]) is the number of times termi occur in the text of locationj

The summed relevance score might, for certain applications, have to be normalized with respect to text size of the document.

One problem with this algorithm is that a term that is found in the beginning of the text contributes as much as a term that is found at the end of a large document. Another problem is the distance and thus the coupling between two terms in a Boolean expression might be very large in a big document and this is not taken into account by the above algorithm.

4.5.5 Algorithm 2: position weighted matching

This algorithm is selected by setting the configuration parameter
    classifyPlugIn = Combine::PosCheck_record

In response to the problems cited above we developed a modified version of the algorithm that takes into account word position in the text and proximity for Boolean terms. It also eliminates the need to assign ad hoc weights to locations. The new algorithm works as follows.

First all text from all locations are concatenated (in the natural importance order title, metadata, text) into one chunk of text. Matching of terms is done against this chunk. Relevance score is calculated as

Relevance_score =

         (                                                                 )
   ∑          ∑                           weight[termi]
         (           ------------------------------------------------------)
all terms all matches log(k * position[termi ][matchj])* proximity[termi ][matchj]

term weight
(weight[termi]) is taken from the topic definition triplets
(position[termi][matchj]) is the position in the text (starting from 1) for matchj of termi. The constant factor k is normally 0.5
(proximity[termi][matchj]) is
1 for non Boolean terms
log(distance_between_components)for Boolean terms

In this algorithm a matched term close to the start of text contributes more to the relevance score than a match towards the end of the text. And for Boolean terms the closer the components are the higher the contribution to the relevance score.

4.6 Built-in topic filter – automated subject classification using SVM

Topic filetring using SVM (Support Vector Machines) classifiers are supported using the SVMLight package17. This package has to be installed manually together with the Algorithm::SVMLight Perl module. For installation hints see CPAN SVMLight README18 or ’installing-algorithm-svmlight-linux-ubuntu19

SVM classifiers need a trained model before they can be used.

The procedure to get started is as follows:

4.7 Topic filter Plug-In API

The configuration variable classifyPlugIn (section 9.1.4) is used to find the Perl module that implements the desired topic filter. The value should be formatted as a valid Perl module identifier (i.e. the module must be somewhere in the Perl module search path). Combine will call a subroutine named ’classify’ in this module, providing an XWI-object as in parameter. An XWI-object is a structured object holding all information from parsing a Web-page. The subroutine must return either 0 or 1, where
  0: means record fails to meet the classification criteria, i.e. ignore this record
  1: means record is OK, store it in the database, and follow the links

More details on how to write a Plug-In can be found in the example classifyPlugInTemplate.pm (see Appendix A.2).

4.8 Analysis

Extra analysis, enabled by the configuration variable doAnalyse (section 9.1.6), tries to determine the language of the content and the country of the Web-server. Both are stored in the internal database.

4.9 Duplicate detection

Duplicates of crawled documents are automatically detected with the aid of a MD5-checksum calculated on the contents of the document.

The MD5-checksum is used as the master record key in the internal database thus preventing pollution with duplicate pages. All URLs for a page are stored in the record, and a page is not deleted from the database until the crawler has verified that it’s unavailable from all the saved URLs.

4.10 URL recycling

URLs for recycling come from 3 sources:

Automatic recycling of URLs is enabled by the configuration variable AutoRecycleLinks (section 9.1.2). It can also be done manually with the command
combineCtrl --jobname XXXX recyclelinks

The command combineCtrl --jobname XXXX reharvest marks all pages in the database for harvesting again.

4.11 Database cleaning

The tool combineUtil implements functionality for cleaning the database.

checks respectively restore consistency of the internal database.
deletes records from the database based on supplied parameters.
detects Web-server aliases in the database. All detected alias groups are added to the serveralias configuration (section 9.2.5). Records from aliased servers (except for the first Web-server) will be deleted.

4.12 Complete application – SearchEngine in a Box

The SearchEngine-in-a-Box22 system is based on the two systems Combine Focused Crawler and Zebra text indexing and retrieval engine23. This system allows you build a vertical search engine for your favorite topic in a few easy steps.

The SearchEngine-in-a-Box Web-site contains instructions and downloads to make this happen. Basically it makes use of the ZebraHost (see section 9.1.44) configuration variable which enables direct communication between the crawler and the database system and thus indexes records as soon as they are crawled. This also means that they are directly searchable.