<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"  
  "http://www.w3.org/TR/html4/loose.dtd">  
<html > 
<head><title>Performance and scalability</title> 
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> 
<meta name="generator" content="TeX4ht (http://www.cse.ohio-state.edu/~gurari/TeX4ht/)"> 
<meta name="originator" content="TeX4ht (http://www.cse.ohio-state.edu/~gurari/TeX4ht/)"> 
<!-- html,2 --> 
<meta name="src" content="DocMain.tex"> 
<meta name="date" content="2009-06-16 09:20:00"> 
<link rel="stylesheet" type="text/css" href="DocMain.css"> 
</head><body 
>
   <h3 class="sectionHead"><span class="titlemark">6   </span> <a 
 id="x34-560006"></a>Performance and scalability</h3>
<!--l. 4--><p class="noindent" >Performance evaluation of the automated subject classification component is treated in section
<a 
href="DocMainse5.html#x31-430005">5<!--tex4ht:ref: autoclasseval --></a>.
<!--l. 7--><p class="noindent" >
   <h4 class="subsectionHead"><span class="titlemark">6.1   </span> <a 
 id="x34-570006.1"></a>Speed</h4>
<!--l. 9--><p class="noindent" >Performance, measured as the number of URLs handled per minute, depends heavily on
circumstances such as network load, capacity of the machine, the selection of URLs to
crawl, configuration details, and the number of crawlers used. In general, within rather wide
limits, you can expect the Combine system to handle up to 200 URLs per minute.
By &#8220;handle&#8221; we mean everything from scheduling of URLs, fetching pages over the
network, parsing the page, automated subject classification, and recycling of new links, to
storing the structured record in a relational database. This holds for anything from small, simple
crawls starting from scratch to large, complicated topic-specific crawls with millions of
records.
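<p class="indent">The figure of up to 200 URLs per minute gives a rough lower bound on how long a crawl of a given size will take. The following Python sketch only illustrates that arithmetic and is not part of Combine; the function name <span class="ectt-1095">min_crawl_hours</span> is hypothetical, and real crawls may run well below the quoted rate.</p>

```python
def min_crawl_hours(n_urls: int, urls_per_minute: float = 200.0) -> float:
    """Lower bound on the wall-clock hours needed to handle n_urls.

    The default rate is the "up to 200 URLs per minute" figure quoted in
    the text; it is an upper bound on speed, so the result is a lower
    bound on time.
    """
    if urls_per_minute <= 0:
        raise ValueError("urls_per_minute must be positive")
    return n_urls / urls_per_minute / 60.0


if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        print(f"{n:>9} URLs: at least {min_crawl_hours(n):.1f} h")
```

<p class="indent">At the quoted rate, a crawl of one million URLs needs at least about 83 hours.</p>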
<!--l. 22--><p class="indent" >   <hr class="figure"><div class="figure" 
><table class="figure"><tr class="figure"><td class="figure" 
>

<a 
 id="x34-570014"></a>

<div class="center" 
>
<!--l. 23--><p class="noindent" >
<!--l. 24--><p class="noindent" ><img 
src="DocMain9x.png" alt="PIC" class="graphics" width="366.70374pt" height="261.76135pt" ><!--tex4ht:graphics  
name="DocMain9x.png" src="CrawlerSpeed.ps"  
--></div>
<br /> <table class="caption" 
><tr style="vertical-align:baseline;" class="caption"><td class="id">Figure&#x00A0;4: </td><td  
class="content">Combine crawler performance, using no focus and configuration optimized for
speed.</td></tr></table><!--tex4ht:label?: x34-570014 -->

<!--l. 28--><p class="indent" >   </td></tr></table></div><hr class="endfigure">
<!--l. 30--><p class="indent" >   The main way to increase performance is to use more than one crawler for a job. This is
handled by the <span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--harvesters</span></span></span> switch, used together with the <span 
class="ectt-1095">combineCtrl start </span>command. For
example, <span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">combineCtrl</span><span 
class="ectt-1095">&#x00A0;--jobname</span><span 
class="ectt-1095">&#x00A0;MyCrawl</span><span 
class="ectt-1095">&#x00A0;--harvesters</span><span 
class="ectt-1095">&#x00A0;5</span><span 
class="ectt-1095">&#x00A0;start</span></span></span> will start 5 crawlers
working together on the job &#8216;MyCrawl&#8217;. The effect of using more than one crawler on
crawling speed is illustrated in figure <a 
href="#x34-570014">4<!--tex4ht:ref: crawlspeed --></a> and the resulting speedup is shown in table
<a 
href="#x34-570021">1<!--tex4ht:ref: speedup --></a>.
   <div class="table">

<!--l. 38--><p class="indent" >   <a 
 id="x34-570021"></a><hr class="float"><div class="float" 
><table class="float"><tr class="float"><td class="float" 
>

<div class="center" 
>
<!--l. 39--><p class="noindent" >
<div class="tabular"> <table class="tabular" 
cellspacing="0" cellpadding="0" rules="groups" 
><colgroup id="TBL-3-1g"><col 
id="TBL-3-1"></colgroup><colgroup id="TBL-3-2g"><col 
id="TBL-3-2"></colgroup><colgroup id="TBL-3-3g"><col 
id="TBL-3-3"></colgroup><colgroup id="TBL-3-4g"><col 
id="TBL-3-4"></colgroup><colgroup id="TBL-3-5g"><col 
id="TBL-3-5"></colgroup><colgroup id="TBL-3-6g"><col 
id="TBL-3-6"></colgroup><colgroup id="TBL-3-7g"><col 
id="TBL-3-7"></colgroup><tr 
class="hline"><td><hr></td><td><hr></td><td><hr></td><td><hr></td><td><hr></td><td><hr></td><td><hr></td></tr><tr  
 style="vertical-align:baseline;" id="TBL-3-1-"><td  style="white-space:nowrap; text-align:left;" id="TBL-3-1-1"  
class="td11"><span 
class="ecbx-1095">Number of crawlers</span></td><td  style="white-space:nowrap; text-align:center;" id="TBL-3-1-2"  
class="td11">1</td><td  style="white-space:nowrap; text-align:center;" id="TBL-3-1-3"  
class="td11"> 2 </td><td  style="white-space:nowrap; text-align:center;" id="TBL-3-1-4"  
class="td11"> 5 </td><td  style="white-space:nowrap; text-align:center;" id="TBL-3-1-5"  
class="td11">10</td><td  style="white-space:nowrap; text-align:center;" id="TBL-3-1-6"  
class="td11">15</td><td  style="white-space:nowrap; text-align:center;" id="TBL-3-1-7"  
class="td11"> 20 </td>
</tr><tr 
class="hline"><td><hr></td><td><hr></td><td><hr></td><td><hr></td><td><hr></td><td><hr></td><td><hr></td></tr><tr  
 style="vertical-align:baseline;" id="TBL-3-2-"><td  style="white-space:nowrap; text-align:left;" id="TBL-3-2-1"  
class="td11"><span 
class="ecbx-1095">Speedup         </span></td><td  style="white-space:nowrap; text-align:center;" id="TBL-3-2-2"  
class="td11">1</td><td  style="white-space:nowrap; text-align:center;" id="TBL-3-2-3"  
class="td11">2.0</td><td  style="white-space:nowrap; text-align:center;" id="TBL-3-2-4"  
class="td11">4.8</td><td  style="white-space:nowrap; text-align:center;" id="TBL-3-2-5"  
class="td11">8.2</td><td  style="white-space:nowrap; text-align:center;" id="TBL-3-2-6"  
class="td11">9.8</td><td  style="white-space:nowrap; text-align:center;" id="TBL-3-2-7"  
class="td11">11.0</td>
</tr><tr 
class="hline"><td><hr></td><td><hr></td><td><hr></td><td><hr></td><td><hr></td><td><hr></td><td><hr></td></tr><tr  
 style="vertical-align:baseline;" id="TBL-3-3-"><td  style="white-space:nowrap; text-align:left;" id="TBL-3-3-1"  
class="td11">               </td>
</tr></table></div></div>
<br /> <table class="caption" 
><tr style="vertical-align:baseline;" class="caption"><td class="id">Table&#x00A0;1: </td><td  
class="content">Speedup of crawling vs number of crawlers</td></tr></table><!--tex4ht:label?: x34-570021 -->

   </td></tr></table></div><hr class="endfloat" />
   </div>
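<p class="indent">The speedup figures in Table 1 can also be read as parallel efficiency, that is, speedup divided by the number of crawlers. The Python sketch below is illustrative only and not part of Combine; the function names are made up for this example, and the numbers are copied from Table 1.</p>

```python
# Figures copied from Table 1.
CRAWLERS = [1, 2, 5, 10, 15, 20]
SPEEDUP = [1.0, 2.0, 4.8, 8.2, 9.8, 11.0]


def efficiency(crawlers, speedup):
    """Parallel efficiency: speedup divided by the number of crawlers."""
    return [s / n for n, s in zip(crawlers, speedup)]


def marginal_gain(crawlers, speedup):
    """Speedup gained per extra crawler between adjacent table columns."""
    pairs = list(zip(crawlers, speedup))
    return [(s1 - s0) / (n1 - n0)
            for (n0, s0), (n1, s1) in zip(pairs, pairs[1:])]


if __name__ == "__main__":
    for n, e in zip(CRAWLERS, efficiency(CRAWLERS, SPEEDUP)):
        print(f"{n:>2} crawlers: efficiency {e:.2f}")
    print("gain per extra crawler:",
          [round(g, 2) for g in marginal_gain(CRAWLERS, SPEEDUP)])
```

<p class="indent">Efficiency stays close to 1 up to five crawlers and drops to 0.55 with twenty, so each additional crawler beyond about ten contributes noticeably less in this measurement.</p>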
<!--l. 50--><p class="indent" >   Configuration also affects performance. Figure <a 
href="#x34-570035">5<!--tex4ht:ref: config --></a> shows the performance improvements
obtained from configuration changes. The choice of algorithm for automated classification
turns out to have the biggest influence on performance: algorithm 2 &#8211; section <a 
href="DocMainse4.html#x19-350004.5.5">4.5.5<!--tex4ht:ref: pos --></a> &#8211;
(<span 
class="ectt-1095">classifyPlugIn = Combine::PosCheck_record </span>&#8211; Pos in Figure <a 
href="#x34-570035">5<!--tex4ht:ref: config --></a>) is much faster than
algorithm 1 &#8211; section <a 
href="DocMainse4.html#x19-340004.5.4">4.5.4<!--tex4ht:ref: std --></a> &#8211; (<span 
class="ectt-1095">classifyPlugIn = Combine::Check_record </span>&#8211; Std in Figure <a 
href="#x34-570035">5<!--tex4ht:ref: config --></a>).
Configuration optimization consisted of not using Tidy to clean HTML (<span 
class="ectt-1095">useTidy = 0</span>) and not
storing the original page in the database (<span 
class="ectt-1095">saveHTML = 0</span>). Tweaking other configuration
variables (such as disabling logging to the MySQL database with <span 
class="ectt-1095">Loglev = 0</span>) also affects
performance, but to a lesser degree.
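<p class="indent">Collected in one place, the speed-oriented settings named in this section would look roughly like the fragment below. This is only a summary of the variables mentioned in the text, assuming they are set in the job's configuration; whether each trade-off (no HTML cleaning, no stored original page, no database logging) is acceptable depends on the crawl.</p>

```
classifyPlugIn = Combine::PosCheck_record
useTidy = 0
saveHTML = 0
Loglev = 0
```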
<!--l. 62--><p class="indent" >   <hr class="figure"><div class="figure" 
><table class="figure"><tr class="figure"><td class="figure" 
>

<a 
 id="x34-570035"></a>

<div class="center" 
>
<!--l. 63--><p class="noindent" >
<!--l. 64--><p class="noindent" ><img 
src="DocMain10x.png" alt="PIC" class="graphics" width="366.70374pt" height="261.76135pt" ><!--tex4ht:graphics  
name="DocMain10x.png" src="Config.ps"  
--></div>
<br /> <table class="caption" 
><tr style="vertical-align:baseline;" class="caption"><td class="id">Figure&#x00A0;5: </td><td  
class="content">Effect of configuration changes on focused crawler performance, using 10 crawlers
and a topic definition with 2512 terms.</td></tr></table><!--tex4ht:label?: x34-570035 -->

<!--l. 68--><p class="indent" >   </td></tr></table></div><hr class="endfigure">
   <h4 class="subsectionHead"><span class="titlemark">6.2   </span> <a 
 id="x34-580006.2"></a>Space</h4>
<!--l. 72--><p class="noindent" >Storing structured records including the original document takes quite a lot of disk space. On
average, MySQL uses 25 kB per record. This includes the administrative overhead needed
for the operation of the crawler. A database with 100&#x00A0;000 records thus needs at least 2.5
GB on disk. Deciding not to store the original page in the database (<span 
class="ectt-1095">saveHTML = 0</span>)
gives considerable space savings: on average, 8 kB per record is used without the original
HTML.
<!--l. 98--><p class="indent" >   Exporting records in the ALVIS XML format further increases the size to 42 kB per record.
The slightly less redundant XML format <span 
class="ectt-1095">combine </span>uses 27 kB per record. Thus 100&#x00A0;000
records will generate a file of roughly 2.7 GB (combine) to 4.2 GB (ALVIS). The really compact Dublin Core format (<span 
class="ectt-1095">dc</span>)
generates 0.65 kB per record.
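<p class="indent">The per-record averages quoted in this subsection can be turned into a quick disk-space estimate. The Python sketch below is illustrative only and not part of Combine; the dictionary keys and the function name <span class="ectt-1095">estimate_gb</span> are made up for this example, and decimal units (1 GB = 1&#x00A0;000&#x00A0;000 kB) are assumed.</p>

```python
# Average size per record in kB, as quoted in the text.
KB_PER_RECORD = {
    "mysql_with_html": 25.0,     # database, original page stored
    "mysql_without_html": 8.0,   # database, saveHTML = 0
    "export_alvis": 42.0,        # ALVIS XML export
    "export_combine": 27.0,      # combine XML export
    "export_dc": 0.65,           # Dublin Core export
}


def estimate_gb(n_records: int, kind: str) -> float:
    """Estimated size in GB (decimal) for n_records of the given kind."""
    if n_records < 0:
        raise ValueError("n_records must be non-negative")
    return n_records * KB_PER_RECORD[kind] / 1_000_000


if __name__ == "__main__":
    for kind in KB_PER_RECORD:
        print(f"{kind:<20} {estimate_gb(100_000, kind):6.3f} GB")
```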
<!--l. 103--><p class="noindent" >
   <h4 class="subsectionHead"><span class="titlemark">6.3   </span> <a 
 id="x34-590006.3"></a>Crawling strategy</h4>
<!--l. 105--><p class="noindent" >In <span class="cite">[<a 
href="DocMainli2.html#XRafael06">19</a>]</span> four different crawling strategies are studied:
     <dl class="description"><dt class="description">
<span 
class="ecbx-1095">BreadthFirst</span> </dt><dd 
class="description">The simplest strategy for crawling. It does not utilize heuristics in deciding
     which URL to visit next. It uses the frontier as a FIFO queue, crawling links in the
     order in which they are encountered.
     </dd><dt class="description">
<span 
class="ecbx-1095">BestFirst</span> </dt><dd 
class="description">The basic idea is that given a frontier of URLs, the best URL according to some
     estimation criterion is selected for crawling, using the frontier as a priority queue. In
     this implementation, the URL selection process is guided by the topic score of the
     source page as calculated by Combine.
     </dd><dt class="description">
<span 
class="ecbx-1095">PageRank</span> </dt><dd 
class="description">The same as BestFirst, but ordered by PageRank calculated from the pages
     crawled so far.
     </dd><dt class="description">
<span 
class="ecbx-1095">BreadthFirstTime</span> </dt><dd 
class="description">A version of BreadthFirst. It is based on the idea of not accessing
     the same server during a certain period of time in order not to overload servers. Thus,
     a page is fetched if and only if a certain time threshold is exceeded since the last
     access to the server of that page.
     </dd></dl>
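<p class="indent">The difference between the strategies lies in how the frontier hands out the next URL. The Python sketch below is a minimal illustration of three of them and is neither Combine's implementation nor the code used in the cited study. All class and parameter names are hypothetical, the topic score is assumed to be supplied by the caller, and PageRank ordering is omitted because it needs the link graph of the pages crawled so far.</p>

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class BreadthFirstFrontier:
    """FIFO queue: URLs are crawled in the order they were encountered."""

    def __init__(self):
        self._queue = deque()

    def add(self, url, score=0.0):
        self._queue.append(url)          # score is ignored

    def next(self, now=0.0):
        return self._queue.popleft() if self._queue else None


class BestFirstFrontier:
    """Priority queue: the URL with the highest score is crawled first."""

    def __init__(self):
        self._heap = []
        self._count = 0                  # tie-breaker keeps insertion order

    def add(self, url, score=0.0):
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, self._count, url))
        self._count += 1

    def next(self, now=0.0):
        return heapq.heappop(self._heap)[2] if self._heap else None


class BreadthFirstTimeFrontier:
    """FIFO queue that skips servers accessed less than min_delay ago."""

    def __init__(self, min_delay):
        self._queue = deque()
        self._min_delay = min_delay
        self._last_access = {}           # host -> time of last fetch

    def add(self, url, score=0.0):
        self._queue.append(url)

    def next(self, now):
        # Return the oldest URL whose server is allowed to be contacted.
        for i in range(len(self._queue)):
            url = self._queue[i]
            host = urlsplit(url).netloc
            last = self._last_access.get(host)
            if last is None or now - last > self._min_delay:
                del self._queue[i]
                self._last_access[host] = now
                return url
        return None                      # every queued server is still waiting
```

<p class="indent">All three classes share the same <span class="ectt-1095">add</span>/<span class="ectt-1095">next</span> interface, so a simulated crawl loop can swap strategies without other changes. The time-based variant returns nothing when every queued server was contacted too recently, which is how it avoids overloading servers.</p>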
<!--l. 133--><p class="indent" >   Results from a simulated crawl (figure <a 
href="#x34-590016">6<!--tex4ht:ref: crawlstrategy --></a> from <span class="cite">[<a 
href="DocMainli2.html#XRafael06">19</a>]</span>) show that at first PageRank performs best
but BreadthFirstTime (which is used in Combine) prevails in the long run, although differences
are small.
<!--l. 137--><p class="indent" >   <hr class="figure"><div class="figure" 
><table class="figure"><tr class="figure"><td class="figure" 
>

<a 
 id="x34-590016"></a>

<div class="center" 
>
<!--l. 138--><p class="noindent" >
<!--l. 140--><p class="noindent" ><img 
src="DocMain11x.png" alt="PIC" class="graphics" width="398.34262pt" height="256.06845pt" ><!--tex4ht:graphics  
name="DocMain11x.png" src="crawl.ps"  
-->
<br /> <table class="caption" 
><tr style="vertical-align:baseline;" class="caption"><td class="id">Figure&#x00A0;6: </td><td  
class="content">Total number of relevant pages visited</td></tr></table><!--tex4ht:label?: x34-590016 -->
</div>

<!--l. 144--><p class="indent" >   </td></tr></table></div><hr class="endfigure">

<!--l. 1--><p class="indent" >   <a 
 id="tailDocMainse6.html"></a>   
</body></html>