Circa

Presentation

Circa is a search engine for your own Web site or for a list of sites. It builds a full-text index, much as AltaVista does. It reads each page, extracts every URL found in it, and adds those URLs to the index when they point to the same server.

Circa is free software, released under the GNU license.

Features

  • Full-text indexing.
  • Configurable weights for the title, keywords, description, and the rest of the HTML page.
  • Boolean query language: or (default), and ("+"), not ("-"). Example: perl +faq -cgi returns documents that contain faq, possibly perl, and not cgi.
  • Supports the HTTP and FTP protocols.
  • Stores the index in MySQL.
  • Perl or PHP client.
  • Reads HTML and plain text.
  • Can index a filesystem directly, without going through the Web server.
  • Can browse a site by directory or category.
  • Several kinds of indexing: full, incremental, or restricted to a particular server. Documents that have not been updated are not reindexed. Every file is first requested with an HTTP HEAD request to obtain information such as validity, last update, and size.
  • The size of documents read can be limited (for example, skip all documents larger than 5 MB). This is useful on low-bandwidth connections or on computers with little memory.
  • The HTML template can easily be customized to your needs.
  • Search by different criteria: news, last modified date, language, URL / site.
  • Admin functions are available through a browser interface or on the command line.
  • Full support for the robots exclusion standard (robots.txt). The robot identifies itself as CircaIndexer/0.1, with contact address alian@alianwebserver.com.
  • Requests to the same server are spaced 8 seconds apart. "It's not a bug, it's a feature!" This is a basic rule for limiting HTTP server load.
  • Indexes the different links found in a CGI (everything after name_of_file?).
  • Supports HTTP proxies.
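
The boolean query syntax in the feature list can be illustrated with a short sketch. It is written in Python for brevity and is not Circa's own Perl code. The function names parse_query and matches are hypothetical, and the matching rule is an assumption drawn from the example in the list: a "+" word is required, a "-" word is forbidden, and a bare word is optional unless the query has no required word at all.

```python
def parse_query(query):
    """Split a query into (optional, required, excluded) sets of words."""
    optional, required, excluded = set(), set(), set()
    pending = None  # a lone "+" or "-" applies to the next word
    for token in query.lower().split():
        if token in ("+", "-"):
            pending = token
            continue
        if token[0] in "+-" and len(token) > 1:
            prefix, word = token[0], token[1:]
        else:
            prefix, word = pending, token
        pending = None
        if prefix == "+":
            required.add(word)
        elif prefix == "-":
            excluded.add(word)
        else:
            optional.add(word)
    return optional, required, excluded


def matches(query, document_words):
    """Return True if a document (an iterable of words) satisfies the query."""
    optional, required, excluded = parse_query(query)
    words = {w.lower() for w in document_words}
    if excluded & words:
        return False              # a forbidden word is present
    if required:
        return required <= words  # every required word must be present
    return bool(optional & words) # plain "or": any one word is enough
```

With this sketch, matches("perl +faq -cgi", ["faq", "perl"]) and matches("perl +faq -cgi", ["faq"]) are both true, while a document containing cgi, or lacking faq, is rejected.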

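The incremental indexing and size limit described in the feature list both rely on the answer to a HEAD request. The sketch below shows one way such a decision could be made. It is written in Python, not taken from Circa, and the name should_fetch, the header handling, and the 5 MB default are assumptions for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

MAX_BYTES = 5 * 1024 * 1024  # the 5 MB figure used as an example in the list


def should_fetch(headers, last_indexed=None, max_bytes=MAX_BYTES):
    """Decide from HEAD response headers whether to download a document.

    headers: dict of header name -> value, as returned by a HEAD request.
    last_indexed: timezone-aware datetime of the previous indexing, or None.
    """
    headers = {name.lower(): value for name, value in headers.items()}

    # Skip documents larger than the configured limit.
    size = headers.get("content-length")
    if size is not None and size.isdigit() and int(size) > max_bytes:
        return False

    # Skip documents that have not changed since they were last indexed.
    modified = headers.get("last-modified")
    if last_indexed is not None and modified is not None:
        try:
            if parsedate_to_datetime(modified) <= last_indexed:
                return False
        except (TypeError, ValueError):
            return True  # unreadable date: fetch to be safe
    return True
```

A document with no Last-Modified header, or one never indexed before, is always fetched, so the sketch errs on the side of reindexing.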
To do

  • Support for NNTP
  • Support for different character sets
  • Support for other databases

Requirements

  • MySQL
  • Perl
  • Perl modules: DBI, DBD::mysql, LWP::RobotUA, HTML::LinkExtor

Benchmark

Memory: 5.5 MB while indexing.
Processor: on a Sun SPARCstation 4, each indexed URL takes 5 seconds at 2% CPU, 2 seconds at 20%, and 1 second at 30%.
Size in MySQL: 2-5 KB per URL.
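
As a rough worked example, the figures above can be turned into an estimate for a whole site. The sketch below is in Python and rests on two assumptions that the benchmark does not state: the three CPU phases run one after the other (8 seconds per URL in total), and both time and size grow linearly with the number of URLs. The name estimate is hypothetical.

```python
SECONDS_PER_URL = 5 + 2 + 1  # the three phases quoted in the benchmark
KB_PER_URL = (2, 5)          # lower and upper index size per URL in MySQL


def estimate(url_count):
    """Return (hours, min_mb, max_mb) for indexing url_count URLs."""
    hours = url_count * SECONDS_PER_URL / 3600
    min_mb = url_count * KB_PER_URL[0] / 1024
    max_mb = url_count * KB_PER_URL[1] / 1024
    return hours, min_mb, max_mb
```

Under these assumptions, 10,000 URLs would take about 22 hours on the benchmark machine and occupy roughly 20 to 49 MB in MySQL.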

Building the index is heavy work, so it is not suited to running as a CGI. Use admin.pl to update the index; if you do not have telnet access, try launching the process in the background from another CGI. Alternatively, install MySQL on a local machine, build your index there, and export it to your search machine.

Install

  • Download one of the archive files and uncompress it.
  • Edit search.cgi and search.pl (search scripts) and admin.cgi and admin.pl (admin scripts) to set your MySQL parameters: user, password, database, and IP address if different from 'localhost'.
  • Run admin.cgi (CGI interface) or admin.pl (command line) to add your URLs, drop or create tables, and so on. I recommend using admin.pl on the command line, because indexing can take a long time and is poorly suited to CGI.
  • Run search.cgi. You can use the default form in your own page. Only the 'words' field is required.
  • For customized HTML results, see the file circa.htm.
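
Since only the 'words' field is required, a link to a search can be built from that one parameter. The sketch below is in Python and is an illustration only: the host and path are placeholders, the name build_search_url is hypothetical, and it assumes search.cgi accepts the field in the query string of a GET request, which the text above does not confirm.

```python
from urllib.parse import urlencode


def build_search_url(search_cgi_url, words):
    """Build a search URL carrying only the required 'words' field."""
    return search_cgi_url + "?" + urlencode({"words": words})


# Placeholder location of search.cgi; replace with your own installation.
url = build_search_url("http://www.example.com/cgi-bin/search.cgi",
                       "perl +faq -cgi")
```

urlencode escapes the "+" operator as %2B and encodes spaces as "+", so the query operators survive the trip through the URL.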

Documentation

POD documentation is available; run pod2html name_of_file.pm > name_of_file.html to read it.

Download

If you have root privileges and can install Perl modules, you can install these two modules: Circa::Search and Circa::Indexer. See the demo directory for how to use them. Install Circa::Indexer first.

Otherwise, you can use this distribution:

ZIP format or tar.gz format

Author

Alain BARBET alian@alianwebserver.com

Reference

Robot rules and security:

http://info.webcrawler.com/mak/projects/robots/robots.html

Features:

http://search.mnogo.ru/features.html

Why?

I had read about the need for such a tool, I needed one for AlianWebServer, and I think other people need one too.

 
   
 
 