getcost, version 3.0                                       Sun May 4, 1997

==============================================================================
Foreword by RAM:

This version of getcost CANNOT be redistributed separately from the
mailagent it came with.  As part of the mailagent distribution, it may be
copied freely.

Darryl Okahata kindly granted me the permission to redistribute it as well
as a stripped-out sample of the spamconfig file.  This is what lies within
this directory, along with the README file.
==============================================================================

Getcost is a (somewhat kludgy) Perl script designed to detect SPAM.  It
reads in a single, incoming mail message from standard input, and prints a
"cost", or message score, to stdout.  If the cost is less than zero, the
message is considered to be SPAM.  If the cost is greater than or equal to
zero, it's not considered to be SPAM.

Getcost is designed to be used in conjunction with mail filtering programs
like deliver(8) or procmail(1).  Basically, if getcost prints a negative
number, the message is considered to be SPAM, and deliver(8) or procmail(1)
should file the message into a "junkmail" folder.  It is recommended that
the message *NOT* be deleted, as getcost can incorrectly classify a valid
message as SPAM (this is known as a "false positive").

The cost is computed from various "scores", which are stored in a
configuration file in the user's $HOME directory.  The configuration file
contains a list of pattern/score pairs.  Basically, if a header field or a
line in the message body matches a pattern, the corresponding score is
added to the "cost".  After processing all of the header and body lines
(body scanning can stop prematurely, to help with long messages), the total
cost is printed to stdout.
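To make the scoring model concrete, here is a minimal sketch in Python (not
the Perl that getcost itself is written in).  The names RULES, message_cost
and is_spam are hypothetical, the in-memory rule list is NOT the spamconfig
file syntax, and the sample patterns and scores are made up for
illustration.  It assumes the default behaviour of stopping at the first
matching pattern on each line.

```python
import re

# Hypothetical pattern/score pairs, for illustration only.
# This is not how spamconfig is written on disk.
RULES = [
    (re.compile(r"make money fast", re.I), -500),
    (re.compile(r"mailagent", re.I), 200),
]


def message_cost(lines, rules=RULES):
    """Add up the score of the first matching rule on each line."""
    cost = 0
    for line in lines:
        for pattern, score in rules:
            if pattern.search(line):
                cost += score
                break  # assumed default: one match per line
    return cost


def is_spam(cost):
    """A negative cost means SPAM; zero or more means not SPAM."""
    return cost < 0


sample = [
    "Subject: MAKE MONEY FAST",
    "This has nothing to do with mailagent.",
    "Make money fast, today!",
]
cost = message_cost(sample)
print(cost, "SPAM" if is_spam(cost) else "not SPAM")
```

With these made-up rules, two lines score -500 each and one scores +200, so
the message as a whole ends up below zero and would be filed as junk.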
Patterns can be matched against the following headers:

        From: & Reply-To:
        Sender:
        To: & Cc:
        Received:
        Subject:

In addition, special kludgy code exists to handle the headers:

        X-Openmail-Hops:
        Newsgroups: & Xref:
        References: & In-Reply-To: & X-Also-Posted-To:

Also, if 10 or more all-uppercase lines exist in the body, a score of
-10000 is added to the cost.  To be counted, an all-uppercase line must be
more than 10 characters long, *AND* no pattern may have matched it.

[ Yes, this is a bit kludgy, but, believe it or not, the code has greatly
  improved since it was first conceived. ]

Questionable code also exists to automatically mark the message as SPAM if
the To: header is more than 1000 characters long.  This code should
probably be deleted/modified.

Scanning stops as soon as the cost goes above 1000000 or below -100000.

This distribution includes three files:

        getcost         -- The SPAM-detecting engine.

        getcost.README  -- This file.

        spamconfig      -- A sample configuration file, with comments.  To
                           use this file, either copy it to
                           $HOME/.spamconfig, or use getcost with the "-f"
                           option (see below).  For debugging a
                           configuration file, the "-d" and "-D" options
                           are provided (see below).

Note that $HOME should be set when this script is executed, so that getcost
can locate its config file.  If $HOME is not set, the "-f" option (see
below) must be used.

***** Options:

-a      Apply all body rules, i.e. do not stop scoring for a body line at
        the first match.  Scoring always stops when the delta cost
        associated with a pattern match is 0.

-B      Penalize binary characters.  Each line that contains a character
        with bit 7 set gets a score of -100.  This is done only if no other
        patterns matched.

-b      Don't do any score processing on the message body.  Scoring is done
        only on the message header.

-c      Collect sentences and only apply patterns on full ones.
        This is useful in conjunction with -a, and also fights spam
        messages that put one word on each line specifically to defeat
        scoring filters which rely on consecutive words.  This has no
        impact on -L, which still focuses on physical lines in the message,
        not logical ones.

-D      (For debugging) Dump scanner code.  For performance reasons, the
        scanner for the message body is converted into Perl code at
        runtime.  This option dumps out the converted Perl code, in case
        syntax errors occur (which can be common, as the regular
        expressions are taken directly from the configuration file).

-d      (For debugging) Debug mode.  Prints out how each line is assigned a
        score.  This option is invaluable for seeing how getcost arrives at
        a cost.

-f configfile
        Use "configfile" instead of $HOME/.spamconfig.

-k      Keep applying the rules, even when the -1000000 threshold is
        reached.  Useful in conjunction with -d on a tty to debug a set of
        patterns.

-L      Penalize long lines.  Lines longer than 90 characters get assigned
        a cost of: -(10 + line_length).  For example, a 110-character line
        is assigned a cost of -(10 + 110) or -120.  This is done only if no
        other patterns matched.

-m      Multiple matches: apply a pattern everywhere on the line, not just
        once.  Useful with -c, to keep counting everything.
        NB: this re-uses the deprecated "-m" flag, which used to add 500 to
        the cost when your name appeared in the To: line.

-o      If the X-Openmail-Hops: header line exists, assign a cost of
        10000000 to the line.  Generally used to allow all openmail
        messages through.

-S      If the To: header contains the phrase "list suppressed", a cost of
        -1000000 is assigned.  This is of questionable use (but messages
        from spammers who use Eudora often have this).
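
The penalties that apply only when no pattern matched a line (-L, -B, and
the all-uppercase rule) can be sketched as follows.  This is an
illustration in Python, not the actual Perl code.  The function names are
hypothetical.  It assumes that the line length excludes the trailing
newline, and that "all-uppercase" means the line has at least one letter
and no lowercase ones.  How getcost combines these penalties when several
apply to the same line is not covered here.

```python
def long_line_penalty(line, limit=90):
    """-L: lines longer than `limit` cost -(10 + line_length)."""
    length = len(line)
    return -(10 + length) if length > limit else 0


def binary_penalty(raw_line):
    """-B: a line (as bytes) with any bit-7 character costs -100."""
    return -100 if any(byte & 0x80 for byte in raw_line) else 0


def uppercase_penalty(unmatched_lines, min_lines=10, min_length=10):
    """-10000 if at least `min_lines` all-uppercase lines, each more
    than `min_length` characters long, were left unmatched."""
    shouting = [
        line for line in unmatched_lines
        if len(line) > min_length and line.isupper()  # assumed definition
    ]
    return -10000 if len(shouting) >= min_lines else 0


print(long_line_penalty("x" * 110))            # the README's own example
print(binary_penalty(b"caf\xe9 au lait"))
print(uppercase_penalty(["BUY NOW AND SAVE"] * 10))
```

Each function is meant to be called only on lines that no pattern matched,
as the option descriptions above require.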