BuzzSaw - Filters

Introduction

After each log entry has been parsed into its constituent parts it is next passed on to the data filtering stage of the importer pipeline. This is the stage at which decisions are made as to whether a log entry should be stored into the database.

If no filters have been specified then the importer process will import all entries into the database. Otherwise, if there are one or more filters, no entries are accepted by default: at least one filter must declare an interest in a log entry for it to be stored. Each log entry is passed through the entire stack of filters in the order in which they were specified; filtering does not stop after any one filter has declared an interest.
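
The acceptance rule can be sketched in plain Perl. This is only an illustration of the semantics described above, not the importer's actual code: the run_filters function and the use of code references in place of filter objects are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the entry should be stored, else 0.
# Each filter is a code reference which returns a true value to declare
# an interest in the entry.
sub run_filters {
  my ( $event, @filters ) = @_;

  # No filters configured: every entry is imported.
  return 1 if !@filters;

  my $votes = 0;
  for my $filter (@filters) {
    # Every filter runs, even after an earlier one has voted to keep.
    $votes++ if $filter->($event);
  }

  # At least one positive vote is needed for the entry to be stored.
  return $votes > 0 ? 1 : 0;
}

my $kernel = sub { ( $_[0]{program} // '' ) eq 'kernel' };
my $sshd   = sub { ( $_[0]{program} // '' ) eq 'sshd' };

my $event = { program => 'sshd', message => 'example message' };

print run_filters($event), "\n";                    # 1 (no filters)
print run_filters( $event, $kernel ), "\n";         # 0 (no interest)
print run_filters( $event, $kernel, $sshd ), "\n";  # 1 (one interested filter)
```

Note that the loop never exits early, which is what allows later filters to see and build on the results of earlier ones.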

As well as declaring an interest in a particular log entry, filters are permitted to attach tags and associate other information with an entry. This information will be stored in the database for later retrieval (e.g. when generating reports). It also means that subsequent filters can make decisions and do further processing based on the results of filters earlier in the pipeline. Note, however, that an entry will be accepted if ANY filter declares an interest; it is not possible for filters later in the stack to overturn the results of those earlier in the stack.

There are no limits on what processing you may do on each log entry. However, a filter is run on every entry and there might be millions to process, so it is worth doing only the minimum required if you want the entire process to complete in a reasonable amount of time.

Structure of a log entry

When an entry is successfully parsed the separate parts are placed into a simple Perl hash - for speed reasons this is not done in an Object-Oriented style. The following elements will be available for querying during the filtering stage.

Note that syslog entries come in a huge variety of styles, so some fields such as program and pid are not always specified. Even when they are present, it is not always easy to extract the information without making the regular expression wildly complicated. For details on how the strings are parsed, see the documentation for BuzzSaw::Parser and BuzzSaw::Parser::RFC3339.

Implementing a Filter

A BuzzSaw filter is implemented as a Perl class using the Moose Object-Oriented framework. It must implement the BuzzSaw::Filter role and provide a check() method. Complete examples are shown later in this section.

For every parsed log entry the check() method will be called with the following arguments:

  1. A reference to a hash containing values for the elements described above.
  2. The current number of positive votes cast in favour of storing the event.
  3. A reference to an array which provides further details on the decisions reached earlier in the filter stack. Each element in the array is itself a reference to an array where the first element is the name of the filter and the second element is the value returned for the vote (see below for details).

The method must return one of the following values:

1 (one) - $BuzzSaw::Report::VOTE_KEEP
Positive vote in favour of having the event stored. Any tags returned will also be stored.
0 (zero) - $BuzzSaw::Report::VOTE_NO_INTEREST
Negative vote, no interest in having the event stored. Any tags returned will be ignored.
-1 (negative one) - $BuzzSaw::Report::VOTE_NEUTRAL
Neutral, go with the result of other filters in the stack. Any tags returned will be stored.

The first two options are fairly straightforward. The third may seem a little peculiar but it becomes useful when you need to write a filter which is designed to make decisions based on the results of other filter modules which are placed earlier in the stack. For example, the BuzzSaw::Filter::UserClassifier module will classify the value of the userid field if it has been added by a previous filter (e.g. SSH or Cosign) and extra information will be associated with the event.
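
A minimal sketch of a filter body which returns the neutral vote is shown here. It is written as a plain subroutine so that it runs without Moose or BuzzSaw installed; in a real filter the logic would live in the check() method. The check_user_class name, the %staff_users lookup table, the user_class key and the local constant names are all hypothetical. Only the numeric vote values are taken from the list above.

```perl
use strict;
use warnings;

# Numeric vote values as listed above (local names for this sketch only).
my ( $KEEP, $NO_INTEREST, $NEUTRAL ) = ( 1, 0, -1 );

# Hypothetical lookup table used to classify user names.
my %staff_users = ( alice => 1 );

sub check_user_class {
  my ( $event, $votes, $results ) = @_;

  # Nothing to add unless an earlier filter wanted the entry
  # and a userid has been recorded for it.
  return $NO_INTEREST if $votes == 0;
  return $NO_INTEREST if !defined $event->{userid};

  my $class = $staff_users{ $event->{userid} } ? 'staff' : 'other';
  $event->{extra_info}{user_class} = $class;

  # Leave the keep/discard decision to the other filters, but
  # return a tag to be stored if the entry is kept.
  return ( $NEUTRAL, "user_$class" );
}

# Earlier results are [ filter name, vote ] pairs, as described above.
my $results = [ [ 'SSH', $KEEP ] ];
my $event   = { program => 'sshd', userid => 'alice' };

my ( $vote, @tags ) = check_user_class( $event, 1, $results );
print "vote=$vote tags=@tags\n";    # vote=-1 tags=user_staff
```

Because the vote is neutral, this filter on its own never causes an entry to be stored; it only enriches entries which some other filter has chosen to keep.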

Optionally, a filter may also return a list of tags (simple strings) which should be associated with this log entry when it is stored.

Here is a trivial first example: a filter which votes to keep the event if it has a value for the program field and that value is exactly the string kernel.

package BuzzSaw::Filter::Kernel;
use Moose;

with 'BuzzSaw::Filter';

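# Vote on a single parsed log entry.
sub check {
  my ( $self, $event, $votes, $results ) = @_;

  # The program field is not always present, so test for it before comparing.
  my $is_kernel = exists $event->{program} && $event->{program} eq 'kernel';

  # 1 is a vote to keep the event, 0 expresses no interest.
  return $is_kernel ? 1 : 0;
}

1;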

Tags

Returning a list of tags aids later searching and reporting. It is not obligatory, but it is clearly simpler to write an SQL query which states "show me all events with the 'authfail' tag" than it is to parse the various strings (again) to search for SSH login events which contain particular error messages. If nothing else, this stores the results of the filter process, which avoids duplicating code and effort in two different languages. The tags collected from all filters in the stack which express an interest in the entry are de-duplicated and stored in the tags table in the database.

Here is a slightly more involved version of the previous example which shows how to add a simple tag (named segfault) when the event message contains the word segfault.

package BuzzSaw::Filter::Kernel;
use Moose;

with 'BuzzSaw::Filter';

sub check {
  my ( $self, $event, $votes, $results ) = @_;

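  my $accept = 0;
  my @tags;

  # Only kernel entries are of interest; program is not always present.
  if ( exists $event->{program} && $event->{program} eq 'kernel' ) {
    push @tags, 'kernel';
    $accept = 1;

    # A constant pattern is compiled only once, so /o is not needed.
    if ( $event->{message} =~ m/segfault/ ) {
      push @tags, 'segfault';
    }
  }

  # The vote comes first, followed by any tags to store with the event.
  return ( $accept, @tags );
}

1;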
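
The way tags from several filters are combined can be sketched as follows. This is an illustration only: collect_tags is a hypothetical helper, not a BuzzSaw function, and the real importer may differ in ordering and in how the tags are stored. Each input is assumed to be a reference to the list a filter returned, with the vote first and the tags after it.

```perl
use strict;
use warnings;

# Hypothetical helper: merge the tags returned by several filters,
# dropping duplicates but keeping first-seen order.
sub collect_tags {
  my (@filter_results) = @_;

  my ( %seen, @tags );
  for my $result (@filter_results) {
    my ( $vote, @filter_tags ) = @{$result};

    # Tags returned with a "no interest" vote (0) are ignored.
    next if $vote == 0;

    push @tags, grep { !$seen{$_}++ } @filter_tags;
  }

  return @tags;
}

my @tags = collect_tags(
  [ 1,  'kernel', 'segfault' ],
  [ 0,  'ignored' ],
  [ -1, 'kernel', 'crash' ],
);
print "@tags\n";    # kernel segfault crash
```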

Extra Information

As mentioned previously, it is also possible to attach extra information to a log entry which is going to be stored. This is done via the extra_info hash element, which is a reference to a simple Perl hash of keys and string values. For example, the SSH filter uses this approach to store the source address for each SSH login event log entry. These keys and values will be stored in the extra_info table in the database. Extra information can be specified like this:

$event->{extra_info}{source_address} = '10.0.0.0';
$event->{extra_info}{auth_method}    = 'password';

Note that for data protection reasons the stored log messages are anonymised after a certain period of time. The tag data is assumed to be safe and is all kept. Currently all other extra information is considered to be risky and it is thus deleted when an event is anonymised. So, don't rely on the extra information being available for really long-term statistical analysis.

What to accept

It is very tempting, for the sake of speed and simplicity, to write a filter which just declares an interest in every event with the correct program string. In a few cases this might be the right thing to do, but more often it is better to do further filtering based on the message to see if it really is of genuine interest. BuzzSaw is designed to store only events of real interest. Filling the database with data for events you will never subsequently examine adds a lot more noise to the stored data, makes processing and reporting take longer, and is generally rather pointless. For example, a typical syslog can contain hundreds of varied entries related to the kernel, most of which are of little consequence. We are likely to be interested only in serious issues such as panics, oopses and out-of-memory conditions.

It is also worth noting that, in general, any program can insert a syslog entry containing any information it likes, so you should never completely trust the data.
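
A more selective kernel check might look like the following sketch. It is a plain subroutine rather than a full Moose filter so that it can be run on its own, and it is not the Kernel filter shipped with BuzzSaw. The check_kernel name, the tag names and the message patterns are assumptions; the exact wording of kernel messages varies between kernel versions.

```perl
use strict;
use warnings;

# Patterns for the kinds of kernel message assumed to be of interest.
my %interesting = (
  panic => qr/Kernel panic/,
  oops  => qr/\bOops\b/,
  oom   => qr/Out of memory/i,
);

sub check_kernel {
  my ($event) = @_;

  # Ignore anything which was not logged by the kernel.
  return 0 if ( $event->{program} // '' ) ne 'kernel';

  # Collect a tag for each pattern which matches the message.
  my $message = $event->{message} // '';
  my @tags = grep { $message =~ $interesting{$_} } sort keys %interesting;

  # Kernel entries matching none of the patterns are not stored.
  return 0 if !@tags;

  return ( 1, 'kernel', @tags );
}

my @routine = check_kernel(
  { program => 'kernel', message => 'usb 1-1: new high-speed USB device' } );
my @serious = check_kernel(
  { program => 'kernel', message => 'Out of memory: Kill process 1234 (java)' } );

print "@routine\n";    # 0
print "@serious\n";    # 1 kernel oom
```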

Performance Issues

If BuzzSaw is being used to process logs daily on a central server then these filter methods could potentially be called hundreds of thousands of times. Consequently, speed is of the essence. It is worth spending a little time considering whether you can achieve your goals with simple string equality checks (e.g. is the program string equal to "kernel") rather than regular expressions. Where regular expressions are required, make sure each one is compiled only once. The /o regular expression modifier does this for patterns which interpolate variables; it has no effect on a constant pattern, which Perl compiles once anyway. It is also well worth declaring the regular expressions globally using the qr operator. The SSH and Kernel filters which are shipped as part of the BuzzSaw package are good guides to best practice.
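
The sketch below combines these suggestions: a cheap string comparison first, then a single regular expression compiled once with qr at file scope. It is not the SSH filter shipped with BuzzSaw. The check_ssh name is hypothetical, and the pattern assumes the usual OpenSSH "Failed password" message wording, which may not match every log format.

```perl
use strict;
use warnings;

# Compiled once, when the file is loaded, rather than on every call.
my $FAILED_PASSWORD_RE =
  qr/^Failed password for (?:invalid user )?\S+ from (\S+) port \d+/;

sub check_ssh {
  my ($event) = @_;

  # String equality is generally cheaper than a regular expression match,
  # so use it to discard unrelated entries first.
  return 0 if ( $event->{program} // '' ) ne 'sshd';

  if ( ( $event->{message} // '' ) =~ $FAILED_PASSWORD_RE ) {
    $event->{extra_info}{source_address} = $1;
    $event->{extra_info}{auth_method}    = 'password';
    return ( 1, 'authfail' );
  }

  return 0;
}

my $event = {
  program => 'sshd',
  message => 'Failed password for root from 192.0.2.10 port 22 ssh2',
};
my @result = check_ssh($event);

print "@result $event->{extra_info}{source_address}\n";  # 1 authfail 192.0.2.10
```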