SpamAssassin Rules Project

(DRAFT - this part of the wiki is a discussion document, based on emails to dev list. Please feel free to add comments, but be sure to make clear that it's your opinion, by signing your name to them.)

The Problem

Here it is, stated by DuncanFindlay: 'SpamAssassin is not as effective as it could be because of the rules that are being used to detect spam. There are two problems here:

  1. The "not enough rules" problem: SpamAssassin does not have enough high quality spam-catching rules. Anecdotally, our FN ratio seems to be much higher with 3.1 than with 3.0 (we won't know for sure until the mass-checks are done). There may be a variety of reasons for this:
    • The SpamAssassin committers are not spending much time writing rules. Attempts to recruit people to become committers to write rules have been somewhat unsuccessful. We could always use more committers and contributors; what can we do to encourage more contribution?
    • People who do write rules for their own use are not willing to go through the fairly elaborate process required to submit them to SpamAssassin (this currently requires rules to go through bugzilla and then through 70_testing.cf and eventually into our distribution). What can we do to make this process easier and more inviting?

2. The "release cycle" problem: Any high quality rules that are incorporated into SpamAssassin are not distributed until the next release. Since rules and code are tied together, the release cycle for rules is too long. Submitted rules are not distributed while they are most effective, and rules lose their effectiveness too quickly.'

3. [added by LorenWilton] Once an actual rule is posted on the users list, it will lose about 80% of its effectiveness, usually within about 16 hours. Within a week it will be virtually useless. Sometimes the rule will regain some effectiveness a few months later, and in rare cases posting a rule will not affect the hit rate. But in general, posting a rule body in a publicly readable forum will negate the usefulness of the rule almost instantly.

4. [added by BobMenschel from others' discussion] SA rules development handles rules aimed at English-language spam best, since most SA rules developers who feed the distribution system speak and correspond in English, and the great majority of the testing corpora are in English. We're not as good at developing, validating, testing, or scoring rules in other languages.

The Solution

Based on the problem areas outlined above, here are the pages for each aspect of the problem, and proposed solutions:

Repository Organization

  • rules/core/ = standard rules directory
  • rules/sandbox/<username>/ = per-user sandboxes
  • rules/extra/<directory>/ = extra rule sets not in core

The proposal is for rules/core to become the rules directory for trunk (3.2 and later), pulled in via SVN externals, which will make its inclusion seamless in the standard SA tree. The sandbox is discussed further in RulesProjMoreInput. We'll want to discuss the structure and process behind creating new extras directories further once we reach a critical mass of committers in the rules project.
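
To make the proposed layout concrete, here is a minimal Python sketch that sorts a repository path into one of the three areas listed above. It is only an illustration, not part of any SpamAssassin tooling: the function name classify_rule_path, the username "alice", the extras directory "example_set" and the file names are all hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple


def classify_rule_path(path: str) -> Tuple[str, Optional[str]]:
    """Return (area, owner) for a path under rules/.

    area is "core", "sandbox" or "extra". owner is the sandbox username or
    the extras directory name, and None for core. Hypothetical helper that
    only mirrors the proposed layout.
    """
    parts = PurePosixPath(path).parts
    if len(parts) < 2 or parts[0] != "rules":
        raise ValueError(f"not under rules/: {path}")
    area = parts[1]
    if area == "core":
        # rules/core/ = standard rules directory, no per-owner level
        return ("core", None)
    if area in ("sandbox", "extra"):
        # rules/sandbox/<username>/ and rules/extra/<directory>/ both
        # need one more path component naming the owner or rule set
        if len(parts) < 3:
            raise ValueError(f"missing subdirectory after rules/{area}/: {path}")
        return (area, parts[2])
    raise ValueError(f"unknown area {area!r}: {path}")
```

For example, classify_rule_path("rules/sandbox/alice/70_testing.cf") returns ("sandbox", "alice"), while a path outside the three areas raises ValueError.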
