Spam detection mailets using Bayesian analysis techniques
BayesianAnalysis mailet
The BayesianAnalysis mailet scans a message and determines the probability that it is spam, using Bayesian probability theory techniques.
It is based upon the principles described in A Plan For Spam (http://www.paulgraham.com/spam.html) by Paul Graham, and has been extended to follow his Better Bayesian Filtering (http://paulgraham.com/better.html).
The analysis capabilities are based on token frequencies (the corpus) learned through a training process using the BayesianAnalysisFeeder mailet (see below) and stored in a JDBC database. During mailet initialization the corpus is loaded (built) from the database and kept in memory.
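To make the role of the token frequencies concrete, here is a minimal Python sketch of the scoring described in A Plan For Spam. It is an illustration only, not the mailet's Java code: the function names, the in-memory dictionaries standing in for the ham and spam tables, and the constants (0.4 for unknown tokens, 15 most interesting tokens, clamping to 0.01..0.99) follow Graham's essay, and the mailet's actual values may differ.

```python
from math import prod

# Value Graham's essay assigns to tokens that were never seen (assumption for this sketch).
UNKNOWN_TOKEN_PROBABILITY = 0.4


def token_probability(token, ham_counts, spam_counts, n_ham, n_spam):
    """Spam probability of a single token, as in A Plan For Spam."""
    good = 2 * ham_counts.get(token, 0)  # ham counts are doubled to reduce false positives
    bad = spam_counts.get(token, 0)
    if good + bad < 5:
        # Too rare to be meaningful: treat it like an unknown token.
        return UNKNOWN_TOKEN_PROBABILITY
    ham_ratio = min(1.0, good / max(n_ham, 1))
    spam_ratio = min(1.0, bad / max(n_spam, 1))
    return max(0.01, min(0.99, spam_ratio / (ham_ratio + spam_ratio)))


def spam_probability(tokens, ham_counts, spam_counts, n_ham, n_spam, max_interesting=15):
    """Combine the most interesting token probabilities into one message score."""
    probs = [
        token_probability(t, ham_counts, spam_counts, n_ham, n_spam)
        for t in set(tokens)
    ]
    # "Interesting" means far from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:max_interesting]
    spam_product = prod(top)
    ham_product = prod(1.0 - p for p in top)
    return spam_product / (spam_product + ham_product)
```

Here `ham_counts` and `spam_counts` map each token to its number of occurrences, and `n_ham` and `n_spam` are the numbers of ham and spam messages used for training.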
After a training session, the corpus must be rebuilt from the database in order to acquire the new frequencies. Every 10 minutes a special thread will check if any change was made to the database by the feeder, and rebuild the corpus for this mailet if necessary.
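The periodic rebuild can be pictured with the following Python sketch. How the mailet detects a change is not stated here, so the sketch assumes a hypothetical "last update" stamp that the feeder would bump on every training run; `CorpusCache`, `load_corpus`, `last_db_update` and `start_refresher` are illustrative names, not James APIs.

```python
import threading


class CorpusCache:
    """Keeps a corpus in memory and reloads it when the database has changed."""

    def __init__(self, load_corpus, last_db_update):
        # Both callables are hypothetical stand-ins for database access.
        self._load_corpus = load_corpus
        self._last_db_update = last_db_update
        self._lock = threading.Lock()
        self._loaded_at = last_db_update()
        self.corpus = load_corpus()

    def refresh_if_stale(self):
        """Reload the corpus only if the database stamp moved; return True if reloaded."""
        latest = self._last_db_update()
        if latest <= self._loaded_at:
            return False
        fresh = self._load_corpus()  # build outside the lock, swap inside it
        with self._lock:
            self.corpus = fresh
            self._loaded_at = latest
        return True


def start_refresher(cache, interval_seconds=600.0):
    """Start a daemon thread that checks every interval (10 minutes by default)."""
    stop = threading.Event()

    def run():
        # Event.wait returns False on timeout, so the loop ends once stop is set.
        while not stop.wait(interval_seconds):
            cache.refresh_if_stale()

    threading.Thread(target=run, daemon=True).start()
    return stop
```

Building the new corpus before taking the lock keeps analysis threads from waiting on the database while a rebuild is in progress.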
An org.apache.james.spam.probability mail attribute is created containing the computed spam probability as a java.lang.Double. A message header, named as specified in the headerName init parameter, is also created containing that probability in floating-point representation.
Initialization Parameters
The init parameters are as follows:
- <repositoryPath>: a URL pointing to the <data-source> containing the database tables used (typically db://maildb).
- <headerName>: the header name to add with the spam probability (default is X-MessageIsSpamProbability).
- <ignoreLocalSender>: true if you want to ignore messages coming from local senders (default is false). By local sender we mean a return-path with a local server part (server listed in <servernames> in config.xml).
- <maxSize>: the maximum message size (in bytes) that a message may have to be considered spam (default is 100000).
The probability of being spam is prepended to the subject if it is greater than 0.1 (10%).
The required tables are automatically created if not already there (see sqlResources.xml). The token field in both the ham and spam tables is case sensitive.
A James config.xml example
Here follows an example of config.xml definitions deploying the analysis mailet:
...
<mailet match="All" class="BayesianAnalysis" onMailetException="ignore">
  <repositoryPath>db://maildb</repositoryPath>
  <maxSize>200000</maxSize>
  <headerName>X-MessageIsSpamProbability</headerName>
  <ignoreLocalSender>true</ignoreLocalSender>
</mailet>

<mailet match="CompareNumericHeaderValue=X-MessageIsSpamProbability > 0.90" class="AddHeader" onMatchException="noMatch">
  <name>X-MessageIsSpam</name>
  <value>true</value>
</mailet>

<mailet match="CompareNumericHeaderValue=X-MessageIsSpamProbability > 0.99" class="ToProcessor" onMatchException="noMatch">
  <processor> spam </processor>
  <notice>Spam not accepted</notice>
</mailet>
...
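The effect of the two thresholds in the example above can be restated as a short Python sketch. It only mirrors the decision logic for illustration: `actions_for` and the action strings are hypothetical names, and returning no action for a missing or non-numeric header is this sketch's reading of onMatchException="noMatch".

```python
from email import message_from_string


def actions_for(raw_message, header_name="X-MessageIsSpamProbability"):
    """Return the actions the example configuration would take for a message."""
    message = message_from_string(raw_message)
    try:
        probability = float(message.get(header_name))
    except (TypeError, ValueError):
        # Header missing or not numeric: treated as "no match" in this sketch.
        return []
    actions = []
    if probability > 0.90:
        actions.append("add header X-MessageIsSpam: true")
    if probability > 0.99:
        actions.append("send to processor spam")
    return actions
```

Because the mailets are applied in sequence, a message scoring above 0.99 is both flagged with the header and handed to the spam processor.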
BayesianAnalysisFeeder mailet
The BayesianAnalysisFeeder mailet feeds ham OR spam messages to train the BayesianAnalysis mailet (see above).
The new token frequencies are stored in a JDBC database.
The Bayesian database tables are updated during training to reflect the new data. At the end the mail is destroyed (ghosted).
The correct approach is to send the original ham/spam message as an attachment to another message sent to the feeder; all the headers of the enveloping message will be removed and only the original message's tokens will be analyzed and used for feeding. This is because all the tokens of a message are examined by the BayesianAnalysis mailet (including headers), and hence the feeding process must be consistent.
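As an illustration of the attachment approach, the following Python sketch wraps an original message inside an enveloping message using the standard library's email package. The helper name `wrap_for_feeder`, the subject line and the body text are assumptions made for this example; most mail clients achieve the same result with "forward as attachment".

```python
from email.message import EmailMessage


def wrap_for_feeder(original, feeder_address, sender):
    """Build an enveloping message carrying the original as a message/rfc822 part."""
    envelope = EmailMessage()
    envelope["From"] = sender
    envelope["To"] = feeder_address
    envelope["Subject"] = "Training sample"  # arbitrary, per the text above it is discarded
    envelope.set_content("Original message attached.")
    # Attaching a message object keeps its headers and body intact,
    # so the feeder sees the same tokens the analysis mailet would see.
    envelope.add_attachment(original)
    return envelope
```

The resulting message can then be submitted to the server with any SMTP client, addressed to whichever feeder recipient has been configured.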
After a training session, the frequency corpus used by the BayesianAnalysis mailet must be rebuilt from the database, in order to take advantage of the new token frequencies. Every 10 minutes a special thread in the BayesianAnalysis mailet will check if any change was made to the database, and rebuild its corpus if necessary.
Only one message at a time is scanned (the database update activity is synchronized) in order to avoid too much database locking, as thousands of rows may be updated just for one message being fed.
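The one-message-at-a-time rule can be sketched in Python with a single lock around the update. This is only an analogy for the synchronized database update: `TokenStore` is a hypothetical class, and an in-memory counter stands in for the ham or spam table.

```python
import threading
from collections import Counter


class TokenStore:
    """In-memory stand-in for a token table, updated by one message at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.counts = Counter()
        self.messages = 0

    def feed(self, tokens):
        # The whole update for a message happens under one lock, so
        # concurrent feeders queue up instead of interleaving row updates.
        with self._lock:
            self.counts.update(tokens)
            self.messages += 1
```

Serializing at the message level trades throughput for simplicity: each message's many token updates are applied as one uninterrupted batch.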
Initialization Parameters
The init parameters are as follows:
- <repositoryPath>: a URL pointing to the <data-source> containing the database tables used (typically db://maildb).
- <feedType>: the type of message being fed. The possible values are either ham (good messages) or spam.
- <maxSize>: the maximum message size (in bytes) that a message may have to be considered spam (default is 100000).
A James config.xml example
Here follows an example of config.xml definitions deploying the feeder mailet:
...
<!-- "not spam" bayesian analysis feeder. -->
<mailet match="RecipientIs=not.spam@thisdomain.com" class="BayesianAnalysisFeeder">
  <repositoryPath> db://maildb </repositoryPath>
  <feedType>ham</feedType>
  <maxSize>200000</maxSize>
</mailet>

<!-- "spam" bayesian analysis feeder. -->
<mailet match="RecipientIs=spam@thisdomain.com" class="BayesianAnalysisFeeder">
  <repositoryPath> db://maildb </repositoryPath>
  <feedType>spam</feedType>
  <maxSize>200000</maxSize>
</mailet>
...
The previous example will allow the user to send messages to the server and use the recipient email address as the indicator for whether the message is ham or spam.
Using the example above, send good messages (ham, not spam) to the email address "not.spam@thisdomain.com" to pump good messages into the feeder, and send spam messages (spam, not ham) to the email address "spam@thisdomain.com" to pump spam messages into the feeder. It is a good idea to activate SMTP AUTH and replace thisdomain.com with a domain not listed as a server in <servernames> in config.xml: this way only authenticated users can feed the corpus. An example of addresses to use could be "ham@bayes.feeder" and "spam@bayes.feeder".