These Rescoring Runs Have Finished
This is an old page, left for reference only.
Rescore Mass-checks for Set 2 and Set 3
The bayes+nonet and bayes+net mass-check runs for 3.0.0 have started! If you wish to submit logs for the rescoring run, here's the procedure to follow:
First, send mail to <submit.at.spamassassin.org>, and ask for a log-submission account if you haven't already got one.
It's helpful, but not required, to have some or all of the helper applications installed:
- the Mail::SPF::Query module
- the Net::DNS module
If you're running nightly mass-checks, please feel free to disable them when running the rescore mass-check runs. Also, please note that the nightly submission accounts will work for rescore submissions as well.
Then run these commands:
<targets> is the list of directories, mboxes, etc., like spam:dir:~/Mail/spam. See the comments at the top of "mass-check" for details.
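As an illustration of the target format, each target is a colon-separated triple of class, format, and location. The sketch below builds a target list from hypothetical corpus paths (the paths and the variable name `targets` are assumptions, not part of the original instructions) and prints each part so you can sanity-check the list before starting a long run. It relies on POSIX sh/bash word splitting, so it assumes paths without spaces.

```shell
# Hypothetical corpus paths; replace them with your own.
targets="spam:dir:$HOME/Mail/spam ham:dir:$HOME/Mail/ham ham:mbox:$HOME/Mail/old-ham.mbox"

# Split each target into class, format, and location for a quick sanity check.
for t in $targets; do
  class=${t%%:*}        # text before the first colon, e.g. "spam"
  rest=${t#*:}          # everything after the first colon
  format=${rest%%:*}    # e.g. "dir" or "mbox"
  location=${rest#*:}   # the path itself
  printf 'class=%s format=%s location=%s\n' "$class" "$format" "$location"
done
```

The same list can then be passed where the instructions say <targets>.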
This takes a long time to run. Because of Bayes DB lock contention, avoid running too many processes concurrently.
The -j option sets the number of processes to use. -j 2 should be OK for a single-processor machine, since most of the time one process will be scanning a message while the other is writing to the DB. -j 4 may be good, depending on network response speed. Also, if your Bayes DB isn't on an NFS filesystem, add lock_method flock to the user_prefs file so SpamAssassin can use the more efficient flock locking method.
The --after=1041397200 option tells mass-check to ignore messages dated before January 1, 2003, roughly 18 months before this run. This is useful if your corpus has older messages intermingled with your newer messages.
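The value given to --after is a Unix timestamp (seconds since 1970-01-01 UTC). A minimal sketch for decoding it, or for computing a cutoff for some other date, is shown here. It assumes GNU date; BSD and macOS date use different flags. The variable name `cutoff` is only illustrative.

```shell
# Decode the cutoff used in this run (GNU date syntax).
date -u -d @1041397200 '+%Y-%m-%d %H:%M:%S UTC'
# prints: 2003-01-01 05:00:00 UTC  (midnight January 1 at UTC-5)

# Compute a cutoff for another date, here midnight UTC on January 1, 2003.
cutoff=$(date -u -d '2003-01-01 00:00:00' +%s)
echo "--after=$cutoff"
# prints: --after=1041379200
```

The two values differ by 18000 seconds because the timestamp in the instructions corresponds to midnight at UTC-5 rather than midnight UTC.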
If you have an unusual network layout, you may need to specify internal_networks in the spamassassin/user_prefs file, although SA should be able to infer it in most cases. If you get less than a 10% to 15% spam hit rate for RCVD_IN_XBL, you might need to set it explicitly.
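For reference, an internal_networks entry in user_prefs takes one or more addresses or CIDR ranges. The fragment below is only a sketch: the addresses are placeholders from the ranges reserved for documentation, and you would substitute the relays that actually receive mail for your site.

```
# Placeholder addresses (documentation ranges); replace with your own mail relays.
internal_networks 192.0.2.0/24
internal_networks 198.51.100.17
```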
Once it finishes:
Next, redo without --net:
See the above notes for other options that may be useful.
Once it finishes:
The results for these two runs must be in by Wednesday, July 28th, 2004.