These Rescoring Runs Have Finished

This is an old page, left for reference only.

Rescore Mass-checks for Set 2 and Set 3

The bayes+nonet and bayes+net mass-check runs for 3.0.0 have started! Here's the procedure you'll need to follow, if you wish to submit logs for the rescoring run:

First, send mail to <submit.at.spamassassin.org>, and ask for a log-submission account if you haven't already got one.

It's helpful, but not required, to have some or all of the following helper applications installed:

  • the Mail::SPF::Query module
  • the Net::DNS module
  • Razor
  • DCC
  • Pyzor
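
A quick way to see which of these helpers are visible on your machine is sketched below. The command names razor-check, dccproc, and pyzor are assumptions about a typical install and may differ on your system; the two Perl modules are tested by simply trying to load them.

```shell
# Report which optional helpers are available. None of them are required.

check_perl_module() {
    # "perl -M<module> -e 1" exits non-zero if the module cannot be loaded.
    if perl -M"$1" -e 1 2>/dev/null; then
        echo "found:   $1 (perl module)"
    else
        echo "missing: $1 (perl module)"
    fi
}

check_command() {
    # "command -v" succeeds only if the name resolves to something runnable.
    if command -v "$1" >/dev/null 2>&1; then
        echo "found:   $1"
    else
        echo "missing: $1"
    fi
}

check_perl_module Mail::SPF::Query
check_perl_module Net::DNS
check_command razor-check   # Razor client (assumed command name)
check_command dccproc       # DCC client (assumed command name)
check_command pyzor         # Pyzor client (assumed command name)
```

A "missing" line only means the related network tests will not fire for your logs; the mass-check itself can still run.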

If you're running nightly mass-checks, please feel free to disable them when running the rescore mass-check runs. Also, please note that the nightly submission accounts will work for rescore submissions as well.

Then run these commands:

  wget http://old.SpamAssassin.org/released/Mail-SpamAssassin-3.0.0-pre3.tar.gz
  tar xvfz Mail-SpamAssassin-3.0.0-pre3.tar.gz
  cd Mail-SpamAssassin-3.0.0
  perl Makefile.PL < /dev/null; make

  cd masses
  rm -rf spamassassin; mkdir spamassassin
  echo "use_bayes 1" > spamassassin/user_prefs
  echo "use_auto_whitelist 0" >> spamassassin/user_prefs
  rm -f ham.log spam.log

  ./mass-check --bayes --net -j 4 --restart=400 --after=1041397200 --all <targets>

<targets> is the list of directories, mboxes, etc., to scan, for example spam:dir:~/Mail/spam. See the comments at the top of "mass-check" for details.
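
As an illustration, a target list with one ham directory and one spam mbox could be assembled like this. The paths are hypothetical, and the class:format:location layout is inferred from the spam:dir: example above; the comments at the top of "mass-check" remain the authoritative description.

```shell
# Hypothetical corpus locations; substitute your own.
HAM_DIR="$HOME/Mail/ham"          # a directory with one message per file
SPAM_MBOX="$HOME/Mail/spam.mbox"  # a single mbox file

# Each target is written as class:format:location.
TARGETS="ham:dir:$HAM_DIR spam:mbox:$SPAM_MBOX"

# Print the command for review; drop the leading "echo" to actually run it.
# $TARGETS is left unquoted on purpose so it splits into two arguments
# (this assumes the paths contain no spaces).
echo ./mass-check --bayes --net -j 4 --restart=400 --after=1041397200 --all $TARGETS
```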

This takes a long time to run. Because of Bayes DB lock contention, you will not want too many processes running concurrently. The -j option controls the number of processes to use; -j 2 should be OK for a single-processor machine, since most of the time one process will be scanning while the other is writing to the DB. -j 4, as in the command above, may be better depending on network response speed. Also, if your Bayes DB isn't on an NFS filesystem, you will want to add lock_method flock to the user_prefs file so SpamAssassin can use the more efficient flock locking method.
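
One way to add that setting, sketched under the assumption that you are in the masses directory and have already written the other user_prefs lines (the earlier echo with a single ">" truncates the file, so run this afterwards):

```shell
PREFS=spamassassin/user_prefs

# Make sure the directory exists in case this is run on its own.
mkdir -p "$(dirname "$PREFS")"

# Append the setting only if it is not already present, so rerunning is harmless.
# Use this only when the Bayes DB is NOT on NFS.
grep -qx 'lock_method flock' "$PREFS" 2>/dev/null || echo "lock_method flock" >> "$PREFS"
```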

The --after=1041397200 option tells mass-check to ignore messages more than 18 months old (in this case, those dated before January 1, 2003). This is useful if your corpus has older messages intermingled with newer ones.
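
The value is a Unix timestamp (seconds since 1970-01-01 UTC), and 1041397200 corresponds to midnight on January 1, 2003 at UTC-5. If you want to check it or work out a different cutoff, something like the following should do; it assumes GNU date, since BSD and macOS date take different flags.

```shell
# GNU date is assumed here; BSD and macOS date take different flags.

# Midnight on January 1, 2003 at UTC-5 is the cutoff used in the commands above.
CUTOFF=$(date -d '2003-01-01 00:00:00 -0500' +%s)
echo "$CUTOFF"

# Decode a timestamp back into a readable UTC date to double-check it.
date -u -d "@$CUTOFF" '+%Y-%m-%d %H:%M:%S UTC'

# A cutoff relative to today, for example 18 months back.
date -d '18 months ago' +%s
```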

If you have an unusual network layout, you may need to specify trusted_networks and/or internal_networks in the spamassassin/user_prefs file, although SA should be able to infer them in most cases. If the spam hit rate for RCVD_IN_XBL is below roughly 10% to 15%, you might need to set these configuration parameters.
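
To estimate that hit rate from a finished log, a rough sketch follows. It assumes each result line in spam.log is space-separated with the comma-separated list of rule hits in the fourth field, that header lines start with "#", and that message paths contain no spaces; check a few lines of your own log before trusting the number. The function name hit_rate is made up for this example.

```shell
# Print how many messages in a mass-check log hit a given rule.
# Usage: hit_rate <logfile> <RULE_NAME>
hit_rate() {
    awk -v rule="$2" '
        /^#/   { next }    # skip header/comment lines
        NF < 4 { next }    # skip lines without a rule field
        {
            total++
            n = split($4, hits, ",")
            # Compare whole rule names so RCVD_IN_XBL_FOO does not count.
            for (i = 1; i <= n; i++) {
                if (hits[i] == rule) { matched++; break }
            }
        }
        END {
            if (total == 0) { print "no results"; exit }
            printf "%s: %d/%d (%.1f%%)\n", rule, matched, total, 100 * matched / total
        }
    ' "$1"
}

# Only run against spam.log if it exists in the current directory.
if [ -f spam.log ]; then
    hit_rate spam.log RCVD_IN_XBL
fi
```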

Once it finishes:

  USER="[whatever your username is]"
  RSYNC_PASSWORD="[whatever your password is]"
  export RSYNC_PASSWORD

  rsync -CPcvuzb ham.log $USER@rsync.spamassassin.org::submit/ham-bayes-net-$USER.log
  rsync -CPcvuzb spam.log $USER@rsync.spamassassin.org::submit/spam-bayes-net-$USER.log

Next, repeat the run without --net (skip the cd if you are still in the masses directory):

  cd masses
  rm -rf spamassassin; mkdir spamassassin
  echo "use_bayes 1" > spamassassin/user_prefs
  echo "use_auto_whitelist 0" >> spamassassin/user_prefs
  rm -f ham.log spam.log

  ./mass-check --bayes -j 2 --restart=400 --after=1041397200 --all <targets>

See the above notes for other options that may be useful.

Once it finishes:

  USER="[whatever your username is]"
  RSYNC_PASSWORD="[whatever your password is]"
  export RSYNC_PASSWORD

  rsync -CPcvuzb ham.log $USER@rsync.spamassassin.org::submit/ham-bayes-nonet-$USER.log
  rsync -CPcvuzb spam.log $USER@rsync.spamassassin.org::submit/spam-bayes-nonet-$USER.log

That's it!

The results for these two runs will need to be in by Wednesday July 28th, 2004.