10-Fold Cross Validation

This is a log of what I did to run a 10-fold cross-validation test of the perceptron vs the GA when testing bug 2910, http://bugzilla.spamassassin.org/show_bug.cgi?id=2910 . Note that many things have changed since this log was created, and an attempt to re-run these commands verbatim will probably fail. (-- JustinMason 21/01/04).

First, I checked out the source:

svn co https://svn.apache.org/repos/asf/incubator/spamassassin/trunk
cd trunk
perl Makefile.PL 
cd masses

(Update Jul 2007: nowadays, that's at https://svn.apache.org/repos/asf/spamassassin/trunk .)

get pgapack and install as "masses/pgapack". I just scp'd in an already-built tree I had here.

use the set-0 logs from the 2.60 GA run – taken from the rsync repository:

wc -l /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log \
    /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
 210442 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log
 354479 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log

we want about 2k lines in each bucket, otherwise the run will take weeks to complete. use split-log-into-buckets (see SplitLogsIntoBuckets) to juggle the log files in blocks of 10% until the ratio and size are around 2k:2k.
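For illustration only, a 10-way split can be sketched as below. This is not the real ./tenpass/split-log-into-buckets script; split_round_robin is a hypothetical stand-in that deals lines out in turn, so its buckets come out more evenly sized than the counts recorded in this log.

```shell
# Hypothetical stand-in for ./tenpass/split-log-into-buckets (an assumption,
# not the real script): deal the lines of stdin round-robin into N files
# named split-1.log .. split-N.log in the current directory.
split_round_robin() {
    n="$1"
    awk -v n="$n" '{
        f = "split-" ((NR - 1) % n + 1) ".log"   # line 1 -> split-1.log, etc.
        print > f
    }'
}
```

Running it twice in a row, as done here with the real script, reduces a log to roughly 1% of its original size.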

ham buckets first:

./tenpass/split-log-into-buckets 10 \
    < /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log
mv split-1.log new
./tenpass/split-log-into-buckets 10 < new
wc -l split-1.log
   2104 split-1.log

much better!

mv split-*.log ../../logs/nonspam-jm/

./tenpass/split-log-into-buckets 10 \
    < /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
mv split-1.log new
wc -l new
  35437 new

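given this, splitting "new" into 10 once more (./tenpass/split-log-into-buckets 10 < new) gives logfiles of about 3544 lines each. we want 6 of those 10 logfiles to make about 21264 lines, which would result in a roughly even ham:spam ratio for testing. let's do that.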
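The choice of 6 logfiles is just rounding arithmetic, which could be sketched as follows. buckets_needed is a hypothetical helper, and the inputs are approximations taken from the counts in this log: about 21040 ham lines (10 buckets of roughly 2104) and about 3544 lines per spam logfile (35437 / 10).

```shell
# Hypothetical helper: how many spam logfiles of a given size are needed to
# roughly match a target number of ham lines (rounded to the nearest integer).
#   $1 = target line count, $2 = approximate lines per logfile
buckets_needed() {
    echo $(( ($1 + $2 / 2) / $2 ))
}

buckets_needed 21040 3544
```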

cat split-{1,2,3,4,5,6}.log > new
./tenpass/split-log-into-buckets 10 < new
wc -l split-1.log
   2126 split-1.log


mv split-*.log ../../logs/spam-jm/

and doublecheck the log sizes:

wc -l ../../logs/*/*.log
   2104 ../../logs/nonspam-jm/split-1.log
   2103 ../../logs/nonspam-jm/split-10.log
   2106 ../../logs/nonspam-jm/split-2.log
   2103 ../../logs/nonspam-jm/split-3.log
   2102 ../../logs/nonspam-jm/split-4.log
   2105 ../../logs/nonspam-jm/split-5.log
   2102 ../../logs/nonspam-jm/split-6.log
   2103 ../../logs/nonspam-jm/split-7.log
   2103 ../../logs/nonspam-jm/split-8.log
   2104 ../../logs/nonspam-jm/split-9.log
   2126 ../../logs/spam-jm/split-1.log
   2127 ../../logs/spam-jm/split-10.log
   2126 ../../logs/spam-jm/split-2.log
   2126 ../../logs/spam-jm/split-3.log
   2128 ../../logs/spam-jm/split-4.log
   2126 ../../logs/spam-jm/split-5.log
   2126 ../../logs/spam-jm/split-6.log
   2126 ../../logs/spam-jm/split-7.log
   2126 ../../logs/spam-jm/split-8.log
   2125 ../../logs/spam-jm/split-9.log
  42297 total

looks fine.
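Instead of eyeballing the list, the per-directory totals can be summed with a few lines of awk. This is a sketch only; summarize_buckets is a hypothetical name, and it assumes wc -l output in the format shown above, with the directory name as the second-to-last path component.

```shell
# Hypothetical helper: read `wc -l` output on stdin and print the total
# line count per parent directory, skipping wc's own "total" line.
summarize_buckets() {
    awk '$2 != "total" {
        n = split($2, parts, "/")
        sum[parts[n - 1]] += $1      # key on the directory holding the file
    }
    END {
        for (d in sum) printf "%s %d\n", d, sum[d]
    }' | sort
}
```

Used as `wc -l ../../logs/*/*.log | summarize_buckets`, the counts listed above add up to 21035 for nonspam-jm and 21262 for spam-jm, which is close to the 1:1 ratio being aimed for.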

Next step was to ensure the scores in ../rules are usable for the GA. I did this by grepping out all the rules that had scores of 0 in the tested score-set (score-set 0), and creating a new file called ../rules/99_ga_workaround_scores.cf containing lines like so:

   score NAME_OF_RULE_1   0.0001
   score NAME_OF_RULE_2   0.0001
   score NAME_OF_RULE_3   0.0001

This is necessary so that the mass-check log files, which were generated using the same ruleset but possibly with some rules enabled where they are now disabled, will still be useful; this way, all rules in the ruleset are again enabled, and will be considered by the GA as candidates for evolution.
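One way to generate such a file is sketched below. It is an approximation, not the exact grep used at the time: zero_scores_to_workaround is a hypothetical name, it assumes score lines of the form "score NAME s0 [s1 s2 s3]" where the first value belongs to score-set 0, and it ignores rules that have no explicit score line at all.

```shell
# Hypothetical helper: read rule files on stdin and emit a workaround
# "score NAME 0.0001" line for each rule whose first listed score
# (assumed to be score-set 0) is numerically zero.
zero_scores_to_workaround() {
    awk '$1 == "score" && NF >= 3 && $3 ~ /^[-+]?[0-9.]+$/ && $3 + 0 == 0 {
        printf "score %s 0.0001\n", $2
    }'
}
```

The output should be redirected to a file outside ../rules first and moved into place afterwards, so that the new file is not read back in as input.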

now run the 10pass master script.

nohup sh -x ./tenpass/10pass-run &

Results will appear in "tenpass_results" – over the course of 4 days. (wink)

These will be:

  • scores.{1 .. 10}: scores and GA accuracy ratings output by GA
  • {ham,spam}.log.{1 .. 10}: validation log files for that set of scores
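For orientation, the numbering above follows the usual k-fold layout: for each id, one bucket is held back for validation and the remaining buckets are used for training. The sketch below only prints that layout; fold_files is a hypothetical helper and is not taken from the 10pass-run script, whose internals may differ.

```shell
# Hypothetical helper: for fold $1 of $2 buckets, print which bucket is
# held back for validation and which ones are used for training.
fold_files() {
    id="$1"; n="$2"; i=1; train=""
    while [ "$i" -le "$n" ]; do
        if [ "$i" -ne "$id" ]; then
            train="$train split-$i.log"
        fi
        i=$((i + 1))
    done
    echo "test: split-$id.log train:$train"
}

# Print the layout for all 10 folds.
id=1
while [ "$id" -le 10 ]; do
    fold_files "$id" 10
    id=$((id + 1))
done
```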

To perform the validation step, run


This will compute an accuracy rating, using those scores and those validation log files, for the 10 folds. Output looks like:

# TCR: 14.173333  SpamRecall: 96.002%  SpamPrec: 99.367%  FP: 0.31%  FN: 2.01%
# TCR: 13.986842  SpamRecall: 96.143%  SpamPrec: 99.320%  FP: 0.33%  FN: 1.94%
# TCR: 15.865672  SpamRecall: 95.579%  SpamPrec: 99.608%  FP: 0.19%  FN: 2.22%
# TCR: 14.173333  SpamRecall: 95.532%  SpamPrec: 99.461%  FP: 0.26%  FN: 2.25%
# TCR: 15.748148  SpamRecall: 95.532%  SpamPrec: 99.608%  FP: 0.19%  FN: 2.25%
# TCR: 12.807229  SpamRecall: 95.014%  SpamPrec: 99.409%  FP: 0.28%  FN: 2.51%
# TCR: 14.561644  SpamRecall: 94.779%  SpamPrec: 99.654%  FP: 0.17%  FN: 2.63%
# TCR: 12.432749  SpamRecall: 94.309%  SpamPrec: 99.504%  FP: 0.24%  FN: 2.86%
# TCR: 14.358108  SpamRecall: 95.859%  SpamPrec: 99.414%  FP: 0.28%  FN: 2.08%
# TCR: 18.318966  SpamRecall: 95.953%  SpamPrec: 99.707%  FP: 0.14%  FN: 2.03%

These figures can be compared with other 10FCV runs; they're a good measurement of how accurate the generated scores are on mail that was held back from training. In other words, they're what you came for. (wink)
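To compare two runs it can help to boil the ten lines down to one. The sketch below averages TCR, FP and FN across folds; mean_of_folds is a hypothetical helper, it assumes exactly the line format shown above, and taking a plain mean is just one possible way to summarize the folds.

```shell
# Hypothetical helper: read "# TCR: ..." lines on stdin and print the
# per-fold mean of TCR, FP% and FN%.
mean_of_folds() {
    awk '/^# TCR:/ {
        tcr += $3
        fp  += $9     # awk converts "0.31%" to 0.31, ignoring the "%"
        fn  += $11
        n++
    }
    END {
        if (n) printf "folds=%d meanTCR=%.3f meanFP=%.3f%% meanFN=%.3f%%\n",
                      n, tcr / n, fp / n, fn / n
    }'
}
```

Saving the validation output to a file and running `mean_of_folds < that-file` gives a single line per run that is easy to put side by side.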

10-Fold Testing With The Perceptron Instead of GA

If all goes well, the Perceptron will take over from the GA as the main way we generate scores; in that case, this section will be obsolete.

copied ./tenpass/10pass-run to ./10pass-run-perceptron .

Changed these lines:

  make clean >> make.output
  make >> make.output 2>&1
  pwd; date

to:

  make clean >> make.output
  make -C perceptron_c clean >> make.output
  make tmp/tests.h >> make.output 2>&1
  rm -rf perceptron_c/tmp; cp -r tmp perceptron_c/tmp
  make -C perceptron_c >> make.output
  ( cd perceptron_c ; ./perceptron -p 0.75 -e 100 )
  pwd; date

and changed this line:

  cp craig-evolve.scores tenpass_results/scores.$id

to:

  perl -pe 's/^(score\s+\S+\s+)0\s+/$1/gs;' \
      < perceptron_c/perceptron.scores \
      > tenpass_results/scores.$id
(The perl substitution is required to work around an extra digit output by the perceptron app.) Then run ./10pass-run-perceptron . This one completes a lot more quickly (wink)