10-Fold Cross Validation

This is a log of what I did to run a 10-fold cross-validation test of the perceptron vs. the GA for bug 2910, http://bugzilla.spamassassin.org/show_bug.cgi?id=2910 (-- JustinMason 21/01/04).

First, I checked out the source:

No Format

svn co https://svn.apache.org/repos/asf/incubator/spamassassin/trunk
cd trunk
perl Makefile.PL 
make
cd masses

get pgapack and install it as "masses/pgapack"; I just scp'd in an already-built tree I had here.
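
Something like this would do it (a sketch only; the host and path here are invented, and building pgapack from source works too):

No Format

# copy an already-built pgapack tree into masses/pgapack;
# "buildhost" and the remote path are hypothetical
scp -r buildhost:/home/jm/src/pgapack pgapack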

use the set-0 logs from the 2.60 GA run – taken from the rsync repository:

No Format

wc -l /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log \
      /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
 210442 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log
 354479 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log

we want about 2k in each bucket, otherwise it'll take weeks to complete. use split-log-into-buckets to juggle the log files in blocks of 10% until the size and ratio come down to around 2k ham : 2k spam per bucket.
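
As far as I know, split-log-into-buckets just deals the mass-check log lines out into N split-*.log files; the effect is roughly this (a sketch only; the real script also handles the log header lines, and the slightly uneven bucket sizes below suggest it randomises rather than strictly round-robins):

No Format

# rough sketch of the effect: deal non-header lines round-robin
# into 10 split-*.log bucket files
awk '!/^#/ { print > ("split-" ((++n - 1) % 10 + 1) ".log") }' input.log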

ham buckets first:

No Format

./tenpass/split-log-into-buckets 10 \
    < /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log
mv split-1.log new
./tenpass/split-log-into-buckets 10 < new
wc -l split-1.log
   2104 split-1.log

much better!

No Format

mv split-*.log ../../logs/nonspam-jm/

./tenpass/split-log-into-buckets 10 \
    < /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
mv split-1.log new
wc -l new
  35437 new

given this: splitting "new" into 10 again gives buckets of roughly 3544 lines, and 6 of those make 21264 lines. Split into 10 folds of ~2126 each, that roughly matches the ham buckets, giving an even ham:spam ratio for testing. let's do that.

No Format

./tenpass/split-log-into-buckets 10 < new
cat split-{1,2,3,4,5,6}.log > new
./tenpass/split-log-into-buckets 10 < new
wc -l split-1.log
   2126 split-1.log

perfect!

No Format

mv split-*.log ../../logs/spam-jm/

and doublecheck the log sizes:

No Format

wc -l ../../logs/*/*.log
   2104 ../../logs/nonspam-jm/split-1.log
   2103 ../../logs/nonspam-jm/split-10.log
   2106 ../../logs/nonspam-jm/split-2.log
   2103 ../../logs/nonspam-jm/split-3.log
   2102 ../../logs/nonspam-jm/split-4.log
   2105 ../../logs/nonspam-jm/split-5.log
   2102 ../../logs/nonspam-jm/split-6.log
   2103 ../../logs/nonspam-jm/split-7.log
   2103 ../../logs/nonspam-jm/split-8.log
   2104 ../../logs/nonspam-jm/split-9.log
   2126 ../../logs/spam-jm/split-1.log
   2127 ../../logs/spam-jm/split-10.log
   2126 ../../logs/spam-jm/split-2.log
   2126 ../../logs/spam-jm/split-3.log
   2128 ../../logs/spam-jm/split-4.log
   2126 ../../logs/spam-jm/split-5.log
   2126 ../../logs/spam-jm/split-6.log
   2126 ../../logs/spam-jm/split-7.log
   2126 ../../logs/spam-jm/split-8.log
   2125 ../../logs/spam-jm/split-9.log
  42297 total

looks fine.

The next step was to ensure the scores in ../rules are usable by the GA. I did this by grepping out all the rules that had scores of 0 in the tested score-set (score-set 0) and creating a new file, ../rules/99_ga_workaround_scores.cf, containing lines like so:

No Format

   score NAME_OF_RULE_1   0.0001
   score NAME_OF_RULE_2   0.0001
   score NAME_OF_RULE_3   0.0001
...

This is necessary so that the mass-check log files, which were generated with the same ruleset but possibly with some rules enabled that are now disabled, remain useful; this way, all rules in the ruleset are enabled again, and the GA will consider them as candidates for evolution.
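
For what it's worth, a hedged sketch of how such a file could be generated; the input path ../rules/50_scores.cf and the assumption that score-set 0 is the first value on each score line are both guesses at the layout:

No Format

# hypothetical: find rules whose score-set-0 value is 0 and re-enable them
# with a tiny non-zero score (path and score-line format are assumptions)
perl -ne 'print "score $1 0.0001\n" if /^\s*score\s+(\S+)\s+0(\.0*)?(\s|$)/' \
    ../rules/50_scores.cf > ../rules/99_ga_workaround_scores.cf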

now run the 10pass master script.

No Format

nohup sh -x ./tenpass/10pass-run &
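
Progress can be watched in the usual ways while it runs; nohup.out here is just nohup's default output file:

No Format

tail -f nohup.out        # follow the script's trace output
ls -l tenpass_results/   # scores.N files appear as each fold finishes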

Results will appear in "tenpass_results" – over the course of 4 days. (wink)

These will be:

  • scores.{1 .. 10}: scores and GA accuracy ratings output by GA
  • {ham,spam}.log.{1 .. 10}: validation log files for that set of scores

To perform the validation step, run

No Format

./tenpass/10pass-compute-tcr

This will compute an accuracy rating, using those scores and those validation log files, for the 10 folds. Output looks like:

No Format

# TCR: 14.173333  SpamRecall: 96.002%  SpamPrec: 99.367%  FP: 0.31%  FN: 2.01%
# TCR: 13.986842  SpamRecall: 96.143%  SpamPrec: 99.320%  FP: 0.33%  FN: 1.94%
# TCR: 15.865672  SpamRecall: 95.579%  SpamPrec: 99.608%  FP: 0.19%  FN: 2.22%
# TCR: 14.173333  SpamRecall: 95.532%  SpamPrec: 99.461%  FP: 0.26%  FN: 2.25%
# TCR: 15.748148  SpamRecall: 95.532%  SpamPrec: 99.608%  FP: 0.19%  FN: 2.25%
# TCR: 12.807229  SpamRecall: 95.014%  SpamPrec: 99.409%  FP: 0.28%  FN: 2.51%
# TCR: 14.561644  SpamRecall: 94.779%  SpamPrec: 99.654%  FP: 0.17%  FN: 2.63%
# TCR: 12.432749  SpamRecall: 94.309%  SpamPrec: 99.504%  FP: 0.24%  FN: 2.86%
# TCR: 14.358108  SpamRecall: 95.859%  SpamPrec: 99.414%  FP: 0.28%  FN: 2.08%
# TCR: 18.318966  SpamRecall: 95.953%  SpamPrec: 99.707%  FP: 0.14%  FN: 2.03%

These figures can be compared with other 10FCV runs; they're a good measurement of training accuracy. In other words, they're what you came for. (wink)
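
For reference, TCR is the "total cost ratio" measure from the spam-filtering literature; a hedged statement of the usual definition (the exact lambda the masses tools use may differ):

No Format

TCR = Nspam / (lambda * Nfalse_positives + Nfalse_negatives)

Higher is better; TCR above 1 means using the filter costs less than deleting all the spam by hand.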

10-Fold Testing With the Perceptron Instead of the GA

If all goes well, the Perceptron will take over from the GA as the main way we generate scores; in that case, this section will be obsolete.

copy ./tenpass/10pass-run to ./10pass-run-perceptron .

Change these lines:

No Format

  make clean >> make.output
  make >> make.output 2>&1
  ./evolve
  pwd; date

to

No Format

  make clean >> make.output
  make -C perceptron_c clean >> make.output
  make tmp/tests.h >> make.output 2>&1
  rm -rf perceptron_c/tmp; cp -r tmp perceptron_c/tmp
  make -C perceptron_c >> make.output
  ( cd perceptron_c ; ./perceptron -p 0.75 -e 100 )
  pwd; date

Change

No Format

  cp craig-evolve.scores tenpass_results/scores.$id

to

No Format

  perl -pe 's/^(score\s+\S+\s+)0\s+/$1/gs;' \
      < perceptron_c/perceptron.scores \
      > tenpass_results/scores.$id
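
For illustration, that substitution rewrites a hypothetical scores line (rule name and values invented here) like:

No Format

score SOME_RULE_NAME 0 2.599

into

No Format

score SOME_RULE_NAME 2.599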

(This substitution is required to work around an extra digit output by the perceptron app.) Then run ./10pass-run-perceptron . This one completes a lot more quickly (wink)

What Is 10-Fold Cross-Validation?

10-fold cross-validation (abbreviated "10FCV") is a system for testing trained classifiers. We use it in SpamAssassin development and QA.

The comp.ai.neural-nets FAQ covers it well in http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html :

No Format

Cross-validation
++++++++++++++++

In k-fold cross-validation, you divide the data into k subsets of
(approximately) equal size. You train the net k times, each time leaving
out one of the subsets from training, but using only the omitted subset to
compute whatever error criterion interests you. If k equals the sample
size, this is called "leave-one-out" cross-validation. "Leave-v-out" is a
more elaborate and expensive version of cross-validation that involves
leaving out all possible subsets of v cases.

In other words: take a testing corpus, divided into ham and spam, where each message has previously been hand-verified as being of the correct type (i.e. ham if it's in the ham corpus, spam if it's in the other one). Divide each corpus into k folds. (In SpamAssassin we generally use k=10, which is what pretty much everyone else uses anyway; it just seems to work well (wink).) Then run these 10 tests:

No Format

Train classifier on folds: 2 3 4 5 6 7 8 9 10; Test against fold: 1
Train classifier on folds: 1 3 4 5 6 7 8 9 10; Test against fold: 2
Train classifier on folds: 1 2 4 5 6 7 8 9 10; Test against fold: 3
Train classifier on folds: 1 2 3 5 6 7 8 9 10; Test against fold: 4
Train classifier on folds: 1 2 3 4 6 7 8 9 10; Test against fold: 5
Train classifier on folds: 1 2 3 4 5 7 8 9 10; Test against fold: 6
Train classifier on folds: 1 2 3 4 5 6 8 9 10; Test against fold: 7
Train classifier on folds: 1 2 3 4 5 6 7 9 10; Test against fold: 8
Train classifier on folds: 1 2 3 4 5 6 7 8 10; Test against fold: 9
Train classifier on folds: 1 2 3 4 5 6 7 8 9;  Test against fold: 10   
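
That schedule is mechanical, so for illustration a small shell loop can print it:

No Format

# print the 10FCV train/test schedule shown above
for t in $(seq 1 10); do
  train=""
  for f in $(seq 1 10); do
    [ "$f" -ne "$t" ] && train="$train $f"
  done
  echo "Train classifier on folds:$train; Test against fold: $t"
done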

We use 10FCV to test:

  • new tweaks to the "Bayesian" learning classifier (the BAYES_* rules)
  • new tweaks to the rescoring system (which is also a learning classifier, just at a higher level).

Traditionally, k-fold cross-validation uses a "train on k-1 folds, test on 1 fold" split; we use that for testing our rescoring system. For the BAYES rules, however, we use "train on 1 fold, test on k-1 folds": that classifier is very accurate once sufficiently trained, so with the traditional split it can be hard to get enough false positives and false negatives to distinguish improvements in accuracy.

So, for example, see RescoreTenFcv for a log of a sample 10-fold CV run against two SpamAssassin rescoring systems (the GA and the perceptron).