10-Fold Cross Validation
This is a log of what I did to run a 10-fold cross-validation test of the perceptron vs. the GA when testing bug 2910, http://bugzilla.spamassassin.org/show_bug.cgi?id=2910 (-- JustinMason 21/01/04).
First, I checked out the source:
No Format |
---|
svn co https://svn.apache.org/repos/asf/incubator/spamassassin/trunk
cd trunk
perl Makefile.PL
make
cd masses
|
Get pgapack and install it as "masses/pgapack". I just scp'd in an already-built tree I had here.
use the set-0 logs from the 2.60 GA run – taken from the rsync repository:
No Format |
---|
wc -l /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log \
/home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
210442 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log
354479 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
|
We want about 2k in each bucket; otherwise it'll take weeks to complete. Use split-log-into-buckets (see SplitLogsIntoBuckets) to juggle the log files in blocks of 10% to get the ratio and size to around 2k:2k.
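The split-log-into-buckets script itself isn't reproduced here. As a rough sketch of the idea only (an assumption on my part, since the real script distributes the log in blocks rather than line-by-line, and preserves mass-check comment lines), an awk one-liner dealing lines into 10 bucket files would look like:

```shell
# Hypothetical stand-in for tenpass/split-log-into-buckets: deal the input
# lines round-robin into 10 files named split-1.log .. split-10.log.
# (The real script works in blocks; this just illustrates the bucketing.)
awk -v n=10 '{ print > ("split-" ((NR - 1) % n + 1) ".log") }' input.log
```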
ham buckets first:
No Format |
---|
./tenpass/split-log-into-buckets 10 \
< /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log
mv split-1.log new
./tenpass/split-log-into-buckets 10 < new
wc -l split-1.log
2104 split-1.log
|
much better!
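The arithmetic checks out: splitting into 10 buckets and then re-splitting one bucket into 10 leaves roughly 1% of the original log in each final bucket.

```shell
# 210442 ham lines; two rounds of 10-way splitting = ~1% per final bucket
echo $((210442 / 10 / 10))
# prints "2104"
```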
No Format |
---|
mv split-*.log ../../logs/nonspam-jm/
./tenpass/split-log-into-buckets 10 \
< /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
mv split-1.log new
wc -l new
35437 new
|
Given this, we want to re-split "new" into 10 and take 6 of those logfiles, making 21264 lines, which would give a roughly even ham:spam ratio for testing. Let's do that.
No Format |
---|
./tenpass/split-log-into-buckets 10 < new
cat split-{1,2,3,4,5,6}.log > new
./tenpass/split-log-into-buckets 10 < new
wc -l split-1.log
2126 split-1.log
|
perfect!
No Format |
---|
mv split-*.log ../../logs/spam-jm/
|
and double-check the log sizes:
No Format |
---|
wc -l ../../logs/*/*.log
2104 ../../logs/nonspam-jm/split-1.log
2103 ../../logs/nonspam-jm/split-10.log
2106 ../../logs/nonspam-jm/split-2.log
2103 ../../logs/nonspam-jm/split-3.log
2102 ../../logs/nonspam-jm/split-4.log
2105 ../../logs/nonspam-jm/split-5.log
2102 ../../logs/nonspam-jm/split-6.log
2103 ../../logs/nonspam-jm/split-7.log
2103 ../../logs/nonspam-jm/split-8.log
2104 ../../logs/nonspam-jm/split-9.log
2126 ../../logs/spam-jm/split-1.log
2127 ../../logs/spam-jm/split-10.log
2126 ../../logs/spam-jm/split-2.log
2126 ../../logs/spam-jm/split-3.log
2128 ../../logs/spam-jm/split-4.log
2126 ../../logs/spam-jm/split-5.log
2126 ../../logs/spam-jm/split-6.log
2126 ../../logs/spam-jm/split-7.log
2126 ../../logs/spam-jm/split-8.log
2125 ../../logs/spam-jm/split-9.log
42297 total
|
looks fine.
The next step was to ensure the scores in ../rules were usable by the GA. I did this by grepping out all the rules with a score of 0 in the tested score-set (score-set 0), and creating a new file, ../rules/99_ga_workaround_scores.cf, containing lines like so:
No Format |
---|
score NAME_OF_RULE_1 0.0001
score NAME_OF_RULE_2 0.0001
score NAME_OF_RULE_3 0.0001
...
|
This is necessary so that the mass-check log files remain useful: they were generated using the same ruleset, but some rules that were enabled then may since have been disabled. This way, all rules in the ruleset are enabled again, and the GA will consider them as candidates for evolution.
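The grep step itself isn't shown in the original log. A hypothetical one-liner that would build such a file (assuming plain "score RULE_NAME n [n n n]" lines, where the first value is the score-set-0 score, and ignoring complications like mutable scores) might look like:

```shell
# Hypothetical sketch: find rules whose score-set-0 score is 0 and emit a
# tiny nonzero score for each, so the GA will consider them again.
# Assumes simple "score RULE_NAME n [n n n]" lines in the rules files.
awk '$1 == "score" && $3 == 0 { print "score", $2, "0.0001" }' \
    ../rules/*.cf > ../rules/99_ga_workaround_scores.cf
```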
now run the 10pass master script.
No Format |
---|
nohup sh -x ./tenpass/10pass-run &
|
Results will appear in "tenpass_results" – over the course of 4 days.
These will be:
- scores.{1 .. 10}: scores and GA accuracy ratings output by GA
- {ham,spam}.log.{1 .. 10}: validation log files for that set of scores
To perform the validation step, run
No Format |
---|
./tenpass/10pass-compute-tcr
|
This will compute an accuracy rating, using those scores and those validation log files, for the 10 folds. Output looks like:
No Format |
---|
# TCR: 14.173333 SpamRecall: 96.002% SpamPrec: 99.367% FP: 0.31% FN: 2.01%
# TCR: 13.986842 SpamRecall: 96.143% SpamPrec: 99.320% FP: 0.33% FN: 1.94%
# TCR: 15.865672 SpamRecall: 95.579% SpamPrec: 99.608% FP: 0.19% FN: 2.22%
# TCR: 14.173333 SpamRecall: 95.532% SpamPrec: 99.461% FP: 0.26% FN: 2.25%
# TCR: 15.748148 SpamRecall: 95.532% SpamPrec: 99.608% FP: 0.19% FN: 2.25%
# TCR: 12.807229 SpamRecall: 95.014% SpamPrec: 99.409% FP: 0.28% FN: 2.51%
# TCR: 14.561644 SpamRecall: 94.779% SpamPrec: 99.654% FP: 0.17% FN: 2.63%
# TCR: 12.432749 SpamRecall: 94.309% SpamPrec: 99.504% FP: 0.24% FN: 2.86%
# TCR: 14.358108 SpamRecall: 95.859% SpamPrec: 99.414% FP: 0.28% FN: 2.08%
# TCR: 18.318966 SpamRecall: 95.953% SpamPrec: 99.707% FP: 0.14% FN: 2.03%
|
These figures can be compared with other 10FCV runs; they're a good measurement of training accuracy. In other words, they're what you came for.
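For reference, TCR (total cost ratio) is defined as nspam / (lambda * nfp + nfn). Back-computing from the figures above, the script appears to use lambda = 5; the first line corresponds to roughly 13 false positives and 85 false negatives out of 2126 spam (these counts are inferred from the percentages, not taken from the script's source):

```shell
# TCR = nspam / (lambda * nfp + nfn); lambda = 5 reproduces the figures
# above.  nfp/nfn here are back-computed from the reported percentages.
awk 'BEGIN { nspam = 2126; nfp = 13; nfn = 85; lambda = 5
             printf "TCR: %.6f\n", nspam / (lambda * nfp + nfn) }'
# prints "TCR: 14.173333"
```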
10-Fold Testing With The Perceptron Instead of GA
If all goes well, the Perceptron will take over from the GA as the main way we generate scores; in that case, this section will be obsolete.
Copied ./tenpass/10pass-run to ./10pass-run-perceptron.
Changed these lines:
No Format |
---|
make clean >> make.output
make >> make.output 2>&1
./evolve
pwd; date
|
to
No Format |
---|
make clean >> make.output
make -C perceptron_c clean >> make.output
make tmp/tests.h >> make.output 2>&1
rm -rf perceptron_c/tmp; cp -r tmp perceptron_c/tmp
make -C perceptron_c >> make.output
( cd perceptron_c ; ./perceptron -p 0.75 -e 100 )
pwd; date
|
And changed
No Format |
---|
cp craig-evolve.scores tenpass_results/scores.$id
|
to
No Format |
---|
perl -pe 's/^(score\s+\S+\s+)0\s+/$1/gs;' \
< perceptron_c/perceptron.scores \
> tenpass_results/scores.$id
|
(The perl one-liner is required to work around an extra digit output by the perceptron app.) Then run ./10pass-run-perceptron; this one completes a lot more quickly than the GA run.
What Is 10-Fold Cross-Validation?
Ten-fold cross-validation (abbreviated "10FCV") is a system for testing trained classifiers. We use it in SpamAssassin development and QA.
The comp.ai.neural-nets FAQ covers it well, in http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html:
No Format |
---|
Cross-validation
++++++++++++++++
In k-fold cross-validation, you divide the data into k subsets of
(approximately) equal size. You train the net k times, each time leaving
out one of the subsets from training, but using only the omitted subset to
compute whatever error criterion interests you. If k equals the sample
size, this is called "leave-one-out" cross-validation. "Leave-v-out" is a
more elaborate and expensive version of cross-validation that involves
leaving out all possible subsets of v cases.
|
In other words, take a testing corpus, divided into ham and spam; each message has previously been hand-verified as being of the correct type (ham if it's in the ham corpus, spam if in the other one). Divide each corpus into k folds. (In SpamAssassin, we generally use k=10, which is what pretty much everyone else does anyway; it just seems to work well.) Then run these 10 tests:
No Format |
---|
Train classifier on folds: 2 3 4 5 6 7 8 9 10; Test against fold: 1
Train classifier on folds: 1 3 4 5 6 7 8 9 10; Test against fold: 2
Train classifier on folds: 1 2 4 5 6 7 8 9 10; Test against fold: 3
Train classifier on folds: 1 2 3 5 6 7 8 9 10; Test against fold: 4
Train classifier on folds: 1 2 3 4 6 7 8 9 10; Test against fold: 5
Train classifier on folds: 1 2 3 4 5 7 8 9 10; Test against fold: 6
Train classifier on folds: 1 2 3 4 5 6 8 9 10; Test against fold: 7
Train classifier on folds: 1 2 3 4 5 6 7 9 10; Test against fold: 8
Train classifier on folds: 1 2 3 4 5 6 7 8 10; Test against fold: 9
Train classifier on folds: 1 2 3 4 5 6 7 8 9; Test against fold: 10
|
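The schedule above is mechanical, so a short loop can generate it; for instance:

```shell
# Generate the 10FCV train/test schedule: each fold takes one turn as the
# held-out test set while the other nine are used for training.
for test in $(seq 1 10); do
    train=$(seq 1 10 | grep -v "^${test}\$" | xargs)
    echo "Train classifier on folds: ${train}; Test against fold: ${test}"
done
```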
We use 10FCV to test:
- new tweaks to the "Bayesian" learning classifier (the BAYES_* rules)
- new tweaks to the rescoring system (which is also a learning classifier, just at a higher level).
Traditionally, k-fold cross-validation uses "train on k-1 folds, test on 1 fold"; we use that for testing our rescoring system. For the BAYES rules, however, we use "train on 1 fold, test on k-1 folds": that classifier is very accurate when sufficiently trained, so otherwise it can be hard to get a meaningful number of false positives and false negatives to distinguish improvements in accuracy.
For a log of a sample 10-fold CV run against two SpamAssassin rescoring systems (the GA and the perceptron), see RescoreTenFcv.