10-Fold Cross Validation

This is a log of what I did to run a 10-fold cross-validation test of the perceptron vs. the GA when testing bug 2910 (http://bugzilla.spamassassin.org/show_bug.cgi?id=2910). 10-fold cross-validation (abbreviated "10FCV") is a system for testing trained classifiers; we use it in SpamAssassin development and QA.

The comp.ai.neural-nets FAQ covers it well, in http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html; there's an excerpt at the end of this page. (-- JustinMason 21/01/04)

First, I checked out the source:

No Format

svn co https://svn.apache.org/repos/asf/incubator/spamassassin/trunk
cd trunk
perl Makefile.PL 
make
cd masses

Get pgapack and install it as "masses/pgapack". I just scp'd in an already-built tree I had here.
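For the record, the shape of that step; the hostname and tarball name below are placeholders, not real paths:

No Format

# copy in an already-built pgapack tree ("buildhost" and the tarball
# name are placeholders for wherever your built copy lives)
scp buildhost:pgapack-built.tar.gz .
tar xzf pgapack-built.tar.gz
mv pgapack-built pgapack    # must end up as masses/pgapack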

use the set-0 logs from the 2.60 GA run – taken from the rsync repository:

No Format

wc -l /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
 210442 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log
 354479 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log

we want about 2k in each bucket, otherwise it'll take weeks to complete. use split-log-into-buckets to juggle the log files in blocks of 10% to get the ratio and size to around 2k:2k.
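The arithmetic behind the two-pass split, as a quick sanity check against the wc figures above:

No Format

# 210442 ham lines: one 10-way split gives ~21044 lines per bucket;
# splitting one of those buckets 10 ways again gives ~2104 -- about 2k
echo $((210442 / 10))        # => 21044
echo $((210442 / 10 / 10))   # => 2104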

ham buckets first:

No Format

./tenpass/split-log-into-buckets 10 < /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log
mv split-1.log new
./tenpass/split-log-into-buckets 10 < new
wc -l split-1.log
   2104 split-1.log

much better!

No Format

mv split-*.log ../../logs/nonspam-jm/

./tenpass/split-log-into-buckets 10 < /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
mv split-1.log new
wc -l new
  35437 new

given this, splitting "new" 10 ways gives buckets of roughly 3544 lines, and 6 of those logfiles make 21264 lines, which would result in a roughly even ham:spam ratio for testing. let's do that.

No Format

./tenpass/split-log-into-buckets 10 < new
cat split-{1,2,3,4,5,6}.log > new
./tenpass/split-log-into-buckets 10 < new
wc -l split-1.log
   2126 split-1.log

perfect!

No Format

mv split-*.log ../../logs/spam-jm/

and double-check the log sizes:

No Format

wc -l ../../logs/*/*.log
   2104 ../../logs/nonspam-jm/split-1.log
   2103 ../../logs/nonspam-jm/split-10.log
   2106 ../../logs/nonspam-jm/split-2.log
   2103 ../../logs/nonspam-jm/split-3.log
   2102 ../../logs/nonspam-jm/split-4.log
   2105 ../../logs/nonspam-jm/split-5.log
   2102 ../../logs/nonspam-jm/split-6.log
   2103 ../../logs/nonspam-jm/split-7.log
   2103 ../../logs/nonspam-jm/split-8.log
   2104 ../../logs/nonspam-jm/split-9.log
   2126 ../../logs/spam-jm/split-1.log
   2127 ../../logs/spam-jm/split-10.log
   2126 ../../logs/spam-jm/split-2.log
   2126 ../../logs/spam-jm/split-3.log
   2128 ../../logs/spam-jm/split-4.log
   2126 ../../logs/spam-jm/split-5.log
   2126 ../../logs/spam-jm/split-6.log
   2126 ../../logs/spam-jm/split-7.log
   2126 ../../logs/spam-jm/split-8.log
   2125 ../../logs/spam-jm/split-9.log
  42297 total

looks fine. now run the 10pass master script.

No Format

nohup sh -x ./tenpass/10pass-run &

Results will appear in "tenpass_results" – over the course of 4 days. (wink)
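For reference, here's the rough shape of what 10pass-run does for each fold, pieced together from the fragments quoted in the next section. This is only a sketch, not the script itself; see masses/tenpass/10pass-run for the real thing, including the log-file shuffling omitted here:

No Format

# sketch: one rebuild-train-save cycle per held-out fold
for id in 1 2 3 4 5 6 7 8 9 10 ; do
  # (assemble the 9 training folds' logs here; details omitted)
  make clean >> make.output
  make >> make.output 2>&1
  ./evolve
  cp craig-evolve.scores tenpass_results/scores.$id
  pwd; date
done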

10-Fold Testing With The Perceptron Instead of GA

If all goes well, the Perceptron will take over from the GA as the main way we generate scores; in that case, this section will be obsolete.

Copied ./tenpass/10pass-run to ./10pass-run-perceptron.

Changed these lines:

No Format

  make clean >> make.output
  make >> make.output 2>&1
  ./evolve
  pwd; date

to

No Format

  make clean >> make.output
  make -C perceptron_c clean >> make.output
  make tmp/tests.h >> make.output 2>&1
  rm -rf perceptron_c/tmp; cp -r tmp perceptron_c/tmp
  make -C perceptron_c >> make.output
  ( cd perceptron_c ; ./perceptron -p 0.75 -e 100 )
  pwd; date

Change

No Format

  cp craig-evolve.scores tenpass_results/scores.$id

to

No Format

  perl -pe 's/^(score\s+\S+\s+)0\s+/$1/gs;' \
      < perceptron_c/perceptron.scores \
      > tenpass_results/scores.$id
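For instance, that rewrite turns a score line like this (the rule name and value are made up, purely to show the transformation):

No Format

score HYPOTHETICAL_RULE 0 2.599

into

No Format

score HYPOTHETICAL_RULE 2.599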

(That rewrite is required to work around an extra digit output by the perceptron app.) Then run ./10pass-run-perceptron. This one completes a lot more quickly (wink)

Here's the relevant excerpt from the comp.ai.neural-nets FAQ, http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-12.html :

No Format

Cross-validation
++++++++++++++++

In k-fold cross-validation, you divide the data into k subsets of
(approximately) equal size. You train the net k times, each time leaving
out one of the subsets from training, but using only the omitted subset to
compute whatever error criterion interests you. If k equals the sample
size, this is called "leave-one-out" cross-validation. "Leave-v-out" is a
more elaborate and expensive version of cross-validation that involves
leaving out all possible subsets of v cases.

In other words, take a testing corpus, divided into ham and spam; each message has previously been hand-verified as being of the correct type (e.g. ham if it's in the ham corpus, spam if in the other one). Divide each corpus into k folds. (In SpamAssassin, we generally use k=10, which is what pretty much everyone else does anyway; it just seems to work well (wink).) Then run these 10 tests:

No Format

Train classifier on folds: 2 3 4 5 6 7 8 9 10; Test against fold: 1
Train classifier on folds: 1 3 4 5 6 7 8 9 10; Test against fold: 2
Train classifier on folds: 1 2 4 5 6 7 8 9 10; Test against fold: 3
Train classifier on folds: 1 2 3 5 6 7 8 9 10; Test against fold: 4
Train classifier on folds: 1 2 3 4 6 7 8 9 10; Test against fold: 5
Train classifier on folds: 1 2 3 4 5 7 8 9 10; Test against fold: 6
Train classifier on folds: 1 2 3 4 5 6 8 9 10; Test against fold: 7
Train classifier on folds: 1 2 3 4 5 6 7 9 10; Test against fold: 8
Train classifier on folds: 1 2 3 4 5 6 7 8 10; Test against fold: 9
Train classifier on folds: 1 2 3 4 5 6 7 8 9;  Test against fold: 10   
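That rotation is mechanical; a few lines of shell can generate the whole schedule (nothing SpamAssassin-specific, just an illustration):

No Format

# enumerate the train/test splits for k=10-fold cross-validation
for test in $(seq 1 10) ; do
  train=$(seq 1 10 | grep -vx $test | tr '\n' ' ')
  echo "Train classifier on folds: $train; Test against fold: $test"
done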

We use 10FCV to test:

  • new tweaks to the "Bayesian" learning classifier (the BAYES_* rules)
  • new tweaks to the rescoring system (which is also a learning classifier, just at a higher level).

Traditionally, k-fold cross-validation uses "train on k-1 folds, test on 1 fold"; we use that for testing our rescoring system. However, for the BAYES rules, we invert it to "train on 1 fold, test on k-1 folds": that classifier is very accurate once sufficiently trained, so with the traditional split it can be hard to get a meaningful number of false positives and false negatives to distinguish improvements in accuracy.
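The inverted schedule is just the mirror image of the one above (again purely illustrative):

No Format

# BAYES-style: train on one fold, test against the other nine
for train in $(seq 1 10) ; do
  test=$(seq 1 10 | grep -vx $train | tr '\n' ' ')
  echo "Train classifier on fold: $train; Test against folds: $test"
done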

So, for example, see RescoreTenFcv for a log of a sample 10-fold CV run against two SpamAssassin rescoring systems (the GA and the perceptron).