10-Fold Cross Validation

10-fold cross validation (abbreviated "10FCV") is a system for testing trained classifiers. We use it in SpamAssassin development and QA.

The FAQ covers it well:


In k-fold cross-validation, you divide the data into k subsets of
(approximately) equal size. You train the net k times, each time leaving
out one of the subsets from training, but using only the omitted subset to
compute whatever error criterion interests you. If k equals the sample
size, this is called "leave-one-out" cross-validation. "Leave-v-out" is a
more elaborate and expensive version of cross-validation that involves
leaving out all possible subsets of v cases.
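The k-fold procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not SpamAssassin code; `train` and `error` are hypothetical stand-ins for the real classifier-training step and error criterion:

```python
def k_fold_splits(data, k):
    """Partition `data` into k subsets of (approximately) equal size."""
    return [data[i::k] for i in range(k)]

def cross_validate(data, k, train, error):
    """Train k times, each time leaving one fold out of training and
    computing the error criterion only on that omitted fold."""
    folds = k_fold_splits(data, k)
    scores = []
    for i in range(k):
        held_out = folds[i]
        # Training set is everything except fold i.
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(training)
        scores.append(error(model, held_out))
    # Report the error averaged over the k held-out folds.
    return sum(scores) / k
```

With k equal to the sample size, each fold holds a single case and this degenerates into the "leave-one-out" variant mentioned in the FAQ excerpt.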

In SpamAssassin, we generally use k=10 – as pretty much everyone else does anyway (wink) – and we use 10FCV to test:

  • new tweaks to the "Bayesian" learning classifier (the BAYES_* rules)
  • new tweaks to the rescoring system (which is also a learning classifier, just at a higher level).

Traditionally, k-fold cross-validation uses a "train on k-1 folds, test on 1 fold" scheme; we use that for testing our rescoring system. For the BAYES rules, however, we invert it: "train on 1 fold, test on k-1 folds". That classifier is very accurate once sufficiently trained, so with the traditional split it can be hard to get a meaningful number of false positives and false negatives to distinguish improvements in accuracy.
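The inverted split is the same loop with the roles of the folds swapped. Again a minimal sketch with hypothetical `train` and `error` stand-ins, not the actual BAYES test harness:

```python
def inverted_cross_validate(data, k, train, error):
    """Train on a single fold, test on the remaining k-1 folds, so each
    round scores against a much larger held-out set."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        training = folds[i]  # train on just 1 fold
        # Test on everything else, giving k-1 folds' worth of messages
        # in which to observe false positives and false negatives.
        held_out = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(training)
        scores.append(error(model, held_out))
    return sum(scores) / k
```

Because each round tests on roughly (k-1)/k of the corpus instead of 1/k, even a rarely-erring classifier produces enough misclassifications per round for accuracy comparisons to be meaningful.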

See RescoreTenFcv for a log of a sample 10-fold CV run against two SpamAssassin rescoring systems (the GA and the perceptron).
