10-Fold Cross Validation

10-fold cross validation (abbreviated "10FCV") is a system for testing trained classifiers. We use it in SpamAssassin development and QA.

The FAQ covers it well:


In k-fold cross-validation, you divide the data into k subsets of
(approximately) equal size. You train the net k times, each time leaving
out one of the subsets from training, but using only the omitted subset to
compute whatever error criterion interests you. If k equals the sample
size, this is called "leave-one-out" cross-validation. "Leave-v-out" is a
more elaborate and expensive version of cross-validation that involves
leaving out all possible subsets of v cases.
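The k-fold procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not SpamAssassin code; `train` and `error` are hypothetical stand-ins for the real classifier-training step and error criterion:

```python
def k_fold_splits(data, k):
    """Partition `data` into k subsets of (approximately) equal size."""
    return [data[i::k] for i in range(k)]

def cross_validate(data, k, train, error):
    """Train k times, each time leaving one fold out of training and
    computing the error criterion only on that omitted fold."""
    folds = k_fold_splits(data, k)
    scores = []
    for i in range(k):
        held_out = folds[i]
        # Training set is everything except fold i.
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(training)
        scores.append(error(model, held_out))
    # Report the error averaged over the k held-out folds.
    return sum(scores) / k
```

With k equal to the sample size, each fold holds a single case and this degenerates into the "leave-one-out" variant mentioned in the FAQ excerpt.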

In SpamAssassin, we generally use k=10 – as pretty much everyone else does anyway (wink) – and we use 10FCV to test:

  • new tweaks to the "Bayesian" learning classifier (the BAYES_* rules)
  • new tweaks to the rescoring system (which is also a learning classifier, just at a higher level).

Traditionally, k-fold cross-validation uses a "train on k-1 folds, test on 1 fold" scheme; we use that for testing our rescoring system. For the BAYES rules, however, we invert it: "train on 1 fold, test on k-1 folds". That classifier is very accurate once sufficiently trained, so with the traditional split it can be hard to get a meaningful number of false positives and false negatives to distinguish improvements in accuracy.
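The inverted split is the same loop with the roles of the folds swapped. Again a minimal sketch with hypothetical `train` and `error` stand-ins, not the actual BAYES test harness:

```python
def inverted_cross_validate(data, k, train, error):
    """Train on a single fold, test on the remaining k-1 folds, so each
    round scores against a much larger held-out set."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        training = folds[i]  # train on just 1 fold
        # Test on everything else, giving k-1 folds' worth of messages
        # in which to observe false positives and false negatives.
        held_out = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(training)
        scores.append(error(model, held_out))
    return sum(scores) / k
```

Because each round tests on roughly (k-1)/k of the corpus instead of 1/k, even a rarely-erring classifier produces enough misclassifications per round for accuracy comparisons to be meaningful.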

See RescoreTenFcv for a log of a sample 10-fold CV run against two SpamAssassin rescoring systems (the GA and the perceptron).
