10-Fold Cross Validation

This is a log of what I did to run a 10-fold cross-validation test of the perceptron vs the GA when testing bug 2910 ( http://bugzilla.spamassassin.org/show_bug.cgi?id=2910 ) – JustinMason 21/01/04

[check it out:]

No Format

svn co https://svn.apache.org/repos/asf/incubator/spamassassin/trunk
cd trunk
perl Makefile.PL 
make
cd masses

[also get pgapack and install as "masses/pgapack". I just scp'd in an already-built tree I had here.]

[and use the set-0 logs from the 2.60 GA run -- taken from the rsync repository:]

No Format

wc -l /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
 210442 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log
 354479 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log

[we want about 2k in each bucket, otherwise it'll take weeks to complete. use split-logs-into-buckets to juggle the log files in blocks of 10% to get the ratio and size to around 2k:2k.]
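As a rough check on the sizes involved, the sketch below works out how many log lines end up in each bucket when a given percentage of a log is kept and then split evenly into 10 buckets. `per_bucket` is a hypothetical helper written for illustration only; it is not part of the masses tools and does not call split-logs-into-buckets. The line counts are the ones reported by `wc -l` earlier in this log.

```shell
#!/bin/sh
# per_bucket TOTAL_LINES PERCENT_KEPT [BUCKETS]
# Hypothetical helper (not part of masses/): prints roughly how many log
# lines land in each bucket if PERCENT_KEPT percent of the log is kept and
# split evenly into BUCKETS buckets (default 10). Integer arithmetic, so
# results round down.
per_bucket() {
    total=$1
    percent=$2
    buckets=${3:-10}
    echo $(( total * percent / 100 / buckets ))
}

# Line counts taken from the wc -l output in this log.
per_bucket 210442 10   # ham, keeping 10%:  prints 2104
per_bucket 354479 10   # spam, keeping 10%: prints 3544
```

Under these assumptions, 10% of the ham log already gives about 2k per bucket, while 10% of the spam log is still well above that, so the spam side needs further trimming to reach a ratio near 2k:2k.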

...

mv split-*.log ../../logs/spam-jm/
}}}

[and doublecheck the log sizes:]

...

[looks fine. now run the 10pass master script.]

No Format

nohup sh -x ./tenpass/10pass-run &

Results will appear in "tenpass_results" – over the course of 4 days. ;)
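For orientation, a 10-fold cross-validation normally holds out one bucket for testing in each pass and trains on the other nine. The sketch below only prints that plan; `fold_plan` is a hypothetical name used for illustration, and nothing here is taken from the actual 10pass-run script, which may organise its passes differently.

```shell
#!/bin/sh
# fold_plan FOLD [BUCKETS]
# Hypothetical helper: for one pass of a k-fold run, prints which bucket is
# held out for testing and which buckets are used for training.
fold_plan() {
    held_out=$1
    buckets=${2:-10}
    train=""
    i=1
    while [ "$i" -le "$buckets" ]; do
        if [ "$i" -ne "$held_out" ]; then
            # Append bucket number, space-separated.
            train="${train:+$train }$i"
        fi
        i=$(( i + 1 ))
    done
    echo "test=$held_out train=$train"
}

# Print the plan for all ten passes.
n=1
while [ "$n" -le 10 ]; do
    fold_plan "$n"
    n=$(( n + 1 ))
done
```

Each bucket is tested exactly once, which is why the bucket sizes matter so much for the total running time.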

...

No Format
  make clean >> make.output
  make -C perceptron_c clean >> make.output
  make tmp/tests.h >> make.output 2>&1
  rm -rf perceptron_c/tmp; cp -r tmp perceptron_c/tmp
  make -C perceptron_c >> make.output
  ( cd perceptron_c ; ./perceptron )
  pwd; date

Change

No Format

  cp craig-evolve.scores tenpass_results/scores.$id

to

No Format

  cp perceptron_c/perceptron.scores tenpass_results/scores.$id

and run ./10pass-run-perceptron. This one runs quicker ;)
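If you would rather script the scores-line change than edit it by hand, something along these lines should work. It is a sketch only: `swap_scores_line` is a hypothetical helper that filters stdin to stdout, and the file names in the usage note are assumptions about where your copy of the script lives.

```shell
#!/bin/sh
# swap_scores_line
# Hypothetical helper: reads a script on stdin and writes it to stdout with
# the GA scores file replaced by the perceptron scores file in the "cp" line.
# All other lines pass through unchanged.
swap_scores_line() {
    sed 's#cp craig-evolve\.scores #cp perceptron_c/perceptron.scores #'
}

# Demonstration on the single line quoted in this log.
echo '  cp craig-evolve.scores tenpass_results/scores.$id' | swap_scores_line
```

For example, `swap_scores_line < 10pass-run-perceptron > 10pass-run-perceptron.new` would write the edited script to a new file, which you can compare against the original before replacing it.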