You are viewing an old version of this page. View the current version.

Compare with Current View Page History

Version 1 Next »

10-Fold Cross Validation

[check it out:] svn co https://svn.apache.org/repos/asf/incubator/spamassassin/trunk cd trunk perl Makefile.PL make
cd masses

[also get pgapack and install as "masses/pgapack". I just scp'd in an already-built tree I had here.]

[and use the set-0 logs from the 2.60 GA run – taken from the rsync repository:]

wc -l /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log /home/cor pus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
210442 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log
354479 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log

[we want about 2k in each bucket, otherwise it'll take weeks to complete. use split-logs-into-buckets to juggle the log files in blocks of 10% to get the ratio and size to around 2k:2k.]

[ham buckets first:]

./tenpass/split-log-into-buckets 10 < /home/corpus-rsync/corpus/Obsolete/submit-2.6
0-GA-run1/ham-set0.log
mv split-1.log new
./tenpass/split-log-into-buckets 10 < new
wc -l split-1.log
   2104 split-1.log

[much better!]

mv split-*.log ../../logs/nonspam-jm/

./tenpass/split-log-into-buckets 10 < /home/corpus-rsync/corpus/Obsolete/submit-2.6
0-GA-run1/spam-set0.log
mv split-1.log new
wc -l new
  35437 new

[given this, we want 6 of the 10 logfiles to make 21264 lines, which would result in a roughly even ham:spam ratio for testing. let's do that.]

cat split-{1,2,3,4,5,6}.log > new
./tenpass/split-log-into-buckets 10 < new
wc -l split-1.log
   2126 split-1.log

[perfect!]

mv split-*.log ../../logs/spam-jm/

[and doublecheck the log sizes:]

wc -l ../../logs/*/*.log
   2104 ../../logs/nonspam-jm/split-1.log
   2103 ../../logs/nonspam-jm/split-10.log
   2106 ../../logs/nonspam-jm/split-2.log
   2103 ../../logs/nonspam-jm/split-3.log
   2102 ../../logs/nonspam-jm/split-4.log
   2105 ../../logs/nonspam-jm/split-5.log
   2102 ../../logs/nonspam-jm/split-6.log
   2103 ../../logs/nonspam-jm/split-7.log
   2103 ../../logs/nonspam-jm/split-8.log
   2104 ../../logs/nonspam-jm/split-9.log
   2126 ../../logs/spam-jm/split-1.log
   2127 ../../logs/spam-jm/split-10.log
   2126 ../../logs/spam-jm/split-2.log
   2126 ../../logs/spam-jm/split-3.log
   2128 ../../logs/spam-jm/split-4.log
   2126 ../../logs/spam-jm/split-5.log
   2126 ../../logs/spam-jm/split-6.log
   2126 ../../logs/spam-jm/split-7.log
   2126 ../../logs/spam-jm/split-8.log
   2125 ../../logs/spam-jm/split-9.log
  42297 total

[looks fine. now run the 10pass master script.]

nohup sh -x ./tenpass/10pass-run &

Results will appear in "tenpass_results" – over the course of 4 days. (wink)

THE PERCEPTRON:

copied ./tenpass/10pass-run to ./10pass-run-perceptron .

Changed these lines:

  make clean >> make.output
  make >> make.output 2>&1
  ./evolve
  pwd; date

to

  make clean >> make.output
  make -C perceptron_c clean >> make.output
  make tmp/tests.h >> make.output 2>&1
  rm -rf perceptron_c/tmp; cp -r tmp perceptron_c/tmp
  make -C perceptron_c >> make.output
  ( cd perceptron_c ; ./perceptron )
  pwd; date

Change

cp craig-evolve.scores tenpass_results/scores.$id

to

cp perceptron_c/perceptron.scores tenpass_results/scores.$id

and run ./10pass-run-perceptron . This one runs quicker (wink)

  • No labels