...
This is a log of what I did to run a 10-fold cross-validation test of the perceptron vs the GA when testing bug 2910 ( http://bugzilla.spamassassin.org/show_bug.cgi?id=2910 ) – JustinMason 21/01/04
First, I checked out the source:
{{{
svn co https://svn.apache.org/repos/asf/incubator/spamassassin/trunk
cd trunk
perl Makefile.PL
make
cd masses
}}}
...
\[also get pgapack and install as "masses/pgapack". I just scp'd in an already-built tree I had here.\]
\[and use the set-0 logs from the 2.60 GA run, taken from the rsync repository:\]
{{{
wc -l /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
 210442 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log
 354479 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
}}}
\[we want about 2k in each bucket, otherwise it'll take weeks to complete. use split-log-into-buckets to juggle the log files in blocks of 10% to get the ratio and size to around 2k:2k.\]
\[ham buckets first:\]
{{{
./tenpass/split-log-into-buckets 10 < /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log
mv split-1.log new
./tenpass/split-log-into-buckets 10 < new
wc -l split-1.log
   2104 split-1.log
}}}
\[much better!\]
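\[to illustrate the bucket idea: a minimal stand-in, NOT the real split-log-into-buckets. it assumes a plain round-robin deal of stdin into split-1.log .. split-N.log; the function name split_round_robin is made up for this sketch. the bucket sizes logged on this page differ by a few lines, which strict round-robin would not produce, so the real script presumably assigns lines some other way.\]

```shell
# Hypothetical stand-in for ./tenpass/split-log-into-buckets (not the real script).
# Reads a log on stdin and deals its lines round-robin into
# split-1.log .. split-N.log in the current directory.
split_round_robin() {
    n="$1"
    awk -v n="$n" '{
        bucket = (NR - 1) % n + 1      # 1-based bucket index for this line
        out = "split-" bucket ".log"
        print > out                    # awk keeps each file open for the whole run
    }'
}

# Usage, mirroring the commands above:
#   split_round_robin 10 < new
```

\[two passes of a 10-way split, keeping one bucket each time, leave about 1% of the original log per bucket, which is how 210442 ham lines come down to roughly 2.1k.\]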
{{{
mv split-*.log ../../logs/nonspam-jm/
./tenpass/split-log-into-buckets 10 < /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
mv split-1.log new
wc -l new
  35437 new
}}}
...
\[given this, a further 10-way split of that file gives about 3544 lines per logfile, so we want 6 of the 10 logfiles to make 21264 lines, which would result in a roughly even ham:spam ratio for testing. let's do that.\]
{{{
cat split-{1,2,3,4,5,6}.log > new
./tenpass/split-log-into-buckets 10 < new
wc -l split-1.log
   2126 split-1.log
}}}
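\[the choice of 6 out of 10 can be checked with a little shell arithmetic. this is only a sketch: tenths_to_keep is a made-up helper, and the ham total of 21035 is the sum of the ten nonspam-jm bucket sizes in the wc listing on this page.\]

```shell
# Hypothetical helper: how many of the 10 spam sub-files to concatenate
# so that the spam total lands closest to the ham total.
#   $1 = ham lines to match
#   $2 = spam lines available before the 10-way split
tenths_to_keep() {
    ham="$1"
    spam="$2"
    # integer arithmetic for round(ham / (spam / 10))
    echo $(( (ham * 10 + spam / 2) / spam ))
}

tenths_to_keep 21035 35437    # prints 6
```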
...
\[perfect!\]
{{{
mv split-*.log ../../logs/spam-jm/
}}}
\[and double-check the log sizes:\]
{{{
wc -l ../../logs/*/*.log
   2104 ../../logs/nonspam-jm/split-1.log
   2103 ../../logs/nonspam-jm/split-10.log
   2106 ../../logs/nonspam-jm/split-2.log
   2103 ../../logs/nonspam-jm/split-3.log
   2102 ../../logs/nonspam-jm/split-4.log
   2105 ../../logs/nonspam-jm/split-5.log
   2102 ../../logs/nonspam-jm/split-6.log
   2103 ../../logs/nonspam-jm/split-7.log
   2103 ../../logs/nonspam-jm/split-8.log
   2104 ../../logs/nonspam-jm/split-9.log
   2126 ../../logs/spam-jm/split-1.log
   2127 ../../logs/spam-jm/split-10.log
   2126 ../../logs/spam-jm/split-2.log
   2126 ../../logs/spam-jm/split-3.log
   2128 ../../logs/spam-jm/split-4.log
   2126 ../../logs/spam-jm/split-5.log
   2126 ../../logs/spam-jm/split-6.log
   2126 ../../logs/spam-jm/split-7.log
   2126 ../../logs/spam-jm/split-8.log
   2125 ../../logs/spam-jm/split-9.log
  42297 total
}}}
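\[the same check can be scripted rather than eyeballed. a minimal sketch, assuming the directory layout used above; count_log_lines and report_balance are made-up helper names, not part of the masses tools.\]

```shell
# Hypothetical helper: total number of lines across all *.log files in a directory.
count_log_lines() {
    cat "$1"/*.log | wc -l | tr -d ' '
}

# Hypothetical helper: print the ham and spam totals and the spam share in percent
# (integer division, so the percentage is rounded down).
report_balance() {
    ham=$(count_log_lines "$1")
    spam=$(count_log_lines "$2")
    echo "ham=$ham spam=$spam spam_pct=$(( spam * 100 / (ham + spam) ))"
}

# Usage:
#   report_balance ../../logs/nonspam-jm ../../logs/spam-jm
```

\[with the sizes listed above, the totals are 21035 ham and 21262 spam, so the spam share comes out at 50%.\]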
\[looks fine. now run the 10pass master script.\]
{{{
nohup sh -x ./tenpass/10pass-run &
}}}
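\[for reference, the general shape of a 10-fold run: each bucket is held out once for testing while the other nine are used for training. the sketch below only prints that layout. it is an illustration of the technique, not a transcript of what 10pass-run does, and fold_layout is a made-up name.\]

```shell
# Sketch of a 10-fold layout (an assumption about the general technique,
# not taken from ./tenpass/10pass-run).
# For fold $1, print the held-out test bucket and the nine training buckets.
fold_layout() {
    test_bucket="$1"
    train=""
    for b in 1 2 3 4 5 6 7 8 9 10; do
        [ "$b" -eq "$test_bucket" ] && continue   # skip the held-out bucket
        train="$train split-$b.log"
    done
    echo "test: split-$test_bucket.log"
    echo "train:$train"
}

# Usage: show the layout for every fold
#   for i in 1 2 3 4 5 6 7 8 9 10; do fold_layout "$i"; done
```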
...