
This is a log of what I did to run a 10-fold cross-validation test of the perceptron vs the GA when testing bug 2910 ( http://bugzilla.spamassassin.org/show_bug.cgi?id=2910 ) – JustinMason 21/01/04

First, I checked out the source:

No Format
svn co https://svn.apache.org/repos/asf/incubator/spamassassin/trunk
cd trunk
perl Makefile.PL 
make
cd masses

Also get pgapack and install it as "masses/pgapack". I just scp'd in an already-built tree I had here.

Then use the set-0 logs from the 2.60 GA run, taken from the rsync repository:

No Format
wc -l /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
 210442 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log
 354479 /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log

We want about 2k lines in each bucket, otherwise the run will take weeks to complete. Use split-log-into-buckets to juggle the log files in blocks of 10% until the ratio and size are around 2k:2k.

Ham buckets first:

No Format
./tenpass/split-log-into-buckets 10 < /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/ham-set0.log
mv split-1.log new
./tenpass/split-log-into-buckets 10 < new
wc -l split-1.log
   2104 split-1.log

Much better!
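
The source of split-log-into-buckets is not shown in this log, so the following is only a hypothetical stand-in that illustrates the idea: deal the lines of a log read from stdin into N files, round-robin. The function name `split_round_robin` and the demo file names are assumptions, and the real script may assign lines to buckets differently.

```shell
#!/bin/sh
# Hypothetical stand-in for ./tenpass/split-log-into-buckets (not the real script):
# read a log on stdin and deal its lines round-robin into N files named
# <prefix>-1.log .. <prefix>-N.log.
split_round_robin() {
    n=$1
    prefix=$2
    awk -v n="$n" -v prefix="$prefix" '{
        out = prefix "-" ((NR - 1) % n + 1) ".log"
        print > out
    }'
}

# Demo on a throwaway 25-line file in a temporary directory.
demo_dir=$(mktemp -d)
seq 1 25 > "$demo_dir/demo.log"
split_round_robin 10 "$demo_dir/split" < "$demo_dir/demo.log"
wc -l "$demo_dir"/split-*.log
```

Running the split twice, each time on one of the resulting files, is what brings a log down to roughly 1% of its original size.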

No Format
mv split-*.log ../../logs/nonspam-jm/

Then the spam buckets:

No Format
./tenpass/split-log-into-buckets 10 < /home/corpus-rsync/corpus/Obsolete/submit-2.60-GA-run1/spam-set0.log
mv split-1.log new
wc -l new
  35437 new

Given this, after splitting "new" into 10 again (pieces of about 3544 lines each), we want 6 of the 10 log files, making about 21264 lines, which would result in a roughly even ham:spam ratio for testing. Let's do that.
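
The arithmetic behind picking 6 of the 10 pieces can be sketched as below. It assumes the 35437-line file is first split into 10 pieces of about 3544 lines each, and that the target is the ham total of roughly 10 x 2104 lines; the variable names are only for illustration.

```shell
#!/bin/sh
# Numbers taken from the wc -l output shown earlier in this log.
ham_bucket=2104                      # lines in one final ham bucket
ham_lines=$((ham_bucket * 10))       # approximate ham total across 10 buckets
spam_subset=35437                    # lines in the spam file "new"

# Size of one piece after a 10-way split, rounded up.
piece=$(( (spam_subset + 9) / 10 ))

# Number of pieces whose combined size is closest to the ham total.
pieces=$(( (ham_lines + piece / 2) / piece ))

kept=$((pieces * piece))             # spam lines kept
per_bucket=$((kept / 10))            # lines per final spam bucket

echo "piece=$piece pieces=$pieces kept=$kept per_bucket=$per_bucket"
# prints: piece=3544 pieces=6 kept=21264 per_bucket=2126
```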

No Format
cat split-{1,2,3,4,5,6}.log > new
./tenpass/split-log-into-buckets 10 < new
wc -l split-1.log
   2126 split-1.log

Perfect!

No Format
mv split-*.log ../../logs/spam-jm/

And double-check the log sizes:

No Format
wc -l ../../logs/*/*.log
   2104 ../../logs/nonspam-jm/split-1.log
   2103 ../../logs/nonspam-jm/split-10.log
   2106 ../../logs/nonspam-jm/split-2.log
   2103 ../../logs/nonspam-jm/split-3.log
   2102 ../../logs/nonspam-jm/split-4.log
   2105 ../../logs/nonspam-jm/split-5.log
   2102 ../../logs/nonspam-jm/split-6.log
   2103 ../../logs/nonspam-jm/split-7.log
   2103 ../../logs/nonspam-jm/split-8.log
   2104 ../../logs/nonspam-jm/split-9.log
   2126 ../../logs/spam-jm/split-1.log
   2127 ../../logs/spam-jm/split-10.log
   2126 ../../logs/spam-jm/split-2.log
   2126 ../../logs/spam-jm/split-3.log
   2128 ../../logs/spam-jm/split-4.log
   2126 ../../logs/spam-jm/split-5.log
   2126 ../../logs/spam-jm/split-6.log
   2126 ../../logs/spam-jm/split-7.log
   2126 ../../logs/spam-jm/split-8.log
   2125 ../../logs/spam-jm/split-9.log
  42297 total
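
To check the ham:spam balance without adding up the lines by hand, something like the following could be used. It is a minimal sketch that assumes the directory layout used above (nonspam-jm and spam-jm under one logs directory); the function names are made up for this example.

```shell
#!/bin/sh
# Count all lines in the *.log files of one directory.
count_lines() {
    cat "$1"/*.log | wc -l
}

# Print per-class totals and the ham:spam ratio for a logs directory
# that contains nonspam-jm/ and spam-jm/.
report_ratio() {
    logs=$1
    ham=$(count_lines "$logs/nonspam-jm")
    spam=$(count_lines "$logs/spam-jm")
    # awk handles the floating-point division.
    awk -v h="$ham" -v s="$spam" \
        'BEGIN { printf "ham=%d spam=%d ham:spam=%.3f\n", h, s, h / s }'
}

# Run against the real logs only if they are present.
if [ -d ../../logs/nonspam-jm ] && [ -d ../../logs/spam-jm ]; then
    report_ratio ../../logs
fi
```

With the counts listed above, the per-class totals are 21035 ham lines and 21262 spam lines, which is close to the 1:1 ratio we were aiming for.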

Looks fine. Now run the 10pass master script:

No Format
nohup sh -x ./tenpass/10pass-run &
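
Since the run takes a long time, it may be handy to send the `sh -x` trace to a named log file instead of the default nohup.out, so progress can be checked later. The sketch below uses a placeholder command in place of ./tenpass/10pass-run, and the log file location is an assumption.

```shell
#!/bin/sh
# Placeholder job standing in for ./tenpass/10pass-run.
run_log=$(mktemp)

# Send both stdout and the sh -x trace (stderr) to the log file.
nohup sh -x -c 'echo "pass finished"' > "$run_log" 2>&1 &
job_pid=$!

# For the real run you would log out and come back later; here we just wait.
wait "$job_pid"

# Show the last few lines of the log.
tail -n 5 "$run_log"
```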
