Running the GA to generate scores
As used in the RescoreMassCheck process.
Firstly, check that the rules and logs are both relatively clean and ready to use.
Copy/link the full source logs to "ham-full.log" and "spam-full.log" in the masses directory. Then:
cd masses
make clean
rm -rf ORIG NSBASE SPBASE ham-validate.log spam-validate.log ham.log spam.log
svn revert ../rules/50_scores.cf
ln -s ham-full.log ham.log
ln -s spam-full.log spam.log
make freqs SCORESET=3
less freqs
Go through the HitFrequencies report in freqs and check:
- ALL_TRUSTED hitrate on spam. This should appear only in ham.
- unfamiliar rules with high ham hitrates; they could be easily forgeable. Comment them out or mark them "tflags nopublish".
- NO_RECEIVED hitrate in spam.
- NO_RELAYS hitrate in spam.
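As a quick aid for the checks above, the following sketch pulls the rows for the three named rules out of the freqs report. It assumes only that each rule name appears as a whitespace-delimited word on its row; the function name `freqs_sanity_rows` is hypothetical and not part of the masses tools.

```shell
# Print the freqs rows for ALL_TRUSTED, NO_RECEIVED and NO_RELAYS.
# Assumption: each rule name appears as a separate word on its row.
freqs_sanity_rows() {
    # -w matches whole words only, so T_NO_RELAYS does not match NO_RELAYS
    grep -wE 'ALL_TRUSTED|NO_RECEIVED|NO_RELAYS' "$1"
}

# Usage:
#   freqs_sanity_rows freqs
```

The spam hitrate on each printed row still has to be read and judged by eye; the sketch only saves paging through the full report.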
Save a copy of freqs, then generate ranges:
cp freqs freqs.full
make > make.out 2>&1
less tmp/ranges.data
Examine tmp/ranges.data and check:
- ranges that are 0.000 0.000 0 for no obvious reason;
- rules named with a "T_" prefix. These can sometimes slip through if used in promoted meta rules. They should be fixed so that they do not include a "T_" prefix in the rulesrc source file. (That should be the only way that a T_ rule can appear in the output; "real" sandbox T_ rules should be removed already, since you deleted the sandbox rule file.)
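Both checks can be roughly automated. The sketch below assumes that each row of tmp/ranges.data is whitespace-separated with the rule name as the last field, which matches the "0.000 0.000 0" pattern quoted above but is otherwise an assumption about the file layout. The function names are hypothetical.

```shell
# Assumption: rows look like "LOW HIGH MUTABLE RULENAME", whitespace-separated.

# List rules whose range is exactly "0.000 0.000 0".
zero_ranges() {
    awk '$1 == "0.000" && $2 == "0.000" && $3 == "0" { print $NF }' "$1"
}

# List rules whose name starts with "T_".
t_prefixed_rules() {
    awk '$NF ~ /^T_/ { print $NF }' "$1"
}

# Usage:
#   zero_ranges tmp/ranges.data
#   t_prefixed_rules tmp/ranges.data
```

A zero range is not necessarily wrong, so the first list is a starting point for review rather than a list of errors.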
To prepare your environment for running the rescorer:
rm -rf ORIG NSBASE SPBASE ham-validate.log spam-validate.log ham.log spam.log
mkdir ORIG
for CLASS in ham spam ; do
  ln $CLASS-full.log ORIG/$CLASS.log
  for I in 0 1 2 3 ; do
    ln -s $CLASS.log ORIG/$CLASS-set$I.log
  done
done
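Before starting a long run, it may be worth confirming that every link created above resolves to a real file. This is a minimal sketch; `check_orig_logs` is a hypothetical helper, not part of the masses directory.

```shell
# Report any expected log under ORIG that is missing or a dangling symlink.
# Returns 0 if all ten files resolve, 1 otherwise.
check_orig_logs() {
    dir=${1:-ORIG}
    missing=0
    for CLASS in ham spam ; do
        for NAME in "$CLASS" "$CLASS-set0" "$CLASS-set1" "$CLASS-set2" "$CLASS-set3" ; do
            # -e follows symlinks, so a dangling link counts as missing
            if [ ! -e "$dir/$NAME.log" ] ; then
                echo "missing: $dir/$NAME.log"
                missing=1
            fi
        done
    done
    return $missing
}

# Usage:
#   check_orig_logs ORIG && echo "ORIG looks complete"
```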
Score generation
Copy a config file from "config.set0"/"set1"/"set2"/"set3" to "config", and execute the runGA script. runGA generates and uses a randomly selected corpus with 90% being used for training and 10% being used for testing.
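To make the 90%/10% idea concrete, here is an illustration of a random split of one log file. It is not how runGA is implemented; it treats every line as one entry, does not treat any header lines specially, and relies on `shuf` from GNU coreutils. The function name `split_90_10` is hypothetical.

```shell
# Illustration only: NOT the runGA implementation.
# split_90_10 INPUT TRAIN_OUT TEST_OUT
split_90_10() {
    total=$(wc -l < "$1")
    train=$(( total * 9 / 10 ))
    shuffled=$(mktemp)
    # shuf (GNU coreutils) writes the input lines in random order
    shuf "$1" > "$shuffled"
    # first 90% of the shuffled lines for training, the rest for testing
    head -n "$train" "$shuffled" > "$2"
    tail -n "+$(( train + 1 ))" "$shuffled" > "$3"
    rm -f "$shuffled"
}

# Usage (output file names are hypothetical):
#   split_90_10 ham-full.log ham-train.log ham-test.log
```

Keeping the test portion disjoint from the training portion is what makes the later train/test comparison meaningful.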
You need to ensure an up-to-date version of perl is used. On the zone, this is /local/perl586.
export PATH=/local/perl586/bin:$PATH
nohup bash runGA &
tail -f nohup.out
Monitor progress. Once the GA has compiled and started running, if the FP%/FN% rates are too poor, it may be worth CTRL-C'ing the runGA process and running a new one "by hand" with different switches:
./garescorer -b 5.0 -s 100 -t 5.0
If you do this, though, you will have to cut and paste the post-GA commands (in the "POST-GA COMMANDS" section of runGA) by hand!
Once the GA run is complete, and you're happy with the accuracy: You will find your results in a directory of the form "gen-$NAME-$HAM_PREFERENCE-$THRESHOLD-$EPOCHS-$NOTE-ga".
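If several runs have accumulated, a small helper can pick out the most recently modified results directory. This sketch assumes only the "gen-...-ga" naming form given above and that directory names contain no newlines; `latest_gen_dir` is a hypothetical name.

```shell
# Print the most recently modified gen-*-ga directory in the current
# directory, or nothing if there is none.
latest_gen_dir() {
    # -d lists the directories themselves, -t sorts newest first
    ls -dt gen-*-ga 2>/dev/null | head -n 1
}

# Usage:
#   GEN=$(latest_gen_dir)
#   ls "$GEN"
```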
Compare the FP%/FN% rates listed in gen-*/test with those in gen-*/scores. gen-*/scores is the output from the perceptron, and should match the gen-*/test output (which is computed on a separate subset of the mail messages) to within a few tenths of a percent. This checks:
- that the mail messages are diverse enough to avoid overfitting (hence the different test and train sets)
- that the FP%/FN% computations are not losing precision due to C-vs-Perl floating-point bugs, or a differing idea of what rules are promoted vs not promoted between the C and Perl code.
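The comparison can be sketched as a small script. It assumes that both files contain summary lines of the form "# False positives: COUNT PERCENT%" and "# False negatives: COUNT PERCENT%", with the percentage as the last field; check the actual files before relying on it. The function names and the 0.3 tolerance are hypothetical choices for illustration.

```shell
# Assumption: both files contain lines such as
#   "# False positives:    14  0.05%"
# with the percentage as the last field on the line.

# rate LABEL FILE: print the percentage (without "%") from the first matching line.
rate() {
    awk -v label="$1" 'index($0, label) { v = $NF; sub(/%$/, "", v); print v; exit }' "$2"
}

# rates_agree LABEL FILE_A FILE_B TOLERANCE
# Returns 0 if the two percentages differ by at most TOLERANCE percentage
# points, 1 if they differ by more, 2 if the label was not found in a file.
rates_agree() {
    a=$(rate "$1" "$2")
    b=$(rate "$1" "$3")
    [ -n "$a" ] && [ -n "$b" ] || return 2
    awk -v a="$a" -v b="$b" -v tol="$4" \
        'BEGIN { d = a - b; if (d < 0) d = -d; exit !(d <= tol) }'
}

# Usage, with GEN set to the results directory:
#   rates_agree "False positives" "$GEN/test" "$GEN/scores" 0.3 || echo "FP% differs"
#   rates_agree "False negatives" "$GEN/test" "$GEN/scores" 0.3 || echo "FN% differs"
```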
Once you're satisfied, check in ../rules/50_scores.cf. Copy the "config" file back to "config.setN" where "N" is the current scoreset, and check that in. Then, add a comment to the rescoring bugzilla bug, noting:
- the "gen-*/test" file contents, with FP%/FN% rate
- the "gen-*" path for later reference
Next, carry on with the other steps from RescoreMassCheck (if that's what you're doing).
PGAPack
To get garescorer to build with the above "make > make.out 2>&1" command on an Ubuntu Maverick machine, I installed the libpgapack-serial1 package, and ran:
mkdir -p /local/pgapack-1.0.0.1/lib
ln -s /usr/lib /local/pgapack-1.0.0.1/lib/sun4
mkdir -p /local/pgapack-1.0.0.1
ln -s /usr/include/pgapack-serial /local/pgapack-1.0.0.1/include
If garescorer fails to build, the first symptom you are likely to see is the error:
time: cannot run ./garescorer: No such file or directory
To take advantage of multiple CPU cores, use pgapack-mpi (which appears to be broken on Ubuntu), and run it as:
mpirun -np 4 ./garescorer -b 10 -e 5500 -t 5.0
Replace "4" with your number of CPU cores. Note, however, that this appears to cause redundant processing rather than distributing the load.
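Rather than hard-coding the core count, it can be looked up at run time. This sketch assumes `nproc` (GNU coreutils) is available and falls back to `getconf _NPROCESSORS_ONLN` where it is not; `core_count` is a hypothetical helper name.

```shell
# Print the number of available CPU cores.
core_count() {
    # nproc is GNU coreutils; getconf is the fallback on systems without it
    nproc 2>/dev/null || getconf _NPROCESSORS_ONLN
}

# Usage (same garescorer switches as in the mpirun example above):
#   mpirun -np "$(core_count)" ./garescorer -b 10 -e 5500 -t 5.0
```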