Running the GA to generate scores
As used in the RescoreMassCheck process.
Firstly, check that the rules and logs are both relatively clean and ready to use.
Copy/link the full source logs to "ham-full.log" and "spam-full.log" in the masses directory. Then:
Go through the HitFrequencies report in freqs and check:
- ALL_TRUSTED hitrate on spam. This rule should hit only ham.
- unfamiliar rules with high ham hitrates; they could be easily forgeable. Comment them out or mark them "tflags nopublish".
- NO_RECEIVED hitrate in spam.
- NO_RELAYS hitrate in spam.
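The checklist above can be partly automated. The following is only a rough sketch: it assumes you have already extracted (rule name, spam %, ham %) tuples from freqs by some means, and the function name `review_freqs`, the `familiar` set and the 10% ham threshold are all hypothetical choices made for illustration, not part of the masses tools.

```python
# Rules whose spam hitrate the checklist asks you to look at.
SPAM_WATCHLIST = ("ALL_TRUSTED", "NO_RECEIVED", "NO_RELAYS")

def review_freqs(rows, familiar=frozenset(), ham_threshold=10.0):
    """Return warning strings for (rule_name, spam_pct, ham_pct) tuples.

    familiar: rule names you already know and trust (assumption).
    ham_threshold: what counts as a "high" ham hitrate (assumption).
    """
    warnings = []
    for name, spam_pct, ham_pct in rows:
        if name in SPAM_WATCHLIST:
            # Any spam hits on these rules deserve a manual look.
            if spam_pct > 0.0:
                warnings.append(f"{name}: hits {spam_pct:.3f}% of spam")
        elif name not in familiar and ham_pct >= ham_threshold:
            # Unfamiliar rule with a high ham hitrate: possibly forgeable.
            warnings.append(f"{name}: ham hitrate {ham_pct:.3f}%, check if forgeable")
    return warnings
```

Anything this reports is only a prompt for manual review; whether a rule is really forgeable still needs a human decision.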
Save a copy of freqs, then generate ranges:
Examine tmp/ranges.data and check:
- ranges that are 0.000 0.000 0 for no obvious reason;
- rules named with a "T_" prefix. These can slip through when they are used in promoted meta rules, and should be fixed in the rulesrc source file so that they no longer carry the "T_" prefix. (That should be the only way a T_ rule can appear in the output; "real" sandbox T_ rules should already be gone, since you deleted the sandbox rule file.)
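Both checks can be scripted. The sketch below assumes each line of tmp/ranges.data has the whitespace-separated form "low high mutable RULE_NAME", which is inferred from the "0.000 0.000 0" pattern mentioned above rather than from a format specification; `check_ranges` is a hypothetical helper name.

```python
def check_ranges(lines):
    """Return (zero_ranges, t_rules) as lists of rule names.

    Assumes lines look like: "<low> <high> <mutable> <RULE_NAME>".
    Lines that do not fit that shape are skipped.
    """
    zero_ranges, t_rules = [], []
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or fields[0].startswith("#"):
            continue
        low, high, mutable, name = fields[:4]
        try:
            is_zero = float(low) == 0.0 and float(high) == 0.0 and mutable == "0"
        except ValueError:
            continue  # not a numeric range line
        if is_zero:
            zero_ranges.append(name)
        if name.startswith("T_"):
            t_rules.append(name)
    return zero_ranges, t_rules
```

It could be run as `check_ranges(open("tmp/ranges.data"))`. A zero range is not always an error, so each reported name still needs a look to decide whether there is an obvious reason for it.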
To prepare your environment for running the rescorer:
Copy one of the config files "config.set0", "config.set1", "config.set2" or "config.set3" (the one for the scoreset you are generating) to "config", and execute the runGA script. runGA generates and uses a randomly selected corpus, with 90% used for training and 10% for testing.
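To illustrate what a random 90%/10% training/testing split means, here is a minimal sketch. It is not how runGA performs the split; `split_corpus` and its parameters are hypothetical names used only for this example.

```python
import random

def split_corpus(lines, train_fraction=0.9, seed=None):
    """Shuffle log lines and split them into (train, test) lists."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    shuffled = list(lines)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Keeping the test messages out of the training set is what makes the later comparison between training and test accuracy meaningful.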
You need to ensure that an up-to-date version of Perl is used. On the zone, this is /local/perl586.
Monitor progress. Once the GA has compiled and started running, if the FP%/FN% rates are too poor, it may be worth pressing CTRL-C to stop the runGA process and running a new one "by hand" with different switches:
If you do this, though, you will have to cut and paste the post-GA commands (in the "POST-GA COMMANDS" section of runGA) by hand!
Once the GA run is complete, and you're happy with the accuracy: You will find your results in a directory of the form "gen-$NAME-$HAM_PREFERENCE-$THRESHOLD-$EPOCHS-$NOTE-ga".
Compare the FP%/FN% rates listed in gen-*/test with those in gen-*/scores. gen-*/scores is the output from the perceptron, and should match the gen-*/test output (which is computed on a separate subset of the mail messages) to within a few tenths of a percent. This checks:
- that the mail messages are diverse enough to avoid overfitting (hence the different test and train sets)
- that the FP%/FN% computations are not losing precision due to C-vs-Perl floating-point bugs, or a differing idea of what rules are promoted vs not promoted between the C and Perl code.
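The comparison can be expressed as a simple tolerance check. This sketch assumes you have read the FP% and FN% figures out of gen-*/scores and gen-*/test yourself; the `rates_agree` name and the 0.3 percentage-point default (one reading of "a few tenths of a percent") are assumptions.

```python
def rates_agree(train_rates, test_rates, tolerance=0.3):
    """Compare (fp_pct, fn_pct) pairs from the training and test outputs.

    Returns True when both FP% and FN% differ by at most `tolerance`
    percentage points.
    """
    fp_diff = abs(train_rates[0] - test_rates[0])
    fn_diff = abs(train_rates[1] - test_rates[1])
    return fp_diff <= tolerance and fn_diff <= tolerance
```

For example, `rates_agree((0.06, 1.5), (0.08, 1.7))` passes, while a full percentage-point gap in FN% would fail and suggests overfitting or a C-vs-Perl discrepancy worth investigating.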
Once you're satisfied, check in ../rules/50_scores.cf. Copy the "config" file back to "config.setN" where "N" is the current scoreset, and check that in. Then, add a comment to the rescoring bugzilla bug, noting:
- the "gen-*/test" file contents, with FP%/FN% rate
- the "gen-*" path for later reference
Next, carry on with the other steps from RescoreMassCheck (if that's what you're doing).
To get garescorer to build with the above "make > make.out 2>&1" command on an Ubuntu Maverick machine, I installed the libpgapack-serial1 package, and ran:
The first symptom you are likely to see of this problem is the error:
To take advantage of multiple CPU cores, use pgapack-mpi (which appears to be broken on Ubuntu), and run it as:
Replace "4" with your number of CPU cores. Note, however, that this appears to cause redundant processing rather than distributing the load.