
Using uploaded corpora with an independent mass-check

The NewUploadedCorporaUser page describes setting up a ruleQA user so that an uploaded corpus is mass-checked using the mass-check client/server (C/S) setup. However, a bug (cause unknown) means that the machine used here does not support C/S mode, so to make use of that resource, some of the uploaded corpora are scanned separately in traditional single-machine, non-distributed mode. Here are the commands used to set up a new uid on that machine, for PMC members.

First, log in to the machine. (You'll probably need to have an account created for you first.)

Set some variables:
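The later commands rely on $BBUSERNAME being set in the current shell. A minimal sketch, assuming a placeholder value (the name "bbmass-example" is hypothetical; substitute the uid you actually intend to create):

```shell
# Hypothetical placeholder; replace with the real uid to be created.
BBUSERNAME="bbmass-example"
export BBUSERNAME

# Confirm the value before running any of the sudo commands.
echo "New uid will be: $BBUSERNAME"
```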


Create a uid:

sudo useradd -c "Nightly mass-check jm" $BBUSERNAME
sudo passwd $BBUSERNAME
[Give the new account a random password. This is needed for cron to work!]
sudo mkdir -p /export/home/$BBUSERNAME
sudo chown $BBUSERNAME /export/home/$BBUSERNAME
sudo -H -u $BBUSERNAME bash
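The random password for the passwd step above can come from any generator. A minimal sketch, assuming head and base64 are available on the machine (NEW_PASSWORD is a hypothetical variable name used only for illustration):

```shell
# Read 18 random bytes and base64-encode them, giving 24 printable characters.
# NEW_PASSWORD is a hypothetical name, not something the setup scripts read.
NEW_PASSWORD=$(head -c 18 /dev/urandom | base64)

# Print it so it can be pasted at the passwd prompt.
echo "$NEW_PASSWORD"
```

The password does not need to be recorded anywhere if the account is only ever reached through sudo.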

You are now running as the new uid. Follow instructions similar to:

cd $HOME
mkdir tmp
echo '' > .forward
svn co svn

Accept (p)ermanently when asked.

cp trunk/masses/rule-qa/corpus.example ~/.corpus
vi ~/.corpus

Use something like this:

opts_weekly="--net -j 8 --reuse --cache --cachedir=/tmp/aicache_nightly --restart=500 ham:detect:/export/home/bbmass/uploadedcorpora/jm/ham/* --after="-15552000" --tail=40000 --scanprob=0.3 spam:detect:/export/ho
opts_nightly="--reuse --cache --cachedir=/tmp/aicache_nightly --restart=500 ham:detect:/export/home/bbmass/uploadedcorpora/jm/ham/* --after="-15552000" --tail=40000 --scanprob=0.3 spam:detect:/export/home/bbmass/

Replace BBUSERNAME with the value of $BBUSERNAME, and RSYNC_PASSWORD with the correct password for that rsync user.
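The --after="-15552000" value in the options above is easier to read as a number of days. A small worked sketch, assuming (as mass-check's usage text describes) that a negative --after value is an offset in seconds from the current time; the variable names are illustrative only:

```shell
# Offset taken from the opts_weekly / opts_nightly lines, without the sign.
AFTER_SECONDS=15552000
SECONDS_PER_DAY=86400

# Integer division: 15552000 / 86400
AFTER_DAYS=$(( AFTER_SECONDS / SECONDS_PER_DAY ))
echo "$AFTER_DAYS days"   # prints "180 days"
```

Under that assumption, only mail received in roughly the last six months is scanned.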

Then, run the mass-check just to see if it works (feel free to CTRL-C once you're happy):

bash $HOME/svn/masses/rule-qa/corpus-nightly

Then set up the cron job using 'EDITOR=vi crontab -e':

0 9 * * * bash svn/masses/rule-qa/corpus-nightly
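The entry above runs at 09:00 every day; cron normally starts jobs in the user's home directory, which is why the relative svn/ path works. As an alternative to editing interactively, the line can be appended from a script. A minimal sketch (with_cron_line is a hypothetical helper name, not part of the rule-qa scripts):

```shell
# Cron line from the text above: minute 0, hour 9, every day.
CRON_LINE='0 9 * * * bash svn/masses/rule-qa/corpus-nightly'

# Read an existing crontab on stdin and print it with CRON_LINE appended,
# unless an identical line is already present.
with_cron_line() {
    existing=$(cat)
    if [ -n "$existing" ]; then
        printf '%s\n' "$existing"
    fi
    if ! printf '%s\n' "$existing" | grep -qxF "$CRON_LINE"; then
        printf '%s\n' "$CRON_LINE"
    fi
}

# To install for the current user (not executed in this sketch):
#   crontab -l 2>/dev/null | with_cron_line | crontab -
```

Running the install pipeline twice leaves a single copy of the entry, since the helper skips the append when the line already exists.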

Hopefully that should do it.
