This document will walk you through using the pipeline in a variety of scenarios. Once you've gained a sense for how the pipeline works, you can consult the [pipeline page](pipeline.html) for a number of other options available in the pipeline.
Download and Setup
Download and install Joshua as described on the Getting Started page.
A basic pipeline run
Warning: the Joshua pipeline is a VLPS (very long Perl script). The script does the job, for the most part, but it is difficult to follow its internal logic, due to its having started as a quick script to get the job done, and having been written in Perl and not carefully software engineered. Plans are in the works for a rewrite, but until then, you'll have to suffer in silence.
Joshua includes a script that can help you with many of the stages of building a translation system. For today's experiments, we'll be using it to build a Spanish–English system using data included in the [Fisher and CALLHOME translation corpus](/data/fisher-callhome-corpus/). This data was collected by translating transcribed speech from previous LDC releases.
Download the data and install it somewhere:
    mkdir ~/joshua-tutorial
    cd ~/joshua-tutorial
    wget --no-check -O fisher-callhome-corpus.zip https://github.com/joshua-decoder/fisher-callhome-corpus/archive/master.zip
    unzip fisher-callhome-corpus.zip
Then define the environment variable `$FISHER` to point to it:
    cd ~/joshua-tutorial/fisher-callhome-corpus-master
    export FISHER=$(pwd)
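Before going further, it can help to confirm that `$FISHER` points where you expect. The following is a minimal sketch: the function name `check_fisher` is our own and not part of Joshua, and the only thing it assumes about the download is the `corpus` subdirectory used throughout this tutorial.

```shell
# check_fisher: hypothetical helper, not part of Joshua.
# Succeeds if the given directory contains a corpus/ subdirectory.
check_fisher() {
    dir="$1"
    if [ -z "$dir" ]; then
        echo "FISHER is not set" >&2
        return 1
    fi
    if [ ! -d "$dir/corpus" ]; then
        echo "no corpus/ directory under $dir" >&2
        return 1
    fi
    echo "FISHER looks fine: $dir"
}
```

You would call it as `check_fisher "$FISHER"` right after the `export` above.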
Preparing the data
Inside the archive is the Fisher and CALLHOME Spanish–English data, which includes Kaldi-provided ASR output and English translations on the Fisher and CALLHOME dataset transcriptions. Because of licensing restrictions, we cannot distribute the Spanish transcripts, but if you have an LDC site license, a script is provided to build them. You can type:
Where the first argument is the path to your LDC data release. This will create the files in `corpus/ldc`.
In `$FISHER/corpus`, there is a set of parallel directories for LDC transcripts (`ldc`), ASR output (`asr`), oracle ASR output (`oracle`), and ASR lattice output (`plf`). The files look like this inside every directory:
    $ ls corpus/ldc
    callhome_devtest.en  fisher_dev2.en.2  fisher_dev.en.2   fisher_test.en.2
    callhome_evltest.en  fisher_dev2.en.3  fisher_dev.en.3   fisher_test.en.3
    callhome_train.en    fisher_dev2.es    fisher_dev.es     fisher_test.es
    fisher_dev2.en.0     fisher_dev.en.0   fisher_test.en.0  fisher_train.en
    fisher_dev2.en.1     fisher_dev.en.1   fisher_test.en.1  fisher_train.es
If you don't have the LDC transcripts, you can use the data in `corpus/asr` instead (just substitute `corpus/asr` wherever you see `corpus/ldc`). This will give a model that knows how to translate the output of a Spanish speech recognition system with a 40% error rate; since the Spanish data is not as good, the translation system will not be as good, but it will probably still be all right. We will now use this data to build our own Spanish–English model using Joshua's pipeline.
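One way to avoid editing every path by hand is to pick the corpus directory once and reuse it. The sketch below is only an illustration: `pick_corpus_dir` is a hypothetical helper, and it assumes that the Spanish training file `fisher_train.es` exists under `corpus/ldc` only after you have built the LDC transcripts.

```shell
# pick_corpus_dir: hypothetical helper, not part of Joshua.
# Prints corpus/ldc if the LDC-derived Spanish training file is present,
# otherwise falls back to corpus/asr.
pick_corpus_dir() {
    root="$1"
    if [ -f "$root/corpus/ldc/fisher_train.es" ]; then
        echo "$root/corpus/ldc"
    else
        echo "$root/corpus/asr"
    fi
}
```

With `CORPUS=$(pick_corpus_dir "$FISHER")`, you could then pass, for example, `--corpus "$CORPUS/fisher_train"` in the pipeline commands that follow.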
Note: It is often the case that Hadoop's HDFS file system is not populated with user directories. You may need to run the following command:
hadoop fs -mkdir /user/$USER
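If you are unsure whether the directory already exists, a guarded version may be more convenient. This is a sketch under a few assumptions: `ensure_hdfs_home` is a hypothetical name, the `hadoop` command is expected on your `PATH`, and your Hadoop version supports `hadoop fs -test -d` and the `-p` option to `-mkdir`.

```shell
# ensure_hdfs_home: hypothetical helper, not part of Joshua.
# Creates /user/<name> in HDFS only if it does not exist yet.
ensure_hdfs_home() {
    user="${1:-$USER}"
    if ! command -v hadoop >/dev/null 2>&1; then
        echo "hadoop not found on PATH; skipping" >&2
        return 0
    fi
    # -test -d returns non-zero when the directory is missing.
    hadoop fs -test -d "/user/$user" || hadoop fs -mkdir -p "/user/$user"
}
```

Calling `ensure_hdfs_home` with no argument uses the current `$USER`.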
Run the pipeline
Create an experiments directory for containing your first experiment. Note: we recommend that this not be inside your `$JOSHUA` directory.
    mkdir ~/joshua-tutorial/runs
    cd ~/joshua-tutorial/runs
The next step is to create the baseline run, using a particular directory structure for experiments that will allow us to take advantage of scripts provided with Joshua for displaying the results of many related experiments. Because this can take quite some time to run, we are going to shrink the model considerably by restricting the data: Joshua will only use sentences in the training sets with 11 or fewer words on either side (Spanish or English), and we will run only a single iteration of tuning, again only on sentences with 11 or fewer words:
    $JOSHUA/bin/pipeline.pl \
      --rundir 1 \
      --readme "Baseline Hiero run" \
      --source es \
      --target en \
      --type hiero \
      --corpus $FISHER/corpus/ldc/fisher_train \
      --tune $FISHER/corpus/ldc/fisher_dev \
      --test $FISHER/corpus/ldc/fisher_dev2 \
      --maxlen 11 \
      --maxlen-tune 11 \
      --maxlen-test 11 \
      --tuner-iterations 1 \
      --lm-order 3
This will start the pipeline building a Spanish–English translation system constructed from the training data and a dictionary, tuned against `fisher_dev`, and tested against `fisher_dev2`. It will use the default values for most of the pipeline: GIZA++ for alignment, KenLM's `lmplz` for building the language model and for run-time querying, Z-MERT for tuning, and so on. We change the order of the n-gram model to 3 (from its default of 5) because there is not enough data to build a 5-gram LM.
This should take about 20 minutes to run. In the end, you will have a model with a low BLEU score on your test set.
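To get a feel for how much data the length restriction leaves, you can apply a similar filter yourself. The sketch below is illustrative only and is not the pipeline's own code: `filter_maxlen` is a hypothetical helper that counts whitespace-separated tokens, which may differ from how the pipeline counts words after its own preprocessing.

```shell
# filter_maxlen: illustrative only; this is not the pipeline's own code.
# Reads two parallel files (one sentence per line) and prints, tab-separated,
# the pairs in which both sides have at most MAX whitespace-separated tokens.
filter_maxlen() {
    src="$1"; tgt="$2"; max="$3"
    paste "$src" "$tgt" | awk -F'\t' -v max="$max" '
        {
            ns = split($1, s, " ")   # token count, source side
            nt = split($2, t, " ")   # token count, target side
            if (ns <= max && nt <= max) print
        }'
}
```

For example, `filter_maxlen $FISHER/corpus/ldc/fisher_train.es $FISHER/corpus/ldc/fisher_train.en 11 | wc -l` gives a rough count of the sentence pairs that would survive `--maxlen 11`.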
Once that is finished, you will have a baseline model. From there, you might wish to try variations of the baseline model. Here are some examples of what you could vary:
- Build an SAMT model (`--type samt`), GHKM model (`--type ghkm`), or phrasal ITG model (`--type phrase`)
- Use the Berkeley aligner instead of GIZA++ (`--aligner berkeley`)
- Decode with BerkeleyLM (`--lm berkeleylm`) instead of KenLM (the default)
- Change the order of the LM to 4 (`--lm-order 4`)
- Tune with MIRA instead of MERT (`--tuner kbmira`). This requires that Moses is installed and is pointed to by the environment variable `$MOSES`.
- Tune with a wider beam (`--joshua-args '-pop-limit 200'`) (the default is 100)
To do this, we will create new runs that partially reuse the results of previous runs. This is possible by doing three things:
- incrementing the run directory and providing an updated README note;
- telling the pipeline which of the many steps of the pipeline to begin at; and
- providing the needed dependencies.
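Since run directories in this tutorial are numbered (`1`, `2`, and so on), picking the next one can be automated. The helper below is a hypothetical convenience, not something shipped with Joshua, and it assumes your runs live directly under the experiments directory as integer-named entries.

```shell
# next_rundir: hypothetical helper, not part of Joshua.
# Prints the smallest positive integer that is not yet used as a name
# in the given experiments directory (default: current directory).
next_rundir() {
    base="${1:-.}"
    n=1
    while [ -e "$base/$n" ]; do
        n=$((n + 1))
    done
    echo "$n"
}
```

From inside `~/joshua-tutorial/runs`, you could then write `--rundir "$(next_rundir)"` instead of typing the number yourself.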
A second run
Let's begin by building a phrase-based model instead of a hierarchical one. To do so, we change the run directory and the model type, and otherwise repeat the previous command:
    $JOSHUA/bin/pipeline.pl \
      --rundir 2 \
      --readme "Baseline phrase run" \
      --source es \
      --target en \
      --type phrase \
      --corpus $FISHER/corpus/ldc/fisher_train \
      --tune $FISHER/corpus/ldc/fisher_dev \
      --test $FISHER/corpus/ldc/fisher_dev2 \
      --maxlen 11 \
      --maxlen-tune 11 \
      --maxlen-test 11 \
      --tuner-iterations 1 \
      --lm-order 3
Here, we have essentially the same invocation, except for the run directory and the specification of the model type ("phrase" instead of "hiero").
However, this is somewhat wasteful: we have already preprocessed the corpora and run alignment, which for bigger models can be expensive. We can reuse the previous results if we provide the pipeline with more information, telling it
- to start at the model-building stage (`--first-step model`)
- the location of the preprocessed corpora (`--tune`, `--test`, and `--corpus`)
- the location of the alignment (`--alignment`)
- to skip doing data preparation (`--no-prepare`)
Here is an alternate version of Run 2 that builds on the results of Run 1:
    $JOSHUA/bin/pipeline.pl \
      --rundir 3 \
      --readme "Baseline phrase run, picking up from run 1" \
      --source es \
      --target en \
      --type phrase \
      --first-step model \
      --no-prepare \
      --alignment 1/alignments/training.align \
      --corpus 1/data/train/corpus \
      --tune 1/data/tune/corpus \
      --test 1/data/test/corpus \
      --maxlen 11 \
      --maxlen-tune 11 \
      --maxlen-test 11 \
      --tuner-iterations 1 \
      --lm-order 3
The result here should be identical to that of Run #2, except for random variance from the tuner.
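Because a run that picks up from an earlier one depends on files that run produced, it may be worth checking that they exist before launching. The following sketch makes some assumptions you should verify against your own run directory: `check_run_deps` is a hypothetical helper, and it guesses that each corpus prefix is completed by the language extension (for example `corpus.es`). It checks only the source side of the tune and test sets, since the reference files may be numbered.

```shell
# check_run_deps: hypothetical helper, not part of Joshua.
# Verifies that the files a later run would reuse from an earlier run exist
# and are non-empty. Assumes prefixes are completed by ".SOURCE"/".TARGET".
check_run_deps() {
    run="$1"; src="$2"; tgt="$3"
    status=0
    for f in \
        "$run/alignments/training.align" \
        "$run/data/train/corpus.$src" "$run/data/train/corpus.$tgt" \
        "$run/data/tune/corpus.$src" \
        "$run/data/test/corpus.$src"
    do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return $status
}
```

From the experiments directory, `check_run_deps 1 es en` would report anything missing from Run 1.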
Issues with Larger Data Sets
When using larger data sets, you will need to provide different stages of the pipeline with more memory. Here are some common ones.
- `--joshua-mem XXg`: the amount of memory used by the decoder
- `--aligner-mem`: used by the aligner. This is only really necessary if you are using the Berkeley aligner.
- `--packer-mem`: memory used by the grammar packing procedures.
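As a rough illustration, the memory flags can be collected in one place and appended to a pipeline command. The sizes below are placeholders rather than recommendations, and we assume that `--aligner-mem` and `--packer-mem` accept values in the same `XXg` form as `--joshua-mem`.

```shell
# Memory settings for a larger run. The sizes are placeholders, not
# recommendations; choose them based on your data and hardware.
MEM_ARGS="--joshua-mem 16g --aligner-mem 8g --packer-mem 8g"

# Print the flags so they can be reviewed before use.
echo "memory flags: $MEM_ARGS"
```

You would then add `$MEM_ARGS` (unquoted, so that it splits into separate arguments) to the end of a `$JOSHUA/bin/pipeline.pl` invocation like the ones above.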
Analyzing the data