
...

  1. The language pair, using ISO 639-1 two-character codes (e.g., tr-en). This should correspond to what you used when you ran the pipeline.
  2. The tuned Joshua config file. It is best if this contains model file paths that are absolute instead of relative.
  3. The amount of memory Joshua will use when running. The default is 4 GB. To estimate the amount needed, sum the file sizes of all model files (language models and grammars) and round up to the nearest 2 GB; a small sketch after this list illustrates the arithmetic. For example, if your language model is 2.1 GB and your packed grammar is 0.8 GB, 4 GB should be fine.
  4. The credits file contains information about who built the language pack and what data sources were used to do so.
  5. The benchmark file should contain information about how well the language pack performs on a range of standard test sets for the language.
  6. The example should consist of two small files. These will be referenced in the README file that is created and provide a quick way for a user to test the language pack.
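
To make the memory estimate in item 3 concrete, here is a minimal Python sketch of the arithmetic. The file names are placeholders, not part of the release script:

    import math
    import os

    def estimate_memory_gb(model_files, step_gb=2):
        # Sum model file sizes and round up to the nearest step_gb gigabytes.
        total_bytes = sum(os.path.getsize(path) for path in model_files)
        total_gb = total_bytes / (1024 ** 3)
        return max(step_gb, math.ceil(total_gb / step_gb) * step_gb)

    # Placeholder paths: substitute the language model and grammar produced by
    # your own pipeline run. For a 2.1 GB LM and a 0.8 GB grammar this returns
    # 4, matching the example above.
    print(estimate_memory_gb(["lm.kenlm", "grammar.packed.tgz"]))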

Example Benchmark and Credits files

There is no prescribed format for these files. They should be human-readable files that give a reader some guidance on how to evaluate the models against a range of popular test sets, and on how to gather the training data for themselves, should they wish to build a similar model of their own.

Here is an example Benchmark file (used in our 2016-11-18 Turkish–English model):

These benchmarks are the results of a phrase-based model using the last 2,500 lines of bitext (held-out) from each OPUS training source.

4936999 parallel sentences were used to build the tr-en model.

Single-reference four-gram BLEU scores are reported for each test set.

KDE4                0.1571
OpenSubtitles2016   0.1377
SETIMES2            0.2204
Tanzil              0.1288
Tatoeba             0.2977
TED2013             0.1505
Wikipedia           0.3072
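
The benchmark file above does not say which scorer produced the numbers; as one way of reproducing single-reference four-gram BLEU on a held-out test set, here is a minimal sketch using the sacrebleu Python package (an assumption; it is not part of the language pack tooling). Note that sacrebleu reports BLEU on a 0-100 scale, while the table above uses 0-1:

    import sacrebleu

    # Placeholder file names: one hypothesis and one reference per line,
    # line-aligned with each other.
    with open("test.en.hyp", encoding="utf-8") as f:
        hypotheses = [line.strip() for line in f]
    with open("test.en.ref", encoding="utf-8") as f:
        references = [line.strip() for line in f]

    # Single reference, default 4-gram BLEU.
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    print(bleu.score / 100)  # divide by 100 to match the 0-1 scale above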

And here is an example Credits file, from the same language pack:

Language Pack (tr to en) created by Paul McNamee (mcnamee@jhu.edu) on 10/20/16.

The following corpora were used to train the model:
    bible-literal GlobalVoices KDE4 OpenSubtitles2016 SETIMES2 Tanzil Tatoeba TED2013 Ubuntu Wikipedia

Except for the Bible, these corpora are available from the OPUS portal at:
    http://opus.lingfil.uu.se/

The OPUS corpora were downloaded from the website on 10/4/16.
The last 5,000 lines of each bitext were used for tuning (the first 2,500 lines)
and testing (the second 2,500 lines). Up to the first 3 million lines of each
training file were used in building the model.

The target (English) side of the bitext was used, in addition to a 2% sample
of English Gigaword Fifth Edition (LDC2011T07), to construct the language model.
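
The train/tune/test split described in this credits file is straightforward to reproduce. A minimal Python sketch for one side of a bitext, with placeholder file names, assuming one sentence per line (the other side would be split identically to stay line-aligned):

    def split_bitext(lines, heldout=5000, max_train=3_000_000):
        # Last 5,000 lines: first half for tuning, second half for testing.
        tail = lines[-heldout:]
        tune = tail[:heldout // 2]
        test = tail[heldout // 2:]
        # Remaining lines, capped at 3 million, are used for training.
        train = lines[:-heldout][:max_train]
        return train, tune, test

    with open("corpus.tr", encoding="utf-8") as f:
        train, tune, test = split_bitext(f.readlines())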

The result

After running the script, you will find a directory structure that looks like the following:

...