Notes on Language Pack Creation

Most of the language packs for Joshua were built with a very generic phrase-based approach, using freely available datasets downloadable from OPUS. Here are a number of things that may be useful to know when using them.

Output Quality. Don't expect to be super pleased with the output. The models are very simple phrase-based translation models with fixed distortion limits. For details, see the CREDITS file inside each language pack.
Help Us Improve Models. If you have interest in improving the results for a particular language pair, we have lots of ideas. Please contact us at dev@joshua.apache.org and we can help you out. If your models have better results on our test sets, we would be happy to replace the currently distributed model with yours!
Docker Containers. The versions we distribute have zero external dependencies, other than Java 8: you simply download the language pack tarball, unpack it, and start translating. You can get better translation results if you replace the included BerkeleyLM language model with KenLM. Instructions on how to do this can be found below. KenLM is not included with the release because it requires compiling somewhat-finicky C++ code, but you might be able to do it with little trouble. In the near future, we plan to release Docker containers that will make compilation of KenLM painless.

Replacing BerkeleyLM with KenLM

This involves just a few steps (details soon).

Compile the KenLM wrapper and place it in a lib/ directory within your language pack.
Replace the "LanguageModel" lines in joshua.config with "StateMinimizingLanguageModel" lines.
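
The config edit in the second step can be scripted. The following is a minimal sketch, not an official tool: it assumes the language model entries in joshua.config look like "feature-function = LanguageModel ..." and it only swaps the feature name. The function name switch_to_kenlm and the sample arguments are hypothetical; any other arguments on the line (such as the model type or file path) are left untouched and may still need manual adjustment.

```python
import re

# Assumption: LM entries in joshua.config have the form
#   feature-function = LanguageModel <arguments>
# Lines already using StateMinimizingLanguageModel do not match this pattern.
LM_LINE = re.compile(r"^(\s*feature-function\s*=\s*)LanguageModel\b")


def switch_to_kenlm(config_text: str) -> str:
    """Rename LanguageModel feature lines to StateMinimizingLanguageModel."""
    rewritten = []
    for line in config_text.splitlines(keepends=True):
        # Only the feature name changes; the rest of the line is preserved.
        rewritten.append(LM_LINE.sub(r"\g<1>StateMinimizingLanguageModel", line))
    return "".join(rewritten)


if __name__ == "__main__":
    # Hypothetical sample line; the argument shown is a placeholder.
    sample = "feature-function = LanguageModel -lm_file lm.placeholder\n"
    print(switch_to_kenlm(sample), end="")
```

To apply it to a real language pack, read joshua.config into a string, pass it through switch_to_kenlm, and write the result back after keeping a backup of the original file.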
