Language packs are pre-built translation models with an included instance of the Joshua runtime environment. A key feature is that there are no dependencies (apart from Java 8). Getting a machine translation system running on your own machine is as easy as downloading the tarball, unpacking it, and running the included shell script.

Language Packs Are Being Updated

Language packs for 62 languages have been released! These models should be considered provisional, in the spirit of publishing something and then iterating and improving it as demand and resources allow. If you have questions, comments, or concerns, or wish to help, please post to the Joshua mailing list: dev@joshua.apache.org.

Language Packs

The following language packs are available for Joshua. Click the links on the full language pair names to download the models directly. You might be interested in notes on how most of these models were built, including information about how to make them faster (with a little elbow-grease), better (with a little knowledge), and what you might want to do with them.

| ISO 639 | Language pair | Release date | Size | Notes |
|---|---|---|---|---|
| en-en | English–English | 2016-11-18 | various | English paraphrase packs from the Paraphrase Database |
| am-en | Amharic–English | 2016-11-18 | 841 MB | |
| ar-en | Arabic–English | 2016-11-18 | 1.4 GB | |
| az-en | Azerbaijani–English | 2016-11-18 | 846 MB | |
| bg-en | Bulgarian–English | 2016-11-18 | 2.2 GB | |
| bn-en | Bengali–English | 2016-11-18 | 893 MB | |
| bs-en | Bosnian–English | 2016-11-18 | 1.4 GB | |
| ca-en | Catalan–English | 2016-11-18 | 936 MB | |
| cs-en | Czech–English | 2016-11-18 | 2.7 GB | |
| da-en | Danish–English | 2016-11-18 | 3.5 GB | |
| de-en | German–English | 2016-11-18 | 4.0 GB | |
| dv-en | Dhivehi–English | 2016-11-18 | 873 MB | |
| el-en | Greek–English | 2016-11-18 | 3.2 GB | |
| en-de | English–German | 2017-01-31 | 4.5 GB | Phrase-based model |
| en-ru | English–Russian | | 4.6 GB | Language model data sources can be found within the artifact README file |
| es-en | Spanish–English | 2016-11-18 | 4.8 GB | |
| et-en | Estonian–English | 2016-11-18 | 2.2 GB | |
| eu-en | Basque–English | 2016-11-18 | 877 MB | |
| fa-en | Persian–English | 2016-11-18 | 1.3 GB | |
| fi-en | Finnish–English | 2016-11-18 | 2.6 GB | |
| fr-en | French–English | 2016-11-18 | 4.0 GB | |
| ga-en | Irish–English | 2016-11-18 | 866 MB | |
| gl-en | Galician–English | 2016-11-18 | 879 MB | |
| ha-en | Hausa–English | 2016-11-18 | 853 MB | |
| he-en | Hebrew–English | 2016-11-18 | 1.4 GB | |
| hi-en | Hindi–English | 2016-11-18 | 858 MB | |
| hr-en | Croatian–English | 2016-11-18 | 1.4 GB | |
| hu-en | Hungarian–English | 2016-11-18 | 2.0 GB | |
| id-en | Indonesian–English | 2016-11-18 | 1.4 GB | |
| is-en | Icelandic–English | 2016-11-18 | 1.1 GB | |
| it-en | Italian–English | 2016-11-18 | 3.9 GB | |
| ka-en | Georgian–English | 2016-11-18 | 849 MB | |
| ku-en | Kurdish–English | 2016-11-18 | 827 MB | |
| lt-en | Lithuanian–English | 2016-11-18 | 2.0 GB | |
| lv-en | Latvian–English | 2016-11-18 | 2.0 GB | |
| mg-en | Malagasy–English | 2016-11-18 | 907 MB | |
| mk-en | Macedonian–English | 2016-11-18 | 1.4 GB | |
| ml-en | Malayalam–English | 2016-11-18 | 851 MB | |
| ms-en | Malay–English | 2016-11-18 | 1014 MB | |
| mt-en | Maltese–English | 2016-11-18 | 1.4 GB | |
| nl-en | Dutch–English | 2016-11-18 | 3.6 GB | |
| no-en | Norwegian–English | 2016-11-18 | 1.4 GB | |
| pl-en | Polish–English | 2016-11-18 | 2.8 GB | |
| pt-en | Portuguese–English | 2016-11-18 | 4.5 GB | |
| ro-en | Romanian–English | 2016-11-18 | 2.5 GB | |
| ru-en | Russian–English | 2016-11-18 | 1.9 GB | |
| ru-en | Russian–English | | 4.4 GB | Language model data sources can be found within the artifact README file |
| sd-en | Sindhi–English | 2016-11-18 | 837 MB | |
| si-en | Sinhala–English | 2016-11-18 | 862 MB | |
| sk-en | Slovak–English | 2016-11-18 | 2.4 GB | |
| sl-en | Slovenian–English | 2016-11-18 | 2.3 GB | |
| so-en | Somali–English | 2016-11-18 | 850 MB | |
| sq-en | Albanian–English | 2016-11-18 | 1.3 GB | |
| sr-en | Serbian–English | 2016-11-18 | 1.5 GB | |
| sv-en | Swedish–English | 2016-11-18 | 3.4 GB | |
| sw-en | Swahili–English | 2016-11-18 | 859 MB | |
| ta-en | Tamil–English | 2016-11-18 | 832 MB | |
| te-en | Telugu–English | 2016-11-18 | 823 MB | |
| tg-en | Tajik–English | 2016-11-18 | 851 MB | |
| th-en | Thai–English | 2016-11-18 | 1.1 GB | |
| tr-en | Turkish–English | 2016-11-18 | 1.4 GB | |
| tt-en | Tatar–English | 2016-11-18 | 840 MB | |
| ug-en | Uighur–English | 2016-11-18 | 838 MB | |
| uk-en | Ukrainian–English | 2016-11-18 | 984 MB | |
| ur-en | Urdu–English | 2016-11-18 | 866 MB | |
| vi-en | Vietnamese–English | 2016-11-18 | 1.2 GB | |

Using Language Packs

Once you download the model, unpack it. The simplest use-case is then to run Joshua as a standard UNIX tool, accepting a single line of input and writing a single line of output. Assuming your language pack is downloaded to "apache-joshua-language-pack.tgz":
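A minimal Python sketch of the unpack-and-run workflow. It assumes the pack's launcher is an executable script that reads sentences on standard input and writes translations to standard output; the launcher's name and location inside the pack are not stated here, so the path used in the usage comment is hypothetical and should be checked against the pack's README.

```python
import subprocess
import tarfile
from typing import List


def unpack(tarball: str, dest: str) -> None:
    """Extract a gzipped language pack tarball into the directory `dest`."""
    with tarfile.open(tarball, "r:gz") as archive:
        archive.extractall(dest)


def translate_file(launcher: str, src_path: str) -> List[str]:
    """Feed a one-sentence-per-line file to the launcher script.

    Returns the launcher's output as a list of lines, one per input sentence
    (assuming the launcher emits exactly one line per input line).
    """
    with open(src_path, "rb") as src:
        result = subprocess.run(
            [launcher], stdin=src, stdout=subprocess.PIPE, check=True
        )
    return result.stdout.decode("utf-8").splitlines()


# Usage (paths below are assumptions, not taken from a real pack):
#   unpack("apache-joshua-language-pack.tgz", ".")
#   lines = translate_file("./apache-joshua-language-pack/joshua", "example.SRC")
```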

Here, "example.SRC" is a file containing sentences in your input language (e.g., "es" for Spanish), one per line. Joshua expects to be given one sentence at a time; it will not do this for documents by itself.
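Because Joshua does not split documents itself, running text has to be broken into one sentence per line beforehand. The sketch below is a deliberately naive splitter that breaks after ".", "!", or "?" followed by whitespace. It is an illustration only: it will mis-split abbreviations such as "Dr. Smith", and a proper sentence segmenter for your source language is the better choice in practice.

```python
import re
from typing import List

# Split on whitespace that follows sentence-final punctuation.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Return the sentences in `text`, one list entry per sentence."""
    # Collapse newlines and repeated spaces so each sentence ends up on one line.
    normalized = " ".join(text.split())
    return [piece for piece in _SENTENCE_BOUNDARY.split(normalized) if piece]


def to_one_per_line(text: str) -> str:
    """Format a document as one sentence per line, ready to write to a file."""
    return "\n".join(split_sentences(text)) + "\n"
```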

There is some startup cost associated with the models, however. You may find it more beneficial, therefore, to run it as a server. Joshua can run in two server modes: raw TCP, and HTTP.
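For the raw TCP mode, a client can be as small as the sketch below. It assumes a line-oriented exchange (the client sends one newline-terminated sentence and the server replies with one line) and that a Joshua server is already listening; the host and port are parameters because this page does not give the actual values.

```python
import socket


def translate_line(host: str, port: int, sentence: str, timeout: float = 30.0) -> str:
    """Send one sentence to a line-oriented TCP server and return its one-line reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Wrap the socket in a text stream so we can write and read whole lines.
        with sock.makefile("rw", encoding="utf-8", newline="\n") as stream:
            stream.write(sentence.strip() + "\n")
            stream.flush()
            return stream.readline().rstrip("\n")
```

Opening a new connection per sentence keeps the sketch short; a long-running client would more likely keep one connection open and reuse it, since the point of server mode is to avoid paying the model startup cost repeatedly.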

Improved Translation With KenLM

The goal in releasing the language packs above was to make it easy for people to run translation systems. Part of this meant having no external dependencies (apart from Java). This means that we were not able to include the excellent KenLM language modeling code. If you are able to compile this, you can use it instead of the provided BerkeleyLM. This will result in significantly better translation quality, load time, and memory usage.

Docker Support

Shortly (February 2017) we will release a Docker module for compiling KenLM and for loading and running any of the Joshua language packs with it, providing an easy way to get these improvements that hides some of the complexity of the manual steps below.

  1. Download KenLM. You need to clone the Joshua repo, set some variables, and compile KenLM:

    If everything compiles correctly, this will produce the file "lib/libken.so" (under Linux).

  2. Make a "lib" directory in your language pack, and copy the file "lib/libken.so" to it. 
  3. Within the language pack, there should be a file named "joshua.config.kenlm". Rename that file to "joshua.config".

You can now start the language pack as usual, and it will use KenLM instead of BerkeleyLM. Depending on your environment, you may have some trouble compiling KenLM and the Joshua JNI library. In general, compilation requires GCC 4.8+ and the Boost libraries.
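Steps 2 and 3 can be scripted once "libken.so" has been built. The Python sketch below copies the library into the pack's "lib" directory and activates the KenLM configuration. Keeping a backup of any existing "joshua.config" is an addition of this sketch, not part of the steps above, and the backup file name is hypothetical.

```python
import shutil
from pathlib import Path


def enable_kenlm(pack_dir: str, libken_path: str) -> None:
    """Install libken.so into a language pack and switch it to the KenLM config."""
    pack = Path(pack_dir)
    kenlm_config = pack / "joshua.config.kenlm"
    active_config = pack / "joshua.config"
    if not kenlm_config.is_file():
        raise FileNotFoundError(f"expected {kenlm_config} in the language pack")

    # Step 2: make a "lib" directory in the pack and copy the compiled library into it.
    lib_dir = pack / "lib"
    lib_dir.mkdir(exist_ok=True)
    shutil.copy2(libken_path, lib_dir / "libken.so")

    # Step 3: rename joshua.config.kenlm to joshua.config.
    # Assumption of this sketch: keep the previous config so the switch can be undone.
    if active_config.exists():
        active_config.replace(pack / "joshua.config.berkeleylm.bak")
    kenlm_config.rename(active_config)
```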

Decoder Options

Joshua supports many command-line options controlling its output. By default, it outputs only a single hypothesis per input line. Here are some options that may be useful to you:

  • "-m XXg" — increase the amount of memory provided to Joshua. The default is 8g, but for the larger language packs, you will want 16 or 24. In general, 50% more memory than the raw model size should be more than sufficient.
  • "-top-n N" — output up to N translation candidates, instead of just one.
  • "-output-format STRING" — change the output format. By default, Joshua outputs just the single, tokenized translation with the highest model probability. 
    The format string can contain these placeholders:
    • %s: the raw translated string
    • %S: the detokenized translated string
    • %e: the source string
    • %i: the sequence number (0-indexed)
    • %c: the model score
    • %f: the feature string
    These can all be combined in a single string, e.g., -output-format "%i ||| %s ||| %f ||| %c"
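The options above can be assembled and their output consumed programmatically. The Python sketch below builds an argument list from the "-m", "-top-n", and "-output-format" options and parses lines produced with the "%i ||| %s ||| %f ||| %c" format. The helper names are hypothetical, rounding the memory value up and never going below the 8g default are choices of this sketch, and the feature string in the usage comment is made up for illustration.

```python
import math
from typing import Dict, List

OUTPUT_FORMAT = "%i ||| %s ||| %f ||| %c"
FIELDS = ("id", "translation", "features", "score")


def memory_flag(model_size_gb: float) -> str:
    """Value for -m: 50% above the raw model size, but at least the 8g default."""
    return f"{max(8, math.ceil(model_size_gb * 1.5))}g"


def decoder_args(model_size_gb: float, top_n: int) -> List[str]:
    """Command-line options requesting `top_n` candidates in OUTPUT_FORMAT."""
    return [
        "-m", memory_flag(model_size_gb),
        "-top-n", str(top_n),
        "-output-format", OUTPUT_FORMAT,
    ]


def parse_output_line(line: str) -> Dict[str, str]:
    """Split one output line written with OUTPUT_FORMAT into named fields.

    All values are returned as strings; a translation that itself contains
    "|||" would break this simple split.
    """
    parts = [part.strip() for part in line.split("|||")]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


# Usage:
#   decoder_args(10.0, 5)
#   parse_output_line("0 ||| hello world ||| lm_0=-3.2 ||| -7.5")
```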

Citation

Please cite the following paper if you use Joshua in your research.

  @article{post2015joshua,
    Author = {Post, Matt and Cao, Yuan and Kumar, Gaurav},
    Journal = {The Prague Bulletin of Mathematical Linguistics},
    Title = {Joshua 6: A phrase-based and hierarchical statistical machine translation system},
    Year = {2015}
  }

17 Comments

  1. Anonymous

    Is there any easy way to build an English -> Spanish language pack? I was able to follow your instructions for ES-EN. Thank you

  2. Which instructions did you follow? We are in the process of porting and updating the documentation to this site (Confluence); if you point me to the page you used, I can prioritize that.

    1. That was quick:

      I followed the instructions from this page:

      http://joshua.incubator.apache.org/6.0/install.html

      and downloaded the language pack from here:

      http://joshua.incubator.apache.org/language-packs/

      And followed one of the readme's inside the ES-EN language packs

      1. If you were able to build an ES-EN pack, you can build the other direction just by reversing the source and target languages. Or am I misunderstanding your question?

        1. I guess I need to build the pack, I just downloaded the pack from here http://joshua.incubator.apache.org/language-packs/es-en-phrase/

  3. Anonymous

    How many language packs are planned to be released soon? 

    1. We plan to release Spanish, Russian, Arabic, and Chinese (all translating into English) with the 6.1 release, coming out next month.

      1. Anonymous

        Thank you very much

  4. Anonymous

    I downloaded the Spanish-English language pack, but bin/joshua is a symbolic link, not an executable binary. The article above says "The language packs will include the decoder runtime and will have no external dependencies". Am I missing something?

    1. We are almost ready to release this, if you can wait till the end of the week. Much has changed with Joshua in preparation for the 6.1 release, including getting rid of the dependencies.

    2. Many have been released. Posting here in case this triggers a note to you.

  5. Anonymous

    Will the language pack release be soon?

  6. Matt Post did the links to my ru-en and en-ru packs disappear? If so then i can add them back in. Thanks

    1. Oh, yes, I assumed everything old was bad. Forgot you had made those. Can you repack them, though, with the latest build_lp.sh? It has a few important fixes in the tokenizer, web demo, and README.

      Then I'd suggest adding them to the table.