Language Packs

Language packs are pre-built translation models with an included instance of the Joshua runtime environment. A key feature is that there are no dependencies (apart from Java 8). Getting a machine translation system running on your own machine is as easy as downloading the tarball, unpacking it, and running the included shell script.

Language Packs Are Being Updated

Language packs for 62 languages have been released! These models should be considered provisional, in the spirit of publishing something and then iterating and improving as demand and resources allow. If you have questions, comments, or concerns, or wish to help, please post to the Joshua mailing list: user@joshua.apache.org.

The following language packs are available for Joshua. Click the links on the full language pair names to download the models directly. You might also be interested in the notes on how most of these models were built, which include information about how to make them faster (with a little elbow grease) or better (with a little knowledge), and what you might want to do with them.

ISO 639	Language pair	Release Date	Size	Notes
en-en	English–English	2016-11-18	various	English paraphrase packs from the Paraphrase Database
am-en	Amharic–English	2016-11-18	841 MB
ar-en	Arabic–English	2016-11-18	1.4 GB
az-en	Azerbaijani–English	2016-11-18	846 MB
bg-en	Bulgarian–English	2016-11-18	2.2 GB
bn-en	Bengali–English	2016-11-18	893 MB
bs-en	Bosnian–English	2016-11-18	1.4 GB
ca-en	Catalan–English	2016-11-18	936 MB
cs-en	Czech–English	2016-11-18	2.7 GB
da-en	Danish–English	2016-11-18	3.5 GB
de-en	German–English	2016-11-18	4.0 GB
dv-en	Dhivehi–English	2016-11-18	873 MB
el-en	Greek–English	2016-11-18	3.2 GB
en-ru	English–Russian	2016-10-28	4.6 GB	Language model data sources can be found within the artifact README file
es-en	Spanish–English	2016-11-18	4.8 GB
et-en	Estonian–English	2016-11-18	2.2 GB
eu-en	Basque–English	2016-11-18	877 MB
fa-en	Persian–English	2016-11-18	1.3 GB
fi-en	Finnish–English	2016-11-18	2.6 GB
fr-en	French–English	2016-11-18	4.0 GB
ga-en	Irish–English	2016-11-18	866 MB
gl-en	Galician–English	2016-11-18	879 MB
ha-en	Hausa–English	2016-11-18	853 MB
he-en	Hebrew–English	2016-11-18	1.4 GB
hi-en	Hindi–English	2016-11-18	858 MB
hr-en	Croatian–English	2016-11-18	1.4 GB
hu-en	Hungarian–English	2016-11-18	2.0 GB
id-en	Indonesian–English	2016-11-18	1.4 GB
is-en	Icelandic–English	2016-11-18	1.1 GB
it-en	Italian–English	2016-11-18	3.9 GB
ka-en	Georgian–English	2016-11-18	849 MB
ku-en	Kurdish–English	2016-11-18	827 MB
lt-en	Lithuanian–English	2016-11-18	2.0 GB
lv-en	Latvian–English	2016-11-18	2.0 GB
mg-en	Malagasy–English	2016-11-18	907 MB
mk-en	Macedonian–English	2016-11-18	1.4 GB
ml-en	Malayalam–English	2016-11-18	851 MB
ms-en	Malay–English	2016-11-18	1014 MB
mt-en	Maltese–English	2016-11-18	1.4 GB
nl-en	Dutch–English	2016-11-18	3.6 GB
no-en	Norwegian–English	2016-11-18	1.4 GB
pl-en	Polish–English	2016-11-18	2.8 GB
pt-en	Portuguese–English	2016-11-18	4.5 GB
ro-en	Romanian–English	2016-11-18	2.5 GB
ru-en	Russian–English	2016-11-18	1.9 GB
ru-en	Russian–English	2016-11-04	4.4 GB	Language model data sources can be found within the artifact README file
sd-en	Sindhi–English	2016-11-18	837 MB
si-en	Sinhala–English	2016-11-18	862 MB
sk-en	Slovak–English	2016-11-18	2.4 GB
sl-en	Slovenian–English	2016-11-18	2.3 GB
so-en	Somali–English	2016-11-18	850 MB
sq-en	Albanian–English	2016-11-18	1.3 GB
sr-en	Serbian–English	2016-11-18	1.5 GB
sv-en	Swedish–English	2016-11-18	3.4 GB
sw-en	Swahili–English	2016-11-18	859 MB
ta-en	Tamil–English	2016-11-18	832 MB
te-en	Telugu–English	2016-11-18	823 MB
tg-en	Tajik–English	2016-11-18	851 MB
th-en	Thai–English	2016-11-18	1.1 GB
tr-en	Turkish–English	2016-11-18	1.4 GB
tt-en	Tatar–English	2016-11-18	840 MB
ug-en	Uighur–English	2016-11-18	838 MB
uk-en	Ukrainian–English	2016-11-18	984 MB
ur-en	Urdu–English	2016-11-18	866 MB
vi-en	Vietnamese–English	2016-11-18	1.2 GB

Using Language Packs

Once you have downloaded a model, unpack it. The simplest use case is to run Joshua as a standard UNIX tool that reads input one line at a time and writes one line of output for each. Assuming your language pack was downloaded as "apache-joshua-SRC-TRG-YYYY-MM-DD.tgz":

# SRC and TRG are the two-character ISO 639-1 language codes
tar xzf apache-joshua-SRC-TRG-YYYY-MM-DD.tgz
cd apache-joshua-SRC-TRG-YYYY-MM-DD
cat example.SRC | ./prepare.sh | ./joshua

Here, "example.SRC" is a file containing sentences in your input language, one per line, where SRC is the language code (e.g., "es" for Spanish). Joshua expects to be given one sentence at a time; it will not split documents into sentences by itself.
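
Because the input has to arrive one sentence per line, running text needs to be split before it reaches the pipeline. The following is a minimal sketch of a naive splitter, not part of the language pack: the function name "split_sentences" and the file "document.SRC" are hypothetical, and the rule (break after ".", "!" or "?" followed by whitespace) will wrongly split abbreviations such as "Dr. Smith". A proper sentence splitter for your language is preferable where one is available.

```shell
# split_sentences: read running text on stdin, write one sentence per line.
# Naive rule (an assumption, not Joshua behaviour): a sentence ends at
# '.', '!' or '?' followed by whitespace. Blank lines are dropped.
split_sentences() {
  awk '{
    gsub(/[.!?][ \t]+/, "&\n")          # mark each sentence end with a newline
    n = split($0, parts, "\n")
    for (i = 1; i <= n; i++) {
      sub(/^[ \t]+/, "", parts[i])      # trim leading whitespace
      sub(/[ \t]+$/, "", parts[i])      # trim trailing whitespace
      if (parts[i] != "") print parts[i]
    }
  }'
}

# Usage inside an unpacked language pack (document.SRC is a hypothetical file):
#   split_sentences < document.SRC | ./prepare.sh | ./joshua
```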

The models take some time to load at startup, however, so you may find it more efficient to run Joshua as a server. Joshua can run in two server modes: raw TCP and HTTP.

# start in server mode, taking direct TCP/IP connections
./joshua -server-port 5674 -server-type tcp
cat example.SRC | nc localhost 5674
 
# start in server mode, answering web queries.
./joshua -server-port 5674 -server-type http
# Then open "web/index.html?port=5674" in your browser
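
Since the model takes a while to load, a script that starts the TCP server and sends input immediately may connect before the port is open. The sketch below shows one way to wait for it. The helper "wait_until" is hypothetical, and the probe "nc -z" is an assumption about your netcat variant (some builds lack the "-z" flag); substitute whatever port check your system provides.

```shell
# wait_until TRIES CMD...: run CMD about once per second until it succeeds
# (returns 0) or until TRIES attempts have failed (returns 1).
wait_until() {
  tries=$1
  shift
  while [ "$tries" -gt 0 ]; do
    "$@" >/dev/null 2>&1 && return 0
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}

# Usage inside an unpacked language pack:
#   ./joshua -server-port 5674 -server-type tcp &
#   wait_until 120 nc -z localhost 5674 && cat example.SRC | nc localhost 5674
```

The 120-attempt limit is an arbitrary choice; larger models may need longer to load.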

Decoder Options

Joshua supports many command-line options controlling its output. By default, it outputs only a single hypothesis per input line. Here are some options that may be useful to you:

"-m XXg": increase the amount of memory provided to Joshua. The default is 8g, but for the larger language packs you will want 16g or 24g. In general, 50% more memory than the raw model size should be more than sufficient.
"-top-n N": output up to N translation candidates, instead of just one.
"-output-format STRING": change the output format. By default, Joshua outputs just the single, tokenized translation with the highest model probability.
The format string can contain the following specifiers:
- %s: the raw translated string
- %S: the detokenized translated string
- %e: the source string
- %i: the sequence number (0-indexed)
- %c: the model score
- %f: the feature string
These can all be combined in a single string, e.g., -output-format "%i ||| %s ||| %f ||| %c"
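
The options above can be combined on one command line, and the " ||| "-delimited output is easy to post-process. The following sketch converts lines in the "%i ||| %s ||| %f ||| %c" format into tab-separated columns. The function name "nbest_to_tsv" is hypothetical, and the sketch assumes that neither the translation nor the feature string contains " ||| " itself.

```shell
# nbest_to_tsv: read lines formatted as "%i ||| %s ||| %f ||| %c" and print
# "sequence number<TAB>model score<TAB>translation". Lines that do not have
# exactly four fields are skipped.
nbest_to_tsv() {
  awk -F ' [|][|][|] ' 'NF == 4 { printf "%s\t%s\t%s\n", $1, $4, $2 }'
}

# Usage inside an unpacked language pack:
#   cat example.SRC | ./prepare.sh \
#     | ./joshua -m 16g -top-n 5 -output-format "%i ||| %s ||| %f ||| %c" \
#     | nbest_to_tsv
```

With "-top-n 5", each input sentence can yield up to five lines that share the same sequence number, so the first column can be used to group candidates by sentence.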

Citations

Please cite the following paper if you use Joshua in your research.

  @article{post2015joshua,
    Author = {Post, Matt and Cao, Yuan and Kumar, Gaurav},
    Journal = {The Prague Bulletin of Mathematical Linguistics},
    Title = {Joshua 6: A phrase-based and hierarchical statistical machine translation system},
    Year = {2015}
  }
