Language packs are pre-built translation models with an included instance of the Joshua runtime environment. A key feature is that there are no dependencies (apart from Java 8). Getting a machine translation system running on your own machine is as easy as downloading the tarball, unpacking it, and running the included shell script.
Info: (March 2017) Version 3 language packs with KenLM (via Docker) and more complete Google Translate API support are coming soon.

Language packs for 62 languages have been released! These models should be considered provisional, in the spirit of publishing something and then iterating and improving as demand and resources become available. If you have questions, comments, or concerns, or wish to help, please post to the Joshua mailing list: dev@joshua.apache.org.
...
ISO 639 | Language pair | Release Date | Size | Version | Notes |
---|---|---|---|---|---|
en-en | English–English | 2016-11-18 | various | 2 | English paraphrase packs from the Paraphrase Database |
am-en | Amharic–English | 2016-11-18 | 841 MB | 2 | |
ar-en | Arabic–English | 2016-11-18 | 1.4 GB | 2 | |
az-en | Azerbaijani–English | 2016-11-18 | 846 MB | 2 | |
bg-en | Bulgarian–English | 2016-11-18 | 2.2 GB | 2 | |
bn-en | Bengali–English | 2016-11-18 | 893 MB | 2 | |
bs-en | Bosnian–English | 2016-11-18 | 1.4 GB | 2 | |
ca-en | Catalan–English | 2016-11-18 | 936 MB | 2 | |
cs-en | Czech–English | 2016-11-18 | 2.7 GB | 2 | |
da-en | Danish–English | 2016-11-18 | 3.5 GB | 2 | |
de-en | German–English | 2016-11-18 | 4.0 GB | 2 | |
dv-en | Divehi–English | 2016-11-18 | 873 MB | 2 | |
el-en | Greek–English | 2016-11-18 | 3.2 GB | 2 | |
en-de | English–German | 2017-01-31 | 4.5 GB | 2 | Phrase-based model |
en-ru | English–Russian | | 4.6 GB | 2 | Language model data sources can be found within the artifact README file |
es-en | Spanish–English | 2016-11-18 | 4.8 GB | 2 | |
et-en | Estonian–English | 2016-11-18 | 2.2 GB | 2 | |
eu-en | Basque–English | 2016-11-18 | 877 MB | 2 | |
fa-en | Persian–English | 2016-11-18 | 1.3 GB | 2 | |
fi-en | Finnish–English | 2016-11-18 | 2.6 GB | 2 | |
fr-en | French–English | 2016-11-18 | 4.0 GB | 2 | |
ga-en | Irish–English | 2016-11-18 | 866 MB | 2 | |
gl-en | Galician–English | 2016-11-18 | 879 MB | 2 | |
ha-en | Hausa–English | 2016-11-18 | 853 MB | 2 | |
he-en | Hebrew–English | 2016-11-18 | 1.4 GB | 2 | |
hi-en | Hindi–English | 2016-11-18 | 858 MB | 2 | |
hr-en | Croatian–English | 2016-11-18 | 1.4 GB | 2 | |
hu-en | Hungarian–English | 2016-11-18 | 2.0 GB | 2 | |
id-en | Indonesian–English | 2016-11-18 | 1.4 GB | 2 | |
is-en | Icelandic–English | 2016-11-18 | 1.1 GB | 2 | |
it-en | Italian–English | 2016-11-18 | 3.9 GB | 2 | |
ka-en | Georgian–English | 2016-11-18 | 849 MB | 2 | |
ku-en | Kurdish–English | 2016-11-18 | 827 MB | 2 | |
lt-en | Lithuanian–English | 2016-11-18 | 2.0 GB | 2 | |
lv-en | Latvian–English | 2016-11-18 | 2.0 GB | 2 | |
mg-en | Malagasy–English | 2016-11-18 | 907 MB | 2 | |
mk-en | Macedonian–English | 2016-11-18 | 1.4 GB | 2 | |
ml-en | Malayalam–English | 2016-11-18 | 851 MB | 2 | |
ms-en | Malay–English | 2016-11-18 | 1014 MB | 2 | |
mt-en | Maltese–English | 2016-11-18 | 1.4 GB | 2 | |
nl-en | Dutch–English | 2016-11-18 | 3.6 GB | 2 | |
no-en | Norwegian–English | 2016-11-18 | 1.4 GB | 2 | |
pl-en | Polish–English | 2016-11-18 | 2.8 GB | 2 | |
pt-en | Portuguese–English | 2016-11-18 | 4.5 GB | 2 | |
ro-en | Romanian–English | 2016-11-18 | 2.5 GB | 2 | |
ru-en | Russian–English | 2016-11-18 | 1.9 GB | 2 | |
ru-en | Russian–English | | 4.4 GB | 2 | Language model data sources can be found within the artifact README file |
sd-en | Sindhi–English | 2016-11-18 | 837 MB | 2 | |
si-en | Sinhala–English | 2016-11-18 | 862 MB | 2 | |
sk-en | Slovak–English | 2016-11-18 | 2.4 GB | 2 | |
sl-en | Slovenian–English | 2016-11-18 | 2.3 GB | 2 | |
so-en | Somali–English | 2016-11-18 | 850 MB | 2 | |
sq-en | Albanian–English | 2016-11-18 | 1.3 GB | 2 | |
sr-en | Serbian–English | 2016-11-18 | 1.5 GB | 2 | |
sv-en | Swedish–English | 2016-11-18 | 3.4 GB | 2 | |
sw-en | Swahili–English | 2016-11-18 | 859 MB | 2 | |
ta-en | Tamil–English | 2016-11-18 | 832 MB | 2 | |
te-en | Telugu–English | 2016-11-18 | 823 MB | 2 | |
tg-en | Tajik–English | 2016-11-18 | 851 MB | 2 | |
th-en | Thai–English | 2016-11-18 | 1.1 GB | 2 | |
tr-en | Turkish–English | 2016-11-18 | 1.4 GB | 2 | |
tt-en | Tatar–English | 2016-11-18 | 840 MB | 2 | |
ug-en | Uighur–English | 2016-11-18 | 838 MB | 2 | |
uk-en | Ukrainian–English | 2016-11-18 | 984 MB | 2 | |
ur-en | Urdu–English | 2016-11-18 | 866 MB | 2 | |
vi-en | Vietnamese–English | 2016-11-18 | 1.2 GB | 2 | |
Using Language Packs
Once you have downloaded a language pack, unpack it. The simplest use case is then to run Joshua as a standard UNIX tool that accepts a single line of input and writes a single line of output. Assuming your language pack is downloaded to "apache-joshua-language-pack.tgz":
...
- "-m XXg": increase the amount of memory available to Joshua. The default is 8g, but for the larger language packs you will want 16g or 24g. In general, 50% more memory than the raw model size should be more than sufficient.
- "-top-n N": output up to N translation candidates instead of just one.
- "-output-format STRING": change the output format. By default, Joshua outputs only the single tokenized translation with the highest model probability. The format string can include the following specifiers:
  - %s: the raw translated string
  - %S: the detokenized translated string
  - %e: the source string
  - %i: the sequence number (0-indexed)
  - %c: the model score
  - %f: the feature string
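The options above can be combined from a small wrapper script. The following Python sketch is illustrative only and is not part of the language pack. It applies the 50% memory rule of thumb, assembles an argument list, and parses output lines. Several things here are assumptions: the script path `./joshua`, the helper names, the ` ||| ` separator placed in the format string, the choice never to go below the 8g default, and the reading of "raw model size" as the unpacked size on disk rather than the download size listed in the table.

```python
import math
import shlex

DEFAULT_HEAP_GB = 8  # default heap size stated in the option list above


def recommended_memory_gb(model_size_gb, headroom=1.5):
    """Whole-gigabyte heap size: unpacked model size plus 50% headroom.

    Never returns less than the 8g default (a choice made for this sketch).
    """
    return max(DEFAULT_HEAP_GB, math.ceil(model_size_gb * headroom))


def build_joshua_args(model_size_gb, top_n=1, output_format=None, script="./joshua"):
    """Assemble an argument list; the script path is an assumption."""
    args = [script, "-m", f"{recommended_memory_gb(model_size_gb)}g"]
    if top_n > 1:
        args += ["-top-n", str(top_n)]
    if output_format is not None:
        args += ["-output-format", output_format]
    return args


def parse_candidate(line, sep=" ||| "):
    """Parse one output line written with the format "%i ||| %S ||| %c"."""
    index, translation, score = line.rstrip("\n").split(sep)
    return {"index": int(index), "translation": translation, "score": float(score)}


if __name__ == "__main__":
    args = build_joshua_args(10.0, top_n=5, output_format="%i ||| %S ||| %c")
    print(shlex.join(args))
    # Hand-written sample line, not real Joshua output.
    print(parse_candidate("0 ||| hello world ||| -12.5"))
```

The argument list can be passed to `subprocess.run` with the source sentence supplied on standard input. Using a fixed separator in the format string keeps each candidate on one line and makes the fields easy to split.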
Versions
The language pack version history:
Version | Description | Release Date |
---|---|---|
3 | Includes KenLM language model files in addition to BerkeleyLM. BerkeleyLM remains the default; KenLM is recommended, and a Docker container makes it easier to use. The Google API is now multithreaded. | March 2017 |
2 | Contains a "joshua" top-level script and "prepare.sh" for preparing data. Operates in server mode or from the command line. Entirely BerkeleyLM-based. Includes a Joshua 6.1 release candidate jar file. | November 2016 |
Citation
Please cite the following paper if you use Joshua in your research.
...