Language packs are pre-built translation models with an included instance of the Joshua runtime environment. A key feature is that there are no dependencies (apart from Java 8). Getting a machine translation system running on your own machine is as easy as downloading the tarball, unpacking it, and running the included shell script.

Info
Version 3 Language Packs Are Being Updated (Coming Soon)

(March 2017) Version 3 language packs with KenLM (via Docker) and more complete Google Translate API support are coming soon. Language packs for 62 languages have been released! These models should be considered provisional, in the style of publishing something and then iterating and improving as demand and resources become available. If you have questions, comments, or concerns, or wish to help, please post to the Joshua mailing list: dev@joshua.apache.org.

...

| ISO 639 | Language pair | Release Date | Size | Version | Notes |
|---|---|---|---|---|---|
| en-en | English–English | 2016-11-18 | various | 2 | English paraphrase packs from the Paraphrase Database |
| am-en | Amharic–English | 2016-11-18 | 841 MB | 2 | |
| ar-en | Arabic–English | 2016-11-18 | 1.4 GB | 2 | |
| az-en | Azerbaijani–English | 2016-11-18 | 846 MB | 2 | |
| bg-en | Bulgarian–English | 2016-11-18 | 2.2 GB | 2 | |
| bn-en | Bengali–English | 2016-11-18 | 893 MB | 2 | |
| bs-en | Bosnian–English | 2016-11-18 | 1.4 GB | 2 | |
| ca-en | Catalan–English | 2016-11-18 | 936 MB | 2 | |
| cs-en | Czech–English | 2016-11-18 | 2.7 GB | 2 | |
| da-en | Danish–English | 2016-11-18 | 3.5 GB | 2 | |
| de-en | German–English | 2016-11-18 | 4.0 GB | 2 | |
| dv-en | Dhivehi–English | 2016-11-18 | 873 MB | 2 | |
| el-en | Greek–English | 2016-11-18 | 3.2 GB | 2 | |
| en-de | English–German | 2017-01-31 | 4.5 GB | 2 | Phrase-based model |
| en-ru | English–Russian | | 4.6 GB | 2 | Language model data sources can be found within the artifact README file |
| es-en | Spanish–English | 2016-11-18 | 4.8 GB | 2 | |
| et-en | Estonian–English | 2016-11-18 | 2.2 GB | 2 | |
| eu-en | Basque–English | 2016-11-18 | 877 MB | 2 | |
| fa-en | Persian–English | 2016-11-18 | 1.3 GB | 2 | |
| fi-en | Finnish–English | 2016-11-18 | 2.6 GB | 2 | |
| fr-en | French–English | 2016-11-18 | 4.0 GB | 2 | |
| ga-en | Irish–English | 2016-11-18 | 866 MB | 2 | |
| gl-en | Galician–English | 2016-11-18 | 879 MB | 2 | |
| ha-en | Hausa–English | 2016-11-18 | 853 MB | 2 | |
| he-en | Hebrew–English | 2016-11-18 | 1.4 GB | 2 | |
| hi-en | Hindi–English | 2016-11-18 | 858 MB | 2 | |
| hr-en | Croatian–English | 2016-11-18 | 1.4 GB | 2 | |
| hu-en | Hungarian–English | 2016-11-18 | 2.0 GB | 2 | |
| id-en | Indonesian–English | 2016-11-18 | 1.4 GB | 2 | |
| is-en | Icelandic–English | 2016-11-18 | 1.1 GB | 2 | |
| it-en | Italian–English | 2016-11-18 | 3.9 GB | 2 | |
| ka-en | Georgian–English | 2016-11-18 | 849 MB | 2 | |
| ku-en | Kurdish–English | 2016-11-18 | 827 MB | 2 | |
| lt-en | Lithuanian–English | 2016-11-18 | 2.0 GB | 2 | |
| lv-en | Latvian–English | 2016-11-18 | 2.0 GB | 2 | |
| mg-en | Malagasy–English | 2016-11-18 | 907 MB | 2 | |
| mk-en | Macedonian–English | 2016-11-18 | 1.4 GB | 2 | |
| ml-en | Malayalam–English | 2016-11-18 | 851 MB | 2 | |
| ms-en | Malay–English | 2016-11-18 | 1014 MB | 2 | |
| mt-en | Maltese–English | 2016-11-18 | 1.4 GB | 2 | |
| nl-en | Dutch–English | 2016-11-18 | 3.6 GB | 2 | |
| no-en | Norwegian–English | 2016-11-18 | 1.4 GB | 2 | |
| pl-en | Polish–English | 2016-11-18 | 2.8 GB | 2 | |
| pt-en | Portuguese–English | 2016-11-18 | 4.5 GB | 2 | |
| ro-en | Romanian–English | 2016-11-18 | 2.5 GB | 2 | |
| ru-en | Russian–English | 2016-11-18 | 1.9 GB | 2 | |
| ru-en | Russian–English | | 4.4 GB | 2 | Language model data sources can be found within the artifact README file |
| sd-en | Sindhi–English | 2016-11-18 | 837 MB | 2 | |
| si-en | Sinhala–English | 2016-11-18 | 862 MB | 2 | |
| sk-en | Slovak–English | 2016-11-18 | 2.4 GB | 2 | |
| sl-en | Slovenian–English | 2016-11-18 | 2.3 GB | 2 | |
| so-en | Somali–English | 2016-11-18 | 850 MB | 2 | |
| sq-en | Albanian–English | 2016-11-18 | 1.3 GB | 2 | |
| sr-en | Serbian–English | 2016-11-18 | 1.5 GB | 2 | |
| sv-en | Swedish–English | 2016-11-18 | 3.4 GB | 2 | |
| sw-en | Swahili–English | 2016-11-18 | 859 MB | 2 | |
| ta-en | Tamil–English | 2016-11-18 | 832 MB | 2 | |
| te-en | Telugu–English | 2016-11-18 | 823 MB | 2 | |
| tg-en | Tajik–English | 2016-11-18 | 851 MB | 2 | |
| th-en | Thai–English | 2016-11-18 | 1.1 GB | 2 | |
| tr-en | Turkish–English | 2016-11-18 | 1.4 GB | 2 | |
| tt-en | Tatar–English | 2016-11-18 | 840 MB | 2 | |
| ug-en | Uighur–English | 2016-11-18 | 838 MB | 2 | |
| uk-en | Ukrainian–English | 2016-11-18 | 984 MB | 2 | |
| ur-en | Urdu–English | 2016-11-18 | 866 MB | 2 | |
| vi-en | Vietnamese–English | 2016-11-18 | 1.2 GB | 2 | |

Using Language Packs

Once you have downloaded the model, unpack it. The simplest use case is then to run Joshua as a standard UNIX tool, accepting a single line of input and writing a single line of output. Assuming your language pack is downloaded to "apache-joshua-language-pack.tgz":

...

  • "-m XXg": increase the amount of memory provided to Joshua. The default is 8g, but for the larger language packs you will want 16g or 24g. In general, 50% more memory than the raw model size should be more than sufficient.
  • "-top-n N": output up to N translation candidates instead of just one.
  • "-output-format STRING": change the output format. By default, Joshua outputs just the single, tokenized translation with the highest model probability.
    The format string can include the following specifiers:
    • %s: the raw translated string
    • %S: the detokenized translated string
    • %e: the source string
    • %i: the sequence number (0-indexed)
    • %c: the model score
    • %f: the feature string
    These can all be combined in a single string, e.g., -output-format "%i ||| %s ||| %f ||| %c"
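
As an illustration of two of the options above, here is a minimal Python sketch. It is not part of Joshua. The function and class names (recommended_memory_gb, memory_flag, parse_output_line, Translation) are hypothetical helpers invented for this example. The sketch assumes the output was produced with -output-format "%i ||| %s ||| %f ||| %c", that the translated text itself never contains " ||| ", and it leaves the feature string unparsed, since its internal layout is not described here.

```python
import math
from typing import List, NamedTuple

# Separator used in the example format string "%i ||| %s ||| %f ||| %c".
FIELD_SEPARATOR = " ||| "


class Translation(NamedTuple):
    """One output line, split into the four fields of the example format."""
    index: int      # %i: 0-indexed sequence number
    text: str       # %s: raw translated string
    features: str   # %f: feature string, kept unparsed (layout not assumed)
    score: float    # %c: model score


def recommended_memory_gb(model_size_gb: float) -> int:
    """Apply the rule of thumb: raw model size plus 50%, rounded up to whole GB."""
    return math.ceil(model_size_gb * 1.5)


def memory_flag(model_size_gb: float) -> List[str]:
    """Build the "-m XXg" arguments for a model of the given raw size."""
    return ["-m", f"{recommended_memory_gb(model_size_gb)}g"]


def parse_output_line(line: str) -> Translation:
    """Split one line written with -output-format "%i ||| %s ||| %f ||| %c"."""
    parts = line.rstrip("\n").split(FIELD_SEPARATOR)
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    index, text, features, score = parts
    return Translation(int(index), text, features, float(score))
```

For a pack whose raw model size is 4.0 GB, memory_flag(4.0) yields ["-m", "6g"]. Treat that as a lower bound from the rule of thumb rather than a tuned setting: the default of 8g is already higher, and the larger packs may still want 16g or 24g as noted above.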

Versions

The language pack version history:

| Version | Description | Release Date |
|---|---|---|
| 3 | Includes KenLM language model files in addition to BerkeleyLM. BerkeleyLM is the default; KenLM is recommended and is facilitated with a Docker container. The Google API is now multithreaded. | March 2017 |
| 2 | Contains a "joshua" top-level script and "prepare.sh" for preparing data. Operates in server mode or from the command line. Entirely BerkeleyLM-based. Includes a Joshua 6.1 release candidate jar file. | November 2016 |

 

Citation

Please cite the following paper if you use Joshua in your research.

...