Statistical Machine Translation with Apache Joshua (Incubating)
This page explains how to use Apache Joshua (Incubating) to undertake statistical machine translation (SMT) via the Tika.translate API. This work is the result of development that took place both through TIKA-1343 (Create a Tika Translator implementation that uses JoshuaDecoder) and through close collaboration with the Joshua community.
The benefits of using this approach for achieving language translation through Tika are as follows:
- It's free! As opposed to several other translation services currently available via Tika, SMT via Joshua is free. You build the language models, you set up and manage the infrastructure, and you have 100% control over the resulting translation.
- You are not restricted by a usage ceiling. As there is no paid service, you can use this method completely unrestricted.
- The language model generation and quality are completely transparent. A major issue with statistical models (or, more generally, any models used within learning processes) is that the models are typically not shared, so it is difficult to fully quantify or justify the results you get. For example, if we were to use Google Translate, we would have no insight into how the translations are undertaken, what accuracy they achieve, etc. The method proposed here addresses this concern entirely. Everything is 100% transparent.
The downsides of using this approach are as follows:
- Joshua, the underlying SMT toolkit, is quite a complex piece of software. This should come as no surprise: SMT is an extremely difficult and active research area. Some of the world's largest companies, e.g. Google, Yahoo!, Bing, IBM, are investing large sums of money and significant resources in trying to address the issues. Having SMT available via Tika is a huge step towards building the open source SMT community.
- Depending upon your translation requirements, you may need to build your own language models. This depends on which models are available via the Joshua community. If you do need to build your own models/language packs, this is not a trivial process; however, you can find plenty of help on this topic on the Joshua mailing lists.
- Depending on the availability of good hardware, you may encounter performance issues. Loading large language models, running SMT tasks generally, and building new language packs all benefit from powerful machines with lots of RAM. If such hardware is not available, you may encounter issues.
With the above in mind, let us continue with configuring and provisioning Tika for SMT with Joshua.
Step 1: Retrieve the Joshua Language Pack
In this example, we will be using a Spanish-to-English n-gram language model pack which was generated on October 6th 2016 and built using BerkeleyLM. For more detail on the language pack itself and how it was produced, see the Language Pack Details.
Grab the language pack and set it up as follows
To run the language pack, invoke the command
The Joshua decoder will start running, accepting input from STDIN and writing to STDOUT. Joshua expects its input in the form of a single sentence per line. Each sentence should first be piped through prepare.sh, which normalizes and tokenizes the input for the language pack's source language. Here we have a passage.txt for you to try. If you run out of memory then please increase it within the
It takes some time (sometimes as much as a minute) to load all of the models into memory, which means there is high latency from startup until the first translation. To reduce this time, Joshua can also be run in server mode, implementing either a direct TCP/IP interface or a Google Translate-style RESTful API. To run Joshua as a TCP/IP server, add the option
You can then connect directly to the socket using nc or telnet:
Take a look at output.txt and you will see your translated passage... pretty cool eh?
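As an alternative to nc or telnet, the same exchange can be sketched in Python. This is only a rough illustration under stated assumptions: the host, the port 5674, and the idea that the server replies with exactly one line per input line are placeholders rather than confirmed Joshua behaviour, and the helper names are hypothetical. The sentence splitter is a naive heuristic for producing one sentence per line; it does not replace prepare.sh, which should still be applied to the input.

```python
import re
import socket
from typing import Iterable, List


def split_sentences(text: str) -> List[str]:
    """Rough heuristic: break after '.', '!' or '?' followed by whitespace."""
    flat = " ".join(text.split())  # collapse newlines and repeated spaces
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]


def translate_over_tcp(
    sentences: Iterable[str],
    host: str = "localhost",  # assumption: decoder runs on this machine
    port: int = 5674,         # assumption: placeholder port, use your own
    timeout: float = 120.0,   # generous, since model loading can be slow
) -> List[str]:
    """Send one sentence per line and collect one reply line per sentence."""
    results: List[str] = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # A text-mode file wrapper makes line-oriented I/O straightforward.
        with sock.makefile("rw", encoding="utf-8", newline="\n") as stream:
            for sentence in sentences:
                stream.write(sentence.strip() + "\n")
                stream.flush()
                reply = stream.readline()
                if not reply:  # the server closed the connection
                    break
                results.append(reply.rstrip("\n"))
    return results
```

With a server listening, a call such as translate_over_tcp(split_sentences(open("passage.txt", encoding="utf-8").read())) would return the reply lines in order.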
You can enable the RESTful interface by also passing '-server-type http':
The RESTful interface is used when running the browser demo (see apache-joshua-es-en-2016-10-06/web/index.html) or when using the Joshua Translation Engine.
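To give a feel for what a "Google Translate-style" request might look like from a client, here is a minimal Python sketch. Everything specific in it is an assumption made for illustration: the base URL and port, the query parameter names (q, source, target, borrowed from the Google Translate v2 API), and the JSON response shape. Check the browser demo or the Joshua documentation for the parameters your server actually accepts.

```python
import json
import urllib.request
from typing import Any, Dict, List
from urllib.parse import urlencode


def build_translate_url(
    text: str,
    source: str = "es",
    target: str = "en",
    base_url: str = "http://localhost:5674/",  # assumption: placeholder address
) -> str:
    """Build a GET URL using Google Translate v2 style parameter names."""
    query = urlencode({"q": text, "source": source, "target": target})
    return f"{base_url}?{query}"


def extract_translations(payload: Dict[str, Any]) -> List[str]:
    """Pull translated strings out of a Google Translate v2 shaped response.

    Assumed shape: {"data": {"translations": [{"translatedText": "..."}]}}
    """
    items = payload.get("data", {}).get("translations", [])
    return [item["translatedText"] for item in items]


def fetch_translations(text: str, **kwargs: str) -> List[str]:
    """Perform the request; requires a running server at base_url."""
    with urllib.request.urlopen(build_translate_url(text, **kwargs)) as resp:
        return extract_translations(json.loads(resp.read().decode("utf-8")))
```

Keeping URL construction and response parsing separate from the network call makes it easy to adjust either one once the real parameter names and response format are known.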
Step 2: Using the Joshua Translation Engine
In this step we establish a translation engine server, written in Python, that provides translations of documents as responses to HTTP requests. It is a very convenient way to establish a remotely accessible translation service that can be used in a RESTful manner, which is exactly what Tika does when configured to use the JoshuaNetworkTranslator implementation.
Let's check out the Joshua Translation Engine source and start the service
The Python application can also print us basic help instructions to explain the above parameters
Once the above is stable you can then progress with configuring Tika!
Step 3: Configure and Provision Apache Tika
Now let us grab the Tika source, then configure, compile and deploy it so that we can utilize Joshua's SMT functionality.
We can now utilize the Tika Translate REST API to undertake language translation for us.
The response you receive should be the translated text for the passage you posted. Pretty neat eh?
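A call to the Tika Translate REST API can be sketched from Python as follows. This is a hedged illustration, not the page's original command: it assumes a Tika server on its default address http://localhost:9998, the tika-server translate resource of the form /translate/all/{translator}/{src}/{dest}, the fully qualified class name org.apache.tika.language.translate.JoshuaNetworkTranslator, and the language identifiers "es" and "en". Verify each of these against your own deployment.

```python
import urllib.request

# Assumptions: default tika-server address and the translator's package name.
TIKA_SERVER = "http://localhost:9998"
TRANSLATOR = "org.apache.tika.language.translate.JoshuaNetworkTranslator"


def build_translate_request(
    text: str,
    src: str = "es",   # assumption: source language identifier
    dest: str = "en",  # assumption: target language identifier
    server: str = TIKA_SERVER,
    translator: str = TRANSLATOR,
) -> urllib.request.Request:
    """Build a PUT request carrying the text to translate as its body."""
    url = f"{server}/translate/all/{translator}/{src}/{dest}"
    return urllib.request.Request(url, data=text.encode("utf-8"), method="PUT")


def translate(text: str, **kwargs: str) -> str:
    """Send the request; requires a running, Joshua-configured Tika server."""
    with urllib.request.urlopen(build_translate_request(text, **kwargs)) as resp:
        return resp.read().decode("utf-8")
```

With the services from the previous steps running, translate(open("passage.txt", encoding="utf-8").read()) would return the translated passage as a string.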
If you have issues with the Tika components of this document, you can get help from the Tika community mailing lists. If you have issues with the Joshua components of this document, you can get help from the Joshua community mailing lists.
Language Pack Details
The language pack being used in this example was trained on the following bitexts, with a 4-gram language model built over the target side of the bitext. Many of the parallel documents were sourced from OPUS, A Collection of Multilingual Parallel Corpora with Tools and Interfaces (http://opus.lingfil.uu.se/).
- Europarl version 7. Proceedings of the European Parliament.
- DGT: A collection of translation memories provided by the JRC.
- EMEA. PDF documents from the European Medicines Agency.
- Global Voices. Parallel news stories.
- JRC-Acquis: legislative text of the European Union from 1950 on.
- News Commentary v10. News commentaries provided by WMT.
- Fisher & Callhome Spanish. Translations of conversational Spanish.
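To make the idea of a 4-gram language model over the target side more concrete, the toy Python sketch below counts 4-grams in a handful of English sentences and computes an unsmoothed maximum-likelihood probability for a word given its three preceding words. It is only an illustration of the concept: production toolkits such as BerkeleyLM use smoothing and compact storage, and the function names and padding symbols here are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int = 4) -> List[Tuple[str, ...]]:
    """Return all n-grams of a sentence, padded with boundary symbols."""
    padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


def mle_probability(
    corpus: Iterable[str], history: Sequence[str], word: str
) -> float:
    """Unsmoothed estimate of P(word | history) from whitespace-split text."""
    n = len(history) + 1
    gram_counts: Counter = Counter()
    context_counts: Counter = Counter()
    for sentence in corpus:
        for gram in ngrams(sentence.split(), n):
            gram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    context = tuple(history)
    if context_counts[context] == 0:
        return 0.0  # unseen history; a real model would back off instead
    return gram_counts[context + (word,)] / context_counts[context]
```

For the two-sentence corpus ["the cat sat on the mat", "the cat sat on the sofa"], the history ("sat", "on", "the") is followed by "mat" in one of its two occurrences, so the estimate for "mat" is 0.5.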