Introduction
The page provides details on how to translate documents (via the Tika.translate API) using Reader Translator Generator, a neural machine translation toolkit.
The benefits of using this approach for machine translation through Tika are as follows;
- It's free! As opposed to several other translation services currently available via Tika, NMT via RTG is free.
- You are not restricted under usage ceiling, and you don't have to allocate monthly payments. There is no paid service behind the scene, you can use this method completely unrestricted.
- You will have full control over the whole pipeline.You may either build NMT models or download pretrained models, set up server and manage backend.
- Your data and documents are not sent to any services outside of your pipeline. So you can guarantee privacy of your data.
Though, you have to keep these in mind:
- Though you may run the model on CPU for testing, the translation will be very slow on CPUs. GPUs are highly recommended.
- NMT models are not interpretable and explainable. We cannot explain or guarantee that the translations are 100% correct. This is not specific to RTG/NMT; it is generally true for all neural machine translation services.
This is relatively a new addition; the following translation models are currently available:
To train models for your desired translation direction, please refer to the documentation at https://isi-nlp.github.io/rtg/#_usage
Integration: Overview
The class
org.apache.tika.language.translate.RTGTranslator
glues Tika system with RTG REST API.By default, it interacts with http://localhost:6060.
This URL can be customized by adding
translator.rtg.properties
file to classpath with rtg.base.url
property.500 Languages to English Translation
Step 1: Start RTG Translator Service
500-English model can be obtained from a docker image as follows
Docker image can be run on CPU (i.e. without GPU, for testing): docker run --rm -i -p 6060:6060 tgowda/rtg-model:500toEng-v1
Using GPU (e.g. Device 0) is recommended for translating a lot of documents:
docker run --rm -i -p 6060:6060 --gpus '"device=0"' tgowda/rtg-model:500toEng-v1
Verify that the translator serive is actually running by accessing http://localhost:6060/
Step 2: Start Tika Server Jar
Option 1: Obtain prebuilt jar
Note: This option is for the future versions. The current prebuilt jars do not have this feature integrated. Go to Option 2.
wget https://www.apache.org/dyn/closer.cgi/tika/tika-server-2.0.0.jar
Option 2: Build Tika Server from source
$ git clone https://github.com/apache/tika.git
$ cd tika
# if the pull request is not merged yet; please pull from this repo
$ git checkout -b TIKA-3329
$ git pull https://github.com/thammegowda/tika.git TIKA-3329
# Compile and package Tika
$ mvn clean package -DskipTests
# Start Tika server
$ java -jar tika-server/target/tika-server-2.0.0-SNAPSHOT.jar
Step 3:Translate Documents via Tika + RTG
printf "Hola señor\nನಮಸ್ಕಾರ\nBonjour monsieur\nПривет\n" > tmp.txt
$ curl http://localhost:9998/translate/all/org.apache.tika.language.translate.RTGTranslator/x/eng -X PUT -T tmp.txt
Hi, sir.
Namaskar
Good morning, sir.
Hi.
Optional: Change the base URL of RTG translator service
You may deploy RTG service elsewhere (on a machine with GPU) and point its URL to tika.
Step 1: Create a file named translator.rtg.properties
with rtg.base.url
property
echo "rtg.base.url=http://<myhost>:<port>/rtg/v1" > translator.rtg.properties
Step 2: Add the directory having translator.rtg.properties
to classpath; In this case . i.e, $PWD
java -cp '.:tika-server/target/tika-server-2.0.0-SNAPSHOT.jar' org.apache.tika.server.TikaServerCli
Step 3: Interact with Tika Server as usual
$ curl http://localhost:9998/translate/all/org.apache.tika.language.translate.RTGTranslator/x/eng -X PUT -T tmp.txt
Acknowledgements
If you wish to acknowledge or reference either RTG toolkit or the 500-English model, please reference this article: https://arxiv.org/abs/2104.00290
@misc{gowda2021manytoenglish,
title={Many-to-English Machine Translation Tools, Data, and Pretrained Models},
author={Thamme Gowda and Zhao Zhang and Chris A Mattmann and Jonathan May},
year={2021},
eprint={2104.00290},
archivePrefix={arXiv},
primaryClass={cs.CL}
}