Language packs are pre-built translation models bundled with an instance of the Joshua runtime environment. There are no dependencies, so installing a black-box machine translation system on your computer is as easy as downloading the tarball from the link, unpacking it, and running the included shell script.
English–English (paraphrase)
These models are English-to-English translation engines trained on the Paraphrase Database (PPDB). PPDB contains several types of rules: lexical, phrasal, and syntactic. In addition, confidence measures computed over the entries have been used to filter the dataset at different thresholds, so you can use only the highest-confidence rules (S) or all of them (XXXL).
Rule set | S | M | L | XL | XXL | XXXL |
---|---|---|---|---|---|---|
All Rules | General (3.9 GB) | General (10 GB) | | | | |
Lexical Rules | | | | | | |
Phrasal Rules | | | | | | |
Syntactic Rules | | | | | | |
Once you have downloaded the model, unpack it. The simplest use case is to run Joshua as a standard UNIX tool that reads its input line by line and writes one line of output for each line of input:
    tar xzvf apache-joshua-ppdb-2.0-s-all.tgz
    cat input.txt | ./apache-joshua-ppdb-2.0-s-all/joshua
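For batch work it can be convenient to push a whole file through a single Joshua process. The following is a minimal sketch, assuming the S pack was unpacked in the current directory and that the input file holds one sentence per line; the `JOSHUA` variable and the `paraphrase_file` helper are names chosen here for illustration and are not part of the language pack.

```shell
#!/bin/sh
# Path to the launcher script inside the unpacked pack (assumed location;
# override by setting JOSHUA in the environment).
JOSHUA="${JOSHUA:-./apache-joshua-ppdb-2.0-s-all/joshua}"

# paraphrase_file INPUT OUTPUT
# Hypothetical helper: feeds INPUT to one Joshua process and writes the
# result to OUTPUT, then checks that the line counts match.
paraphrase_file() {
    src=$1
    dst=$2
    if [ ! -r "$src" ]; then
        echo "cannot read input file: $src" >&2
        return 1
    fi
    "$JOSHUA" < "$src" > "$dst" || return 1
    # Arithmetic expansion strips the padding some wc versions emit.
    n_in=$(( $(wc -l < "$src") ))
    n_out=$(( $(wc -l < "$dst") ))
    if [ "$n_in" -ne "$n_out" ]; then
        echo "warning: $n_in input lines but $n_out output lines" >&2
    fi
}

# Example:
#   paraphrase_file input.txt output.txt
```

The line-count check is only a sanity test; it assumes the default of one hypothesis per input line.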
Loading the models incurs some startup cost, however, so you may find it more practical to run Joshua as a server. Joshua can run in two server modes: raw TCP and HTTP.
    # start in server mode, taking direct TCP/IP connections
    ./apache-joshua-ppdb-2.0-s-all/joshua -server-port 5674
    cat input.txt | nc localhost 5674

    # start in server mode, answering web queries
    ./apache-joshua-ppdb-2.0-s-all/joshua -server-port 5674 -server-type http
    # Now open ./apache-joshua-ppdb-2.0-s-all/html/index.html in your browser
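Because the server needs time to load its model, a client script may want to wait until the port accepts connections before sending text. Here is a minimal sketch for a server running in raw TCP mode on port 5674; the helper names `wait_for_server` and `paraphrase_tcp` and the `JOSHUA_HOST`/`JOSHUA_PORT` variables are hypothetical, and the readiness check relies on `nc -z`, which most but not all netcat variants support.

```shell
#!/bin/sh
# Assumed defaults matching the commands above; both are overridable.
JOSHUA_HOST="${JOSHUA_HOST:-localhost}"
JOSHUA_PORT="${JOSHUA_PORT:-5674}"

# wait_for_server [TRIES]
# Hypothetical helper: polls the port once per second until something
# accepts connections, giving up after TRIES attempts (default 60).
wait_for_server() {
    tries=${1:-60}
    while [ "$tries" -gt 0 ]; do
        if nc -z "$JOSHUA_HOST" "$JOSHUA_PORT" 2>/dev/null; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    echo "no server on $JOSHUA_HOST:$JOSHUA_PORT" >&2
    return 1
}

# paraphrase_tcp
# Hypothetical helper: sends stdin to the server and prints its replies.
paraphrase_tcp() {
    nc "$JOSHUA_HOST" "$JOSHUA_PORT"
}

# Example (with the server already started in TCP mode):
#   wait_for_server 120 && paraphrase_tcp < input.txt > output.txt
```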
Decoder Options
Joshua supports many command-line options controlling its output. By default, it outputs only a single hypothesis per input line. Here are some options that may be useful to you:
- "-m XXg" — increase the amount of memory provided to Joshua. The default is 8g, but for the larger language packs, you will want 16 or 24. In general, 50% more memory than the raw model size should be more than sufficient.
- "-top-n N" — output up to N translation candidates, instead of just one.
- "-output-format STRING" — change the output format. By default, Joshua outputs just the single, tokenized translation with the highest model probability.
  The format string can include the following specifiers:
  - %s: the raw translated string
  - %S: the detokenized translated string
  - %e: the source string
  - %i: the sequence number (0-indexed)
  - %c: the model score
  - %f: the feature string
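The memory rule of thumb and the options above can be combined as follows. This is a sketch, not part of the language pack: `suggest_memory` is a hypothetical helper that applies the 50% margin to a model size given in gigabytes and never returns less than the 8g default, and the ` ||| ` separator in the format string is an arbitrary choice for illustration.

```shell
#!/bin/sh
# suggest_memory MODEL_SIZE_GB
# Hypothetical helper: 1.5x the raw model size, rounded up to a whole
# gigabyte, and never below the 8g default.
suggest_memory() {
    awk -v size="$1" 'BEGIN {
        need = size * 1.5
        gb = int(need)
        if (gb < need) gb++      # round up to the next whole gigabyte
        if (gb < 8) gb = 8       # stay at or above the default
        print gb "g"
    }'
}

# Example for a 10 GB model: up to 5 candidates per input line, each
# printed as sequence number, detokenized string, and model score.
# MODEL_DIR is a placeholder for the unpacked pack directory.
#   "$MODEL_DIR/joshua" -m "$(suggest_memory 10)" -top-n 5 \
#       -output-format "%i ||| %S ||| %c" < input.txt
```

The computed value is only a starting point; the larger packs may still need the 16g or 24g settings mentioned above.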
Citations
Please cite the following paper if you use these models in your research.
@InProceedings{napoles-callisonburch-post:2016:N16-3,
author = {Napoles, Courtney and Callison-Burch, Chris and Post, Matt},
title = {Sentential Paraphrasing as Black-Box Machine Translation},
booktitle = {Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations},
month = {June},
year = {2016},
address = {San Diego, California},
publisher = {Association for Computational Linguistics},
pages = {62--66},
url = {http://www.aclweb.org/anthology/N16-3013}
}
Version History
- Version 1 (June 2016). Runtime: Joshua 6.1 snapshot.