The FetchEmitTuple is the request object that is submitted to tika-server's /pipes and /async handlers in Tika 2.x. Users send a single FetchEmitTuple to the /pipes endpoint and a list of FetchEmitTuples to the /async endpoint.
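The difference between the two request bodies can be sketched in Python as follows. This is only an illustration: the helper names `pipes_body` and `async_body` are hypothetical, nothing is sent over the network, and it assumes the list for /async is serialized as a plain JSON array.

```python
import json


def pipes_body(fetch_emit_tuple):
    """Serialize one FetchEmitTuple, the shape described for /pipes."""
    if not isinstance(fetch_emit_tuple, dict):
        raise TypeError("/pipes takes a single FetchEmitTuple (one JSON object)")
    return json.dumps(fetch_emit_tuple)


def async_body(fetch_emit_tuples):
    """Serialize a list of FetchEmitTuples, the shape described for /async.

    Assumption: the list is sent as a plain JSON array of objects.
    """
    tuples = list(fetch_emit_tuples)
    if not all(isinstance(t, dict) for t in tuples):
        raise TypeError("/async takes a list of FetchEmitTuples (JSON objects)")
    return json.dumps(tuples)


example = {
    "fetcher": "fsf",
    "fetchKey": "hello_world.pdf",
    "emitter": "fse",
    "emitKey": "hello_world.pdf.json",
}

single = pipes_body(example)      # one JSON object
batch = async_body([example])     # JSON array containing one object
```

Keeping the two helpers separate makes it harder to post a bare object to /async or an array to /pipes by accident.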
Minimal
This is the minimal information that needs to be included:
{ "fetcher": "fsf", "fetchKey": "hello_world.pdf", "emitter": "fse", "emitKey": "hello_world.pdf.json" }
The fetcher is the name of the fetcher to be used. This name must be defined in the tika-config.xml file that is passed into tika-app or tika-server. The fetchKey is the key to the file to be parsed. The emitter is the name of the emitter to be used. This name must be defined in the tika-config.xml file as well. The emitKey is the key to use when writing the extracted text and metadata.
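As a minimal sketch, the four fields described above can be assembled and checked on the client side before a request is sent. The names `REQUIRED_KEYS`, `minimal_tuple` and `missing_keys` are hypothetical helpers for illustration, not part of Tika.

```python
import json

# The four fields listed as the minimal information for a FetchEmitTuple.
REQUIRED_KEYS = ("fetcher", "fetchKey", "emitter", "emitKey")


def minimal_tuple(fetcher, fetch_key, emitter, emit_key):
    """Build the smallest FetchEmitTuple as a plain dict."""
    return {
        "fetcher": fetcher,     # fetcher name defined in tika-config.xml
        "fetchKey": fetch_key,  # key of the file to be parsed
        "emitter": emitter,     # emitter name defined in tika-config.xml
        "emitKey": emit_key,    # key used when writing the extracted output
    }


def missing_keys(fetch_emit_tuple):
    """Return the required keys that are absent or empty, in a fixed order."""
    return [k for k in REQUIRED_KEYS if not fetch_emit_tuple.get(k)]


request = minimal_tuple("fsf", "hello_world.pdf", "fse", "hello_world.pdf.json")
body = json.dumps(request)
```

This check only confirms that the fields are present. Whether the fetcher and emitter names actually exist in tika-config.xml can only be determined by the server.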
Id
By default, the fetchKey is used as the id for logging. However, if users need a distinct task id for the request, they may add an id element:
{ "id": "myTaskId", "fetcher": "fsf", "fetchKey": "hello_world.pdf", "emitter": "fse", "emitKey": "hello_world.pdf.json" }
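The fallback described above can be mirrored on the client side, for example to correlate local logs with the server's. This is a sketch of the described behavior and not Tika's own code; the function name `logging_id` is hypothetical.

```python
def logging_id(fetch_emit_tuple):
    """Return the id used for logging: the explicit id if given, else the fetchKey."""
    return fetch_emit_tuple.get("id", fetch_emit_tuple["fetchKey"])


without_id = {
    "fetcher": "fsf",
    "fetchKey": "hello_world.pdf",
    "emitter": "fse",
    "emitKey": "hello_world.pdf.json",
}
# Same request with a distinct task id added.
with_id = dict(without_id, id="myTaskId")
```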
HandlerConfig
Users may specify limits on the extracted text, and its format, with the HandlerConfig:
{ "fetcher": "fsf", "fetchKey": "hello_world.pdf", "emitter": "fse", "emitKey": "hello_world.pdf.json", "handlerConfig": { "maxEmbeddedResources": 10, "type": "xml", "writeLimit": 10000 } }
UserMetadata
Users may inject external metadata into the output of the extract. Users specify this in the metadata element, where each key maps to either a single value or a list of values:
{ "emitKey": "emitKey1", "emitter": "my_emitter", "fetchKey": "fetchKey1", "fetcher": "my_fetcher", "handlerConfig": { "maxEmbeddedResources": 10, "type": "xml", "writeLimit": 10000 }, "id": "myTaskId", "metadata": { "m1": [ "v1", "v2" ], "m2": "v3" } }
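One way to attach user metadata from client code is sketched below. The helper `with_user_metadata` is hypothetical, and the restriction to string values (or lists of strings) is an assumption based on the example above, not a documented rule.

```python
import json


def with_user_metadata(fetch_emit_tuple, metadata):
    """Return a copy of the tuple with a "metadata" element added.

    Assumption: each value is either one string or a list of strings,
    matching the key/value and key/values forms shown in the example.
    """
    for key, value in metadata.items():
        is_single = isinstance(value, str)
        is_multi = isinstance(value, list) and all(isinstance(v, str) for v in value)
        if not (is_single or is_multi):
            raise ValueError(f"metadata[{key!r}] must be a string or a list of strings")
    result = dict(fetch_emit_tuple)       # leave the caller's dict untouched
    result["metadata"] = dict(metadata)
    return result


base = {
    "fetcher": "my_fetcher",
    "fetchKey": "fetchKey1",
    "emitter": "my_emitter",
    "emitKey": "emitKey1",
    "id": "myTaskId",
}
request = with_user_metadata(base, {"m1": ["v1", "v2"], "m2": "v3"})
body = json.dumps(request)
```

Serializing with `json.dumps` rather than writing the JSON by hand also avoids quoting mistakes such as an unterminated string value.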