A FetchEmitTuple is the request object submitted to tika-server's /pipes and /async handlers in Tika 2.x. Users send a single FetchEmitTuple to the /pipes endpoint and a list of FetchEmitTuples to the /async endpoint.

Minimal

This is the minimal information that needs to be included:

{
    "fetcher": "fsf",
    "fetchKey": "hello_world.pdf",
    "emitter": "fse",
    "emitKey": "hello_world.pdf.json"
}

The fetcher is the name of the fetcher to use; it must be defined in the tika-config.xml file that is passed to tika-app or tika-server. The fetchKey is the key of the file to be parsed. The emitter is the name of the emitter to use; it must also be defined in the tika-config.xml file. The emitKey is the key to use when writing the extracted text and metadata.
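As a rough illustration, the four fields can be assembled and serialized from Python. This is a sketch, not part of Tika: the helper names (`minimal_tuple`, `pipes_request`, `async_request`) are hypothetical, and the base URL `http://localhost:9998`, the POST method, and the JSON content type are assumptions about a locally running tika-server. The requests are only constructed here, not sent.

```python
import json
import urllib.request

# Assumed base URL of a locally running tika-server; adjust as needed.
TIKA_SERVER = "http://localhost:9998"


def minimal_tuple(fetcher, fetch_key, emitter, emit_key):
    """Return the four fields of a minimal FetchEmitTuple as a dict."""
    return {
        "fetcher": fetcher,      # fetcher name defined in tika-config.xml
        "fetchKey": fetch_key,   # key of the file to be parsed
        "emitter": emitter,      # emitter name defined in tika-config.xml
        "emitKey": emit_key,     # key for the extracted text and metadata
    }


def _json_post(path, payload):
    """Build, but do not send, a JSON POST request (method and header are assumptions)."""
    return urllib.request.Request(
        TIKA_SERVER + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def pipes_request(fetch_emit_tuple):
    """/pipes takes a single FetchEmitTuple."""
    return _json_post("/pipes", fetch_emit_tuple)


def async_request(fetch_emit_tuples):
    """/async takes a list of FetchEmitTuples."""
    return _json_post("/async", list(fetch_emit_tuples))


t = minimal_tuple("fsf", "hello_world.pdf", "fse", "hello_world.pdf.json")
single = pipes_request(t)   # body is one JSON object
batch = async_request([t])  # body is a JSON array of objects
```

Passing `single` or `batch` to `urllib.request.urlopen` would send the request to the assumed server.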


Id

By default, the fetchKey is used as the id for logging. Users who need a distinct task id for the request may add an id element:

{
    "id": "myTaskId",
    "fetcher": "fsf",
    "fetchKey": "hello_world.pdf",
    "emitter": "fse",
    "emitKey": "hello_world.pdf.json"
}
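The fallback described above can be mirrored on the client side, for example to correlate log lines with requests. The sketch below only restates the documented default; `logging_id` is a hypothetical helper, not a Tika function.

```python
def logging_id(fetch_emit_tuple):
    """Return the explicit id if one was set, otherwise the fetchKey."""
    if "id" in fetch_emit_tuple:
        return fetch_emit_tuple["id"]
    return fetch_emit_tuple["fetchKey"]


without_id = {
    "fetcher": "fsf",
    "fetchKey": "hello_world.pdf",
    "emitter": "fse",
    "emitKey": "hello_world.pdf.json",
}
# Copy the tuple and add an explicit task id.
with_id = dict(without_id, id="myTaskId")

default_id = logging_id(without_id)  # "hello_world.pdf"
explicit_id = logging_id(with_id)    # "myTaskId"
```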

HandlerConfig

Users may set limits on the extraction and choose the format of the extracted text with the handlerConfig element:

{
    "fetcher": "fsf",
    "fetchKey": "hello_world.pdf",
    "emitter": "fse",
    "emitKey": "hello_world.pdf.json",
    "handlerConfig": {
        "maxEmbeddedResources": 10,
        "type": "xml",
        "writeLimit": 10000
    }
}
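A minimal sketch of attaching a handlerConfig to an existing tuple from Python follows. `with_handler_config` is a hypothetical helper; it copies the three keys shown above without checking which values tika-server actually accepts.

```python
import copy
import json


def with_handler_config(fetch_emit_tuple, handler_type, write_limit,
                        max_embedded_resources):
    """Return a copy of the tuple with a handlerConfig element added."""
    result = copy.deepcopy(fetch_emit_tuple)  # leave the caller's dict untouched
    result["handlerConfig"] = {
        "type": handler_type,
        "writeLimit": write_limit,
        "maxEmbeddedResources": max_embedded_resources,
    }
    return result


base = {
    "fetcher": "fsf",
    "fetchKey": "hello_world.pdf",
    "emitter": "fse",
    "emitKey": "hello_world.pdf.json",
}
configured = with_handler_config(
    base, handler_type="xml", write_limit=10000, max_embedded_resources=10
)
body = json.dumps(configured, indent=4)  # JSON text of the request object
```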

UserMetadata

Users may inject external metadata into the extraction output. Users specify this in the metadata element, where each key maps to either a single value or a list of values:

{
    "emitKey": "emitKey1",
    "emitter": "my_emitter",
    "fetchKey": "fetchKey1",
    "fetcher": "my_fetcher",
    "handlerConfig": {
        "maxEmbeddedResources": 10,
        "type": "xml",
        "writeLimit": 10000
    },
    "id": "myTaskId",
    "metadata": {
        "m1": [
            "v1",
            "v2"
        ],
        "m2": "v3"
    }
}
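Because a metadata value may be either a single string or a list of strings, client code that builds or reads the element has to handle both shapes. The following is a sketch under that assumption only; `with_metadata` and `metadata_values` are hypothetical helpers, not part of Tika.

```python
def with_metadata(fetch_emit_tuple, metadata):
    """Return a copy of the tuple with a metadata element added.

    Each value must be a string or a list/tuple of strings.
    """
    checked = {}
    for key, value in metadata.items():
        if isinstance(value, str):
            checked[key] = value
        elif isinstance(value, (list, tuple)) and all(
            isinstance(v, str) for v in value
        ):
            checked[key] = list(value)
        else:
            raise TypeError(f"metadata[{key!r}] must be a string or list of strings")
    result = dict(fetch_emit_tuple)
    result["metadata"] = checked
    return result


def metadata_values(fetch_emit_tuple, key):
    """Return the values for key as a list, whichever shape was stored."""
    value = fetch_emit_tuple.get("metadata", {}).get(key, [])
    return [value] if isinstance(value, str) else list(value)


request = with_metadata(
    {
        "fetcher": "my_fetcher",
        "fetchKey": "fetchKey1",
        "emitter": "my_emitter",
        "emitKey": "emitKey1",
    },
    {"m1": ["v1", "v2"], "m2": "v3"},
)
```

Here `metadata_values(request, "m1")` yields both values, and `metadata_values(request, "m2")` wraps the single value in a list.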

