Prerequisites

DockerHub Repository and Access

To release the Apache Tika Docker image on DockerHub, you need access to the apache/tika repository. Access is controlled by the ASF Infra team and can be requested through an INFRA JIRA ticket. Make sure to tag the ticket with the Docker label.

tika-docker repo

This repository contains the Dockerfiles used to create the minimal and full images for Apache Tika. It also contains helper examples and configurations.

General Information

Image Types

There are two image types:

  • Minimal - containing just Apache Tika and its base dependencies (i.e. Java)
  • Full - containing Apache Tika, its dependencies, as well as Tesseract and GDAL.

The Dockerfile for each image is in the correspondingly named directory. These Dockerfiles are the only assets used to publish the images.

Docker Compose Files

There are a number of Docker Compose files to allow users to quickly test certain scenarios:

  • Recognising and Captioning Video and Images with TensorFlow REST (see here)
  • Enriching Academic PDF Parsing with Grobid REST (see here)
  • OCR of PDF or Images with Tesseract including a Custom Configuration (see here)
  • Named Entity Recognition (see here)

These different scenarios use the corresponding configuration in the sample-configs directory.

Neither these Docker Compose YML files nor the sample configurations are used for publishing Apache Tika's Docker image. They only provide examples of complex configurations.

An example of using these is provided here.

docker-tool.sh

This shell file is a helper script used to simplify the building, testing and publication of the images.

It provides the following options:

  • build - to build a minimal and full image of the passed in version
  • test - to verify that the built image can start and that the version number is returned
  • publish - to build the multi-arch images and publish the images on DockerHub (only for those who have access to the DockerHub repo)

republish-images.sh

This shell file was used to republish the older images when the Dockerfile was updated. It is redundant now but kept in the repo in case something similar needs to be done in the future.

Release Process

  1. Update the README.md's Available Tags section
  2. Update the TAG version in .env to be X.Y.Z.Q+1
  3. Update CHANGES.md to include this release, changes and release date
  4. Test the release as in the example below
  5. Commit the changes
  6. To release a new version of Apache Tika on DockerHub, follow the steps below (replacing 3.1.0.0 and 3.1.0 with the image tag and Tika version you wish to publish). The first three numbers in the tag represent the Tika version, and the last number represents the docker image version: in 3.1.0.2, that's Tika 3.1.0 and a docker image version of 2. Building the multi-arch images takes a while, so it might be time to get a carafe of coffee while waiting.
$ git clone https://github.com/apache/tika-docker && cd tika-docker
$ ./docker-tool.sh build 3.1.0.0 3.1.0
$ ./docker-tool.sh test 3.1.0.0

# If the test passed, you can then build the multi-arch images and publish them:
# NOTE THAT THIS STEP ALSO PUSHES THE *-latest tag. You may have to adjust the build script if you're pushing a BETA release or a release from an older branch (e.g. 2.x)!
$ ./docker-tool.sh publish 3.1.0.0 3.1.0

  7. If everything worked, tag the last commit:

    1. git tag -a 3.1.0.0 -m "New release for 3.1.0.0"
    2. git push --tags
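
The four-part tag scheme described in step 6 (X.Y.Z.Q, where X.Y.Z is the Tika version and Q is the docker image version) can be illustrated with a small shell sketch. The function names below are hypothetical and are not part of the tika-docker repository. The sketch assumes a well-formed four-part tag and does no validation.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (NOT part of tika-docker) illustrating the X.Y.Z.Q tag scheme.
# Assumption: the tag is well-formed, e.g. 3.1.0.2; no validation is performed.

# Print the Tika version (X.Y.Z) of a four-part tag.
tika_version() {
  local tag="$1"
  echo "${tag%.*}"      # drop the last dot-separated field
}

# Print the docker image version (Q) of a four-part tag.
docker_version() {
  local tag="$1"
  echo "${tag##*.}"     # keep only the last dot-separated field
}

# Print the next tag for the same Tika version, i.e. X.Y.Z.(Q+1).
next_tag() {
  local tag="$1"
  echo "$(tika_version "$tag").$(( $(docker_version "$tag") + 1 ))"
}

tika_version 3.1.0.2    # prints 3.1.0
docker_version 3.1.0.2  # prints 2
next_tag 3.1.0.2        # prints 3.1.0.3
```

With helpers like these, the two arguments passed to the build and publish commands in step 6 would correspond to the full tag and the output of tika_version for that tag, and next_tag gives the value that step 2 asks for in .env.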