Tika and Computer Vision - Image Captioning
This page describes how to use the image captioning capability of Apache Tika. Image captioning, or describing the content of an image, is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. TIKA-2262 introduced a new parser that captions images; see the TIKA-2262 issue on Jira or the pull request on GitHub for the related discussion. Tika currently uses an implementation based on the paper Show and Tell: A Neural Image Caption Generator. The paper presents a generative model, built on a deep recurrent architecture, that combines advances in computer vision and machine translation to generate natural sentences describing an image. Continue reading to get Tika up and running for image captioning.
Tika and Tensorflow Image Captioning Using REST Server
We will start a Python Flask-based REST API server and tell Tika to connect to it. All dependencies and setup complexity are isolated in the Docker image.
Requirements:
Docker -- Visit Docker.com and install the latest version of Docker. (Note: tested on Docker v17.03.1)
Step 1. Set Up the REST Server
You can start the REST server either in an isolated Docker container or natively on a host that runs TensorFlow v1.0.
a. Using Docker (recommended)
git clone https://github.com/USCDataScience/tika-dockers.git && cd tika-dockers
docker build -f Im2txtRestDockerfile -t uscdatascience/im2txt-rest-tika .
docker run -p 8764:8764 -it uscdatascience/im2txt-rest-tika
Once the container is running, test the setup by opening http://localhost:8764/inception/v3/caption/image?url=https://upload.wikimedia.org/wikipedia/commons/thumb/1/1d/Marcus_Thames_Tigers_2007.jpg/1200px-Marcus_Thames_Tigers_2007.jpg in your web browser.
Sample output from API:
{
  "captions": [
    {
      "confidence": 0.010706611316269087,
      "sentence": "a baseball player swinging a bat at a ball"
    },
    {
      "confidence": 0.004686326913725872,
      "sentence": "a baseball player swinging a bat at a ball ."
    },
    {
      "confidence": 0.0041084865981657155,
      "sentence": "a baseball player swinging a bat on a field"
    }
  ],
  "beam_size": 3,
  "max_caption_length": 20,
  "time": {
    "read": 407,
    "captioning": 1632,
    "units": "ms"
  }
}
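The same check can be scripted instead of done in a browser. The following is a minimal Python sketch, assuming the response always has the shape of the sample output above. `build_caption_url` and `best_caption` are hypothetical helper names, not part of Tika or the REST server. The sketch only builds the request URL and parses a response body, so it runs without a live server.

```python
import json
from urllib.parse import urlencode

# Base URI of the captioning service started above (taken from this page).
API_BASE = "http://localhost:8764/inception/v3"


def build_caption_url(image_url, api_base=API_BASE):
    """Return the GET URL that asks the service to caption image_url."""
    return api_base + "/caption/image?" + urlencode({"url": image_url})


def best_caption(response_text):
    """Return (sentence, confidence) for the highest-confidence caption.

    Assumes the JSON layout of the sample output: a "captions" list whose
    items carry "sentence" and "confidence" keys.
    """
    captions = json.loads(response_text)["captions"]
    top = max(captions, key=lambda c: c["confidence"])
    return top["sentence"], top["confidence"]


# Trimmed copy of the sample output shown above.
SAMPLE_RESPONSE = """
{"captions": [
  {"confidence": 0.010706611316269087,
   "sentence": "a baseball player swinging a bat at a ball"},
  {"confidence": 0.0041084865981657155,
   "sentence": "a baseball player swinging a bat on a field"}],
 "beam_size": 3, "max_caption_length": 20}
"""

if __name__ == "__main__":
    print(build_caption_url("https://example.org/cat.jpg"))
    print(best_caption(SAMPLE_RESPONSE))
```

To query a running server, the URL can be passed to `urllib.request.urlopen` and the decoded body handed to `best_caption`.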
Note for Mac users:
- If you are using the older 'Docker Toolbox' instead of the newer 'Docker for Mac', you need to add port forwarding rules to your VirtualBox default machine.
- Open the VirtualBox Manager.
- Select your Docker Machine VirtualBox image.
- Open Settings -> Network -> Advanced -> Port Forwarding.
- Add a rule with any name, set Host IP to 127.0.0.1, and set both ports to 8764.
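To confirm that port 8764 is actually reachable from the host (for example after adding a forwarding rule), a quick TCP check can help. This is a minimal sketch; `port_is_open` is a hypothetical helper, and a successful connection only shows that something is listening on the port, not that the captioning service is healthy.

```python
import socket


def port_is_open(host="127.0.0.1", port=8764, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("REST server reachable:", port_is_open())
```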
b. Without using Docker
If you choose to set up the REST server without a Docker container, you can manually install all the required tools specified in the Dockerfile.
Note: the Dockerfile has setup instructions for Ubuntu; you will have to adapt those commands for your environment.
Once the tools are installed, start the server:
python tika-parsers/src/main/resources/org/apache/tika/parser/captioning/tf/im2txtapi.py
Step 2. Create a Tika Config XML to Enable the TensorFlow Parser
A sample config can be found in the Tika source code at tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow-im2txt-rest.xml
Here is an example:
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.recognition.ObjectRecognitionParser">
      <mime>image/jpeg</mime>
      <mime>image/png</mime>
      <mime>image/gif</mime>
      <params>
        <param name="apiBaseUri" type="uri">http://localhost:8764/inception/v3</param>
        <param name="captions" type="int">5</param>
        <param name="maxCaptionLength" type="int">15</param>
        <param name="class" type="string">org.apache.tika.parser.captioning.tf.TensorflowRESTCaptioner</param>
      </params>
    </parser>
  </parsers>
</properties>
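If you need several variants of this file (for example with different caption counts), generating it can be less error-prone than editing by hand. The following is a minimal sketch that mirrors the example config above using Python's standard `xml.etree.ElementTree`; `build_config` is a hypothetical helper and not part of Tika.

```python
import xml.etree.ElementTree as ET

# MIME types handled by the parser in the example config above.
MIME_TYPES = ("image/jpeg", "image/png", "image/gif")


def build_config(api_base_uri, captions=5, max_caption_length=15):
    """Return a Tika config XML string mirroring the example above."""
    properties = ET.Element("properties")
    parsers = ET.SubElement(properties, "parsers")
    parser = ET.SubElement(
        parsers, "parser",
        {"class": "org.apache.tika.parser.recognition.ObjectRecognitionParser"})
    for mime in MIME_TYPES:
        ET.SubElement(parser, "mime").text = mime
    params = ET.SubElement(parser, "params")
    values = [
        ("apiBaseUri", "uri", api_base_uri),
        ("captions", "int", str(captions)),
        ("maxCaptionLength", "int", str(max_caption_length)),
        ("class", "string",
         "org.apache.tika.parser.captioning.tf.TensorflowRESTCaptioner"),
    ]
    for name, type_, value in values:
        param = ET.SubElement(params, "param", {"name": name, "type": type_})
        param.text = value
    return ET.tostring(properties, encoding="unicode")


if __name__ == "__main__":
    print(build_config("http://localhost:8764/inception/v3", captions=3))
```

The output is a single unindented line; it is equivalent XML, so Tika should read it the same way as the formatted example.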
Description of parameters:
Param Name | Type | Meaning | Range | Example |
apiBaseUri | uri | HTTP URL that will be used to create apiUri & healthUri | any HTTP URL | http://localhost:8764/inception/v3 |
captions | int | Number of captions to output | a positive integer | 3 to receive 3 captions |
maxCaptionLength | int | Maximum length of a caption | a positive integer (recommended >= 15) | 15, so that no caption exceeds a length of 15 |
class | string | Name of the class that implements the object recognition contract | constant string | org.apache.tika.parser.captioning.tf.TensorflowRESTCaptioner |
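The ranges in the table above can be checked before writing values into the config. This is a minimal sketch under stated assumptions: `validate_params` is a hypothetical helper, and accepting `https` alongside `http` is an assumption, since the table only says "any HTTP URL".

```python
from urllib.parse import urlparse


def _is_positive_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def validate_params(api_base_uri, captions, max_caption_length):
    """Return (errors, warnings) for the three tunable parameters."""
    errors, warnings = [], []
    parsed = urlparse(api_base_uri)
    # Assumption: https is accepted alongside http.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        errors.append("apiBaseUri must be an HTTP URL")
    if not _is_positive_int(captions):
        errors.append("captions must be a positive integer")
    if not _is_positive_int(max_caption_length):
        errors.append("maxCaptionLength must be a positive integer")
    elif max_caption_length < 15:
        # The table recommends 15 or more; smaller values are still valid.
        warnings.append("maxCaptionLength below 15 is not recommended")
    return errors, warnings


if __name__ == "__main__":
    print(validate_params("http://localhost:8764/inception/v3", 5, 15))
```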
Step 3. Demo
$ java -jar tika-app/target/tika-app-1.17-SNAPSHOT.jar \
    --config=tika-parsers/src/test/resources/org/apache/tika/parser/recognition/tika-config-tflow-im2txt-rest.xml \
    https://upload.wikimedia.org/wikipedia/commons/f/f6/Working_Dogs%2C_Handlers_Share_Special_Bond_DVIDS124942.jpg
The input image is:
And the output is:
...

INFO Available = true, API Status = HTTP/1.0 200 OK
INFO Captions = 5, MaxCaptionLength = 15
INFO Recogniser = org.apache.tika.parser.captioning.tf.TensorflowRESTCaptioner
INFO Recogniser Available = true
INFO minConfidence = 0.05, topN=2
INFO Time taken 1779ms
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="org.apache.tika.parser.recognition.object.rec.impl" content="org.apache.tika.parser.captioning.tf.TensorflowRESTCaptioner"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.CompositeParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.recognition.ObjectRecognitionParser"/>
<meta name="resourceName" content="Working_Dogs%2C_Handlers_Share_Special_Bond_DVIDS124942.jpg"/>
<meta name="Content-Length" content="295937"/>
<meta name="CAPTION" content="a man standing next to a dog on a leash . (0.00017)"/>
<meta name="CAPTION" content="a man standing next to a dog on a bench . (0.00017)"/>
<meta name="CAPTION" content="a man and a dog are sitting on a bench . (0.00014)"/>
<meta name="CAPTION" content="a man and a dog sitting on a bench . (0.00013)"/>
<meta name="CAPTION" content="a man and a dog are sitting on a bench (0.00009)"/>
<meta name="Content-Type" content="image/jpeg"/>
<title/>
</head>
<body><ol id="captions"> <li id="0"> a man standing next to a dog on a leash . [en](confidence = 0.000167)</li>
<li id="1"> a man standing next to a dog on a bench . [en](confidence = 0.000167)</li>
<li id="2"> a man and a dog are sitting on a bench . [en](confidence = 0.000138)</li>
<li id="3"> a man and a dog sitting on a bench . [en](confidence = 0.000131)</li>
<li id="4"> a man and a dog are sitting on a bench [en](confidence = 0.000092)</li>
</ol>
</body></html>
$
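The captions in the output above can be pulled out of the XHTML programmatically. The following is a minimal Python sketch, assuming the input is only the XHTML part of the output (without the INFO log lines) and that each CAPTION value ends with the confidence in parentheses, as in the demo. `extract_captions` is a hypothetical helper, not part of Tika.

```python
import xml.etree.ElementTree as ET

# Tika emits XHTML, so element names carry this namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def extract_captions(xhtml_text):
    """Return a list of (sentence, confidence) pairs from CAPTION meta tags."""
    root = ET.fromstring(xhtml_text)
    results = []
    for meta in root.iter(XHTML_NS + "meta"):
        if meta.get("name") != "CAPTION":
            continue
        # Content looks like "a man ... on a leash . (0.00017)".
        sentence, _, score = meta.get("content").rpartition(" (")
        results.append((sentence.strip(), float(score.rstrip(")"))))
    return results


# Trimmed copy of the demo output shown above.
SAMPLE_OUTPUT = """<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="CAPTION" content="a man standing next to a dog on a leash . (0.00017)"/>
<meta name="CAPTION" content="a man and a dog are sitting on a bench (0.00009)"/>
<meta name="Content-Type" content="image/jpeg"/>
<title/>
</head>
<body/>
</html>"""

if __name__ == "__main__":
    for sentence, confidence in extract_captions(SAMPLE_OUTPUT):
        print(confidence, sentence)
```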
Questions / Suggestions / Improvements / Feedback?
If this was useful, let us know on Twitter by mentioning @ApacheTika.
If you have questions, ask on the mailing lists.
If you find any bugs, report them on Jira.