Grobid Quantities is a Java library that recognizes expressions of measurement (e.g. pressure, temperature) in text documents, then parses, normalizes, and converts the measurements into SI units. It can be used on technical and scientific articles (text, XML, and PDF input) and on patents (text and XML input). To use its capabilities with Tika, you must install and run the REST server provided by Grobid Quantities, which extracts measurement units from the text passed to it.
Steps to install: Install Grobid Quantities by following the steps on GitHub, and make sure the quantity model is trained as per the instructions provided there.
After installing Grobid Quantities and training the model, start the REST server using the following command:
$ mvn -Dmaven.test.skip=true jetty:run-war
By default the server listens on port 8080 and can be reached at http://127.0.0.1:8080.
Tika configuration (saved as tika-config.xml, the file passed to --config in the run command below):
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.ner.NamedEntityParser">
      <mime>text/plain</mime>
      <mime>text/html</mime>
      <mime>application/xhtml+xml</mime>
    </parser>
  </parsers>
</properties>
Contents of GrobidServer.properties (the properties file created in the directory set up below):
grobid.server.url=http://localhost:8080
grobid.endpoint.text=/processQuantityText
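As a rough illustration of how these two properties are likely used, the sketch below reads them from a properties file and joins them into the URL that requests would be sent to. It assumes the recogniser simply concatenates grobid.server.url and grobid.endpoint.text; the temporary file and the get_prop helper are hypothetical and exist only for this example.

```shell
# Write the two properties to a scratch file (stand-in for GrobidServer.properties).
props=$(mktemp)
cat > "$props" <<'EOF'
grobid.server.url=http://localhost:8080
grobid.endpoint.text=/processQuantityText
EOF

# Hypothetical helper: print the value of key $1 from the key=value file $2.
get_prop() {
  sed -n "s/^$1=//p" "$2"
}

server_url=$(get_prop 'grobid.server.url' "$props")
endpoint=$(get_prop 'grobid.endpoint.text' "$props")

# Assumption: the full request URL is the server URL followed by the endpoint path.
full_url="${server_url}${endpoint}"
echo "$full_url"
```

If the server runs on a different host or port, only grobid.server.url should need to change; the endpoint path stays the same.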
# Create a directory for keeping the config and properties files.
export GROBID_QUANTITIES_RES="$HOME/GrobidQuantitiesRest-resources"
mkdir -p "$GROBID_QUANTITIES_RES"
cd "$GROBID_QUANTITIES_RES"
# The config file must be stored in this directory.
pwd
export PATH_PREFIX="$GROBID_QUANTITIES_RES/org/apache/tika/parser/ner/grobid"
mkdir -p "$PATH_PREFIX"
# Create and edit the properties file.
vim "$PATH_PREFIX/GrobidServer.properties"
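The interactive vim step can also be scripted. The following is a minimal sketch, not the documented procedure: it writes the properties file with a here-document and lists the resulting layout. It falls back to a scratch directory when GROBID_QUANTITIES_RES is not already exported, which is an assumption made so the example can be tried safely.

```shell
# Use the exported directory if present; otherwise a scratch directory (assumption for this sketch).
GROBID_QUANTITIES_RES="${GROBID_QUANTITIES_RES:-$(mktemp -d)}"
PATH_PREFIX="$GROBID_QUANTITIES_RES/org/apache/tika/parser/ner/grobid"
mkdir -p "$PATH_PREFIX"

# Write the properties file without opening an editor.
# Note: this overwrites an existing GrobidServer.properties in that directory.
cat > "$PATH_PREFIX/GrobidServer.properties" <<'EOF'
grobid.server.url=http://localhost:8080
grobid.endpoint.text=/processQuantityText
EOF

# List the files relative to the resources directory to confirm the layout.
(cd "$GROBID_QUANTITIES_RES" && find . -type f)
```

The nested org/apache/tika/parser/ner/grobid path mirrors the Java package of the recogniser class, so that the file can be found once the resources directory is placed on the classpath.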
export TIKA_APP={your/path/to/tika-app}/target/tika-app-1.13-SNAPSHOT.jar
# Set the system property to use the GrobidNERecogniser class.
java -Dner.impl.class=org.apache.tika.parser.ner.grobid.GrobidNERecogniser \
     -classpath "$GROBID_QUANTITIES_RES:$TIKA_APP" \
     org.apache.tika.cli.TikaCLI \
     --config="$GROBID_QUANTITIES_RES/tika-config.xml" \
     -m https://en.wikipedia.org/wiki/Time
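The run command depends on three files being in place: tika-config.xml, GrobidServer.properties under the package path, and the tika-app jar. A small check before launching can save some debugging. The preflight function below is a hypothetical helper written for this page, not part of Tika or Grobid Quantities, and the fallback values in the example call are placeholders.

```shell
# Hypothetical helper: report which files the run command relies on are missing.
# $1 = resources directory, $2 = path to the tika-app jar
preflight() {
  res_dir=$1
  tika_app=$2
  missing=0
  for f in \
    "$res_dir/tika-config.xml" \
    "$res_dir/org/apache/tika/parser/ner/grobid/GrobidServer.properties" \
    "$tika_app"
  do
    if [ ! -f "$f" ]; then
      echo "missing: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call; falls back to placeholder paths if the variables are not exported.
if preflight "${GROBID_QUANTITIES_RES:-.}" "${TIKA_APP:-tika-app.jar}"; then
  echo "all files found, ready to run Tika"
else
  echo "fix the missing files before running Tika"
fi
```

This only checks that the files exist; it does not verify that the Grobid Quantities server is running or that the properties point at it.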