Problem:

To make a simple UIMA app work over Hadoop.

Assumption:

  1. You have Hadoop installed, tested, and running.
  2. You have a standalone UIMA app which has already been tested.

How To

  1. Let the UIMA app be a simple name annotation example which uses a type system nameType for name annotation. Let their descriptors be nameAnnotator.xml and nameType.xml.
  2. Write map and reduce classes within the application, along with a job specifier (a sketch of such a mapper follows this list).
  3. Via these map/reduce classes, you aim to annotate the input values they receive.
  4. Create a job jar out of the application.
  5. Run this jar over Hadoop.
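Here is a minimal sketch of such a mapper (steps 2 and 3), written against the classic org.apache.hadoop.mapred API. The class name NameAnnotatorMapper and the input/output types are assumptions made for illustration; it naively loads the descriptor straight from the working directory:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.uima.UIMAFramework;
    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.text.AnnotationFS;
    import org.apache.uima.resource.ResourceSpecifier;
    import org.apache.uima.util.XMLInputSource;

    public class NameAnnotatorMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private AnalysisEngine aEngine;

        public void configure(JobConf job) {
            try {
                // Naive descriptor loading -- this is what will NOT work
                // on the cluster; see the considerations below.
                XMLInputSource in = new XMLInputSource("nameAnnotator.xml");
                ResourceSpecifier aSpecifier =
                        UIMAFramework.getXMLParser().parseResourceSpecifier(in);
                aEngine = UIMAFramework.produceAnalysisEngine(aSpecifier);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            try {
                // Annotate the input value received by the mapper.
                CAS cas = aEngine.newCAS();
                cas.setDocumentText(value.toString());
                aEngine.process(cas);
                // Emit one record per annotation found in the text.
                for (AnnotationFS fs : cas.getAnnotationIndex()) {
                    output.collect(new Text(fs.getType().getName()),
                            new Text(fs.getCoveredText()));
                }
            } catch (Exception e) {
                throw new IOException(e.toString());
            }
        }
    }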

It will not work. There are several other things which have to be taken care of first.

Important Considerations (before creating/running the job jar over Hadoop)

  1. The jar file created should have all the classes and descriptors of the UIMA app, along with the map/reduce classes and the job's main class.
  2. All imports declared in the UIMA descriptors (be it an analysis engine, an aggregate engine, a CAS consumer, etc.) should be imports by name.
  3. Any activity which involves reading a resource should be done using the classloader:
    E.g., reading an XML descriptor should be done via
        XMLInputSource in = new XMLInputSource(ClassLoader.getSystemResourceAsStream(aeXmlDescriptor), null);
    i.e., input streams should be created using the classloader.
  4. Last but not least, a ResourceManager should be used while producing any analysis engine, CAS consumer, etc.

            E.g. ResourceManager rMng = UIMAFramework.newDefaultResourceManager();
                 rMng.setExtensionClassPath(str, true); // Here str is the path to any of the resources, which can be obtained via
                                                        // ClassLoader.getSystemResource(aeXmlDescriptor).getPath()
                 rMng.setDataPath(str);
                 aEngine = UIMAFramework.produceAnalysisEngine(aSpecifier, rMng, null);
 
           This 4th point matters because when we read an XML descriptor without using the classloader, it is by default read from the temporary task directory, i.e.
               /tmp/hadoop-root/mapred/local/taskTracker/jobcache/job_200806112341_0002/task_200806112341_0002_m_000000_0/
           but all the resources get unjarred into the
               /tmp/hadoop-root/mapred/local/taskTracker/jobcache/job_200806112341_0002/work
           directory.
           So, to tell the system to look for the resources in the correct directory, we have to use the ResourceManager. This is required to take
           care of the resources which UIMA will try to load because of the imports present in its various descriptors.
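Putting points 2 through 4 together, a helper along the following lines can be called from the mapper's configure() method instead of the naive loading shown earlier. This is only a sketch: the class and method names (UimaEngineFactory, produce) are hypothetical, while the UIMA calls are the ones from the snippets above.

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.resource.ResourceManager;
    import org.apache.uima.resource.ResourceSpecifier;
    import org.apache.uima.util.XMLInputSource;

    public class UimaEngineFactory {

        // aeXmlDescriptor is the classpath-relative name of the AE
        // descriptor packaged in the job jar, e.g. "nameAnnotator.xml".
        public static AnalysisEngine produce(String aeXmlDescriptor)
                throws Exception {
            // Point 3: read the descriptor through the classloader, so it
            // is found in the unjarred .../work directory, not the task dir.
            XMLInputSource in = new XMLInputSource(
                    ClassLoader.getSystemResourceAsStream(aeXmlDescriptor), null);
            ResourceSpecifier aSpecifier =
                    UIMAFramework.getXMLParser().parseResourceSpecifier(in);

            // Point 4: point the ResourceManager at the unjarred resources
            // so that imports by name in the descriptors can be resolved.
            String str = ClassLoader.getSystemResource(aeXmlDescriptor).getPath();
            ResourceManager rMng = UIMAFramework.newDefaultResourceManager();
            rMng.setExtensionClassPath(str, true);
            rMng.setDataPath(str);

            return UIMAFramework.produceAnalysisEngine(aSpecifier, rMng, null);
        }
    }

The mapper's aEngine field would then be initialized with UimaEngineFactory.produce("nameAnnotator.xml") in configure().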


5 Comments

  1. The procedure above works in distributed mode only. In standalone mode the job jar is not extracted to the job_xxxx/jars directory.

    It seems to be working when using relative paths for the descriptors.

  2. Could you please explain more about "Running UIMA Apps on Hadoop", with an example?
    Thanks

  3. Anonymous

    "1. The jar file created should shave all the classes, descriptors of the UIMA app along with the map/reduce and job main class"

    I think you mean "The jar file created should have all the classes...". Is that correct?

  4. The explanation in this article is too general and I can't understand it. Please explain in more detail, with examples.