Problem:

To make a simple UIMA app work over Hadoop.

Assumption:

  1. You have Hadoop installed, tested, and running.
  2. You have a standalone UIMA app which has already been tested.

How To

  1. Let the UIMA app be a simple name annotation example which uses a type system nameType for name annotation. Let their descriptors be nameAnnotator.xml and nameType.xml.
  2. Write map and reduce classes within the application, along with a job specifier (a sketch of such a mapper follows this list).
  3. Via these map/reduce classes, you aim to annotate the input values they receive.
  4. Create a job jar out of the application.
  5. Run this jar over Hadoop.
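Here is a minimal sketch of such a mapper (steps 2 and 3), written against the classic org.apache.hadoop.mapred API. The class name NameAnnotatorMapper and the input/output types are assumptions made for illustration; it naively loads the descriptor straight from the working directory:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.uima.UIMAFramework;
    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.cas.CAS;
    import org.apache.uima.cas.text.AnnotationFS;
    import org.apache.uima.resource.ResourceSpecifier;
    import org.apache.uima.util.XMLInputSource;

    public class NameAnnotatorMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private AnalysisEngine aEngine;

        public void configure(JobConf job) {
            try {
                // Naive descriptor loading -- this is what will NOT work
                // on the cluster; see the considerations below.
                XMLInputSource in = new XMLInputSource("nameAnnotator.xml");
                ResourceSpecifier aSpecifier =
                        UIMAFramework.getXMLParser().parseResourceSpecifier(in);
                aEngine = UIMAFramework.produceAnalysisEngine(aSpecifier);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            try {
                // Annotate the input value received by the mapper.
                CAS cas = aEngine.newCAS();
                cas.setDocumentText(value.toString());
                aEngine.process(cas);
                // Emit one record per annotation found in the text.
                for (AnnotationFS fs : cas.getAnnotationIndex()) {
                    output.collect(new Text(fs.getType().getName()),
                            new Text(fs.getCoveredText()));
                }
            } catch (Exception e) {
                throw new IOException(e.toString());
            }
        }
    }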

It will not work. There are several other things which have to be taken care of first.

Important Considerations (before creating/running the job jar over Hadoop)

  1. The jar file created should have all the classes and descriptors of the UIMA app, along with the map/reduce classes and the job's main class.
  2. All imports declared in the UIMA descriptors (be it an analysis engine, an aggregate engine, a CAS consumer, etc.) should be imports by name.
  3. Any activity which involves reading a resource should be done using the classloader:
    E.g., reading an XML descriptor should be done via
        XMLInputSource in = new XMLInputSource(ClassLoader.getSystemResourceAsStream(aeXmlDescriptor), null);
    i.e., input streams should be created using the classloader.
  4. Last but not least, a ResourceManager should be used while producing any analysis engine, CAS consumer, etc.

            E.g. ResourceManager rMng = UIMAFramework.newDefaultResourceManager();
                 rMng.setExtensionClassPath(str, true); // Here str is the path to any of the resources, which can be obtained via
                                                        // ClassLoader.getSystemResource(aeXmlDescriptor).getPath()
                 rMng.setDataPath(str);
                 aEngine = UIMAFramework.produceAnalysisEngine(aSpecifier, rMng, null);
 
           This 4th point matters because when we read an XML descriptor without using the classloader, it is by default read from the temporary task directory, i.e.
               /tmp/hadoop-root/mapred/local/taskTracker/jobcache/job_200806112341_0002/task_200806112341_0002_m_000000_0/
           but all the resources get unjarred into the
               /tmp/hadoop-root/mapred/local/taskTracker/jobcache/job_200806112341_0002/work
           directory.
           So, to tell the system to look for the resources in the correct directory, we have to use the ResourceManager. This is required to take
           care of the resources which UIMA will try to load because of the imports present in its various descriptors.
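Putting points 2 through 4 together, a helper along the following lines can be called from the mapper's configure() method instead of the naive loading shown earlier. This is only a sketch: the class and method names (UimaEngineFactory, produce) are hypothetical, while the UIMA calls are the ones from the snippets above.

    import org.apache.uima.UIMAFramework;
    import org.apache.uima.analysis_engine.AnalysisEngine;
    import org.apache.uima.resource.ResourceManager;
    import org.apache.uima.resource.ResourceSpecifier;
    import org.apache.uima.util.XMLInputSource;

    public class UimaEngineFactory {

        // aeXmlDescriptor is the classpath-relative name of the AE
        // descriptor packaged in the job jar, e.g. "nameAnnotator.xml".
        public static AnalysisEngine produce(String aeXmlDescriptor)
                throws Exception {
            // Point 3: read the descriptor through the classloader, so it
            // is found in the unjarred .../work directory, not the task dir.
            XMLInputSource in = new XMLInputSource(
                    ClassLoader.getSystemResourceAsStream(aeXmlDescriptor), null);
            ResourceSpecifier aSpecifier =
                    UIMAFramework.getXMLParser().parseResourceSpecifier(in);

            // Point 4: point the ResourceManager at the unjarred resources
            // so that imports by name in the descriptors can be resolved.
            String str = ClassLoader.getSystemResource(aeXmlDescriptor).getPath();
            ResourceManager rMng = UIMAFramework.newDefaultResourceManager();
            rMng.setExtensionClassPath(str, true);
            rMng.setDataPath(str);

            return UIMAFramework.produceAnalysisEngine(aSpecifier, rMng, null);
        }
    }

The mapper's aEngine field would then be initialized with UimaEngineFactory.produce("nameAnnotator.xml") in configure().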


5 Comments

  1. The procedure above works in distributed mode only. In standalone mode the job jar is not extracted to the job_xxxx/jars directory.

    It seems to be working when using relative paths for the descriptors.

  2. Could you please explain more about "Running UIMA Apps on Hadoop", with an example?
    Thanks

  3. Anonymous

    "1. The jar file created should shave all the classes, descriptors of the UIMA app along with the map/reduce and job main class"

    I think you mean "The jar file created should have all the classes...". Is that correct?

  4. The explanation in this article is too general and I can't understand it. Please explain in more detail, with examples.