UIMA Architecture Compliance Issues
This page captures areas where the Apache UIMA implementation is not in sync with the proposed UIMA specification being developed at OASIS. Note that the UIMA specification is still in the very early stages of development and may evolve.
Section number references refer to the UIMA Specification Proposal Whitepaper available at http://domino.research.ibm.com/library/cyberdig.nsf/papers/1898F3F640FEF47E8525723C00551250/$File/rc24122.pdf.
Note: Much of the content of this page is copied verbatim from the Whitepaper.
Type System Representation Language and CAS Terminology
(See Section 5.1.2)
Apache UIMA implemented its own type-system representation language before Ecore was adopted for the UIMA specification. The original Apache UIMA type-system representation uses somewhat different terminology:
- Apache UIMA does not use the term "Class", instead using the term "Type" to refer to both primitive and nonprimitive types.
- The type of a feature is referred to in Apache UIMA as the feature's "range type".
- Apache UIMA uses the term "FeatureStructure" to refer to what this specification calls an "Object."
Also, Apache UIMA's original type-system representation language does not directly implement multi-valued features. Instead it has explicit array and list types.
The current version of Apache UIMA now supports the proposed UIMA standard type-system language (i.e., Ecore). We use Ecore "Annotations" (arbitrary tags attached to Ecore model elements) to record Apache UIMA-specific information, such as whether an array or list will be used to implement a multi-valued feature.
XMI Serialization of Arrays and Lists
(See Section 126.96.36.199)
As noted previously, Apache UIMA's original type-system representation language does not directly implement multi-valued features. Instead it has array and list types. For a feature whose range type is one of Apache UIMA's array or list types, it is usually appropriate to serialize this to XMI as a multi-valued feature. However, since arrays and lists are first-class objects in Apache UIMA, it is possible to have multiple references to the same array or list, which is not compatible with the multi-valued feature representation.
To address this, the Apache UIMA Type System Description has an additional attribute multipleReferencesAllowed that can be set for a feature. An array or list with multipleReferencesAllowed = false (the default) is serialized as a multi-valued feature in XMI. An array or list with multipleReferencesAllowed = true is serialized as a separate object and referenced from the containing object.
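The serialization rule can be illustrated with a minimal Java sketch. It is a toy model, not the Apache UIMA serializer: the class ArraySerializationSketch, the method serializeFeature, the feature name "scores", and the attribute syntax of the output strings are all assumptions made for illustration.

```java
import java.util.List;

// Toy model of the rule above; not Apache UIMA code. All names are hypothetical.
final class ArraySerializationSketch {

    // Returns an illustrative XMI-like attribute for a feature whose value is an array.
    // arrayId is the id the array would carry if it were written as a separate object.
    static String serializeFeature(String featureName, List<String> elements,
                                   boolean multipleReferencesAllowed, int arrayId) {
        if (multipleReferencesAllowed) {
            // The array may be shared, so emit only a reference to the separate object.
            return featureName + "=\"" + arrayId + "\"";
        }
        // Default case: inline the elements as a multi-valued feature.
        return featureName + "=\"" + String.join(" ", elements) + "\"";
    }

    public static void main(String[] args) {
        List<String> values = List.of("1", "2", "3");
        System.out.println(serializeFeature("scores", values, false, 42));
        System.out.println(serializeFeature("scores", values, true, 42));
    }
}
```

With the flag left at false the elements are inlined and the array has no identity of its own in the output, which is why a second reference to the same array could not be represented.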
Apache UIMA v1.4 and later support the proposed UIMA standard type-system language (i.e., Ecore). This ambiguity does not arise for type systems developed directly in Ecore.
Type Definitions for Sofas
(See Section 5.3.3)
The UIMA Spec does not define Sofa as a type in the type system. Instead, conceptually any slot on any object in the CAS could be a subject of analysis.
To implement this, the UIMA Spec defines the type SofaReference. An Annotation's sofa feature points to an instance of type SofaReference. There are two subtypes: a LocalSofaReference is a reference to a slot of another object in the CAS (it has two fields: an object reference and a string slot name), and a RemoteSofaReference is a URI to content that is not contained in the CAS.
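As a rough illustration, the SofaReference types could be rendered as plain Java classes. This is only a sketch: the spec defines these as types in the type system, not as Java classes, and the wrapper class SofaReferenceSketch, the field names, the describe helper, and the example URI are our assumptions.

```java
import java.net.URI;

// Illustrative stand-ins for the spec's SofaReference types; not Apache UIMA classes.
final class SofaReferenceSketch {

    abstract static class SofaReference { }

    static final class LocalSofaReference extends SofaReference {
        final Object object;   // the CAS object that owns the slot
        final String slotName; // name of the slot holding the subject of analysis

        LocalSofaReference(Object object, String slotName) {
            this.object = object;
            this.slotName = slotName;
        }
    }

    static final class RemoteSofaReference extends SofaReference {
        final URI uri; // location of content held outside the CAS

        RemoteSofaReference(URI uri) {
            this.uri = uri;
        }
    }

    // Hypothetical helper showing how a consumer might tell the two subtypes apart.
    static String describe(SofaReference ref) {
        if (ref instanceof LocalSofaReference) {
            return "local slot '" + ((LocalSofaReference) ref).slotName + "'";
        }
        if (ref instanceof RemoteSofaReference) {
            return "remote content at " + ((RemoteSofaReference) ref).uri;
        }
        throw new IllegalArgumentException("unknown SofaReference subtype");
    }

    public static void main(String[] args) {
        Object document = new Object(); // stands in for some object in the CAS
        System.out.println(describe(new LocalSofaReference(document, "text")));
        System.out.println(describe(new RemoteSofaReference(URI.create("http://example.org/doc.txt"))));
    }
}
```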
Regional References for Annotations
(See section 188.8.131.52)
The UIMA spec suggests, but does not mandate, the use of an extensible RegionalReference type. For example subtypes might be AudioRegionalReference. An annotation type such as PersonAnnotation could refer to either kind of regional reference. See the spec for a detailed discussion of the pros and cons of this approach.
Apache UIMA does not implement a separate RegionalReference type. Instead, for text annotations Apache UIMA defines a type named uima.tcas.Annotation that contains the features begin and end. These are intended to represent offsets into the text string specified by the annotation's sofa feature. The type uima.tcas.Annotation, however, is not extensible to non-text artifacts. Furthermore, the begin and end features count UTF-16 code units, which is inconvenient for anyone using UTF-8, for example.
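The inconvenience for UTF-8 users can be seen in a small Java sketch. Java strings are indexed in UTF-16 code units, so a begin or end value can be used directly with substring, but it has to be converted before it can address bytes in a UTF-8 encoding of the same text. The class and method names are hypothetical, and the conversion assumes the offset does not fall between the two halves of a surrogate pair.

```java
import java.nio.charset.StandardCharsets;

// Sketch only: converts annotation offsets from UTF-16 code units to UTF-8 byte offsets.
final class OffsetConversionSketch {

    // Counts the UTF-8 bytes needed to encode everything before the given UTF-16 offset.
    // Assumes the offset does not split a surrogate pair.
    static int utf16ToUtf8Offset(String text, int utf16Offset) {
        return text.substring(0, utf16Offset).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "na\u00EFve caf\u00E9"; // contains two non-ASCII characters
        int begin = 6;  // UTF-16 offset where the second word starts
        int end = 10;   // UTF-16 offset just past the end of the text
        System.out.println("UTF-16 span: " + begin + ".." + end);
        System.out.println("UTF-8 span:  " + utf16ToUtf8Offset(text, begin)
                + ".." + utf16ToUtf8Offset(text, end));
    }
}
```

Each conversion rescans the text from the start, so a consumer that needs many offsets would probably build a lookup table once instead.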
(See Section 184.108.40.206)
UIMA Spec Definition of a View: A View is a named collection of objects in a CAS. In general a view can represent any subset of the objects in the CAS for any purpose. It is intended however that Views represent different perspectives of the artifact represented by the CAS. Each View is intended to partition the artifact metadata to capture a specific perspective.
UIMA Spec Definition of an Anchored View: A common and intended use for a View is to contain metadata that is associated with a specific interpretation or perspective of an artifact. An application, for example, may produce an analysis of both the XML tagged view of a document and the de-tagged view of the document.
An AnchoredView is a subtype of View that has a named association with exactly one particular object via the standard feature sofa.
An AnchoredView requires that all Annotation objects that are members of the AnchoredView have their sofa feature refer to the same SofaReference that is referred to by the View's sofa feature.
Simply put, all annotations in an AnchoredView annotate the same subject of analysis.
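The anchored view constraint amounts to a simple check, sketched here in Java. The Annotation class and the method name are hypothetical stand-ins, and a sofa is modeled as an arbitrary object compared by identity; this is not how Apache UIMA enforces the constraint.

```java
import java.util.List;

// Toy check of the anchored view constraint; not Apache UIMA code.
final class AnchoredViewSketch {

    // Hypothetical annotation: only the sofa it refers to matters for this check.
    static final class Annotation {
        final Object sofa;

        Annotation(Object sofa) {
            this.sofa = sofa;
        }
    }

    // True if every member annotation refers to the same sofa object as the view.
    static boolean satisfiesAnchoredViewConstraint(Object viewSofa, List<Annotation> members) {
        for (Annotation annotation : members) {
            if (annotation.sofa != viewSofa) { // identity: must be the very same reference
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Object taggedText = new Object();
        Object detaggedText = new Object();
        List<Annotation> members = List.of(new Annotation(taggedText), new Annotation(detaggedText));
        System.out.println(satisfiesAnchoredViewConstraint(taggedText, members));
    }
}
```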
Summary of view-related differences between UIMA Spec and Apache UIMA:
1. In Apache UIMA all views are AnchoredViews. The sofa feature of a View points to an instance of the Sofa type. There is exactly one View per Sofa. The intention is that a View contains all objects that are relevant to its Sofa.
2. Apache UIMA enforces the anchored view constraint (that all annotations in the view refer to the same sofa as the view itself), and most Apache UIMA analytics rely on the assumption that the constraint is satisfied.
3. Apache UIMA defines CAS APIs that operate specifically on Views. For example there is a method CAS.getView(Sofa) through which an annotator can get the View containing objects relevant to a particular Sofa.
4. Apache UIMA defines view indexes, which provide efficient iteration over the members of the view according to a sort order defined declaratively by the user. (Note this is an index over the contents of a single CAS View, and is not the same as for example an inverted file index that indexes the contents of multiple CASes.) For example Apache UIMA annotators frequently use view indexes to iterate over Annotations in a text document, in order from the beginning of the document to the end.
5. In Apache UIMA, the API to a CAS is the same as the API to a View. (This can be done because both are collections of objects.) This is not ideal since it blurs the distinction between a CAS and a View. The intended distinction is that a CAS contains Views. Apache UIMA may be made to more closely reflect the proposed UIMA specification by providing a View interface which is distinct from the CAS interface. For example the CAS interface could provide getView() methods but not indexes, while Views do the opposite.
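The idea of a view index with a declaratively defined sort order (item 4) can be sketched in a few lines of Java. This is a toy model: the Span class, the index method, and the particular ordering (begin ascending, then end descending so that enclosing spans come first) are assumptions chosen for illustration rather than a description of the Apache UIMA index implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model of a sorted index over the members of one view; not Apache UIMA code.
final class ViewIndexSketch {

    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }

        @Override
        public String toString() {
            return begin + "-" + end;
        }
    }

    // The "declared" sort order of this sketch: begin ascending, then end descending.
    static final Comparator<Span> DOCUMENT_ORDER =
            Comparator.comparingInt((Span s) -> s.begin)
                    .thenComparing(Comparator.comparingInt((Span s) -> s.end).reversed());

    // Returns the members in index order without modifying the input list.
    static List<Span> index(List<Span> members) {
        List<Span> sorted = new ArrayList<>(members);
        sorted.sort(DOCUMENT_ORDER);
        return sorted;
    }

    public static void main(String[] args) {
        List<Span> members = List.of(new Span(5, 9), new Span(0, 4), new Span(0, 12));
        for (Span span : index(members)) {
            System.out.println(span); // visited from the start of the document to the end
        }
    }
}
```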
Behavioral Metadata (Capabilities)
(See Section 5.4)
UIMA Spec's Definition
The UIMA Spec does not yet have a final proposal for the exact language for representing the behavior of a component. However, it does propose that behavioral metadata consists of the following kinds of declarations:
1. Precondition: A predicate that qualifies CASs that the analytic considers valid input. More precisely the analytic's behavior would be considered unspecified for any CAS that did not satisfy the pre-condition. The pre-condition may be used by a framework or application to filter or skip CASs routed to an analytic whose pre-condition is not satisfied by the CASs. A human assembler or automated composition process can interpret the pre-conditions to determine if the analytic is suitable for playing a role in some aggregate composition. For example, if the pre-condition requires that valid input CASs contain People, Places and Organizations and the assembler knows that they will not, then the analytic is clearly not suitable for the intended operation.
2. Capability: A description of the intended effects of the analytic's operation on subsets of valid input CASs. The description need not completely specify analytic behavior but rather describe the results that the analytic is capable of providing. The capability description may break down into the following parts:
a. Analyzes: A predicate that defines the subjects of analysis (sofas) that the analytic can analyze. For example, an analytic may declare that it analyzes instances of type Person, or it may declare that it analyzes instances of type Sofa whose mimeTypeMajor is "text". This expression may identify a single object or a collection of objects.
b. Inspects: A predicate that identifies the collection of objects which the analytic may consult while doing its analysis. If an object is NOT a member of the inspects or analyzes predicates, then a framework or application is permitted to filter this information (perhaps as an optimization for remote transport of the CAS). The inspects predicate may specify that all content in the CAS will be inspected.
c. Creates: An expression that identifies objects that an analytic may create as a result of its analysis. For example, an analytic may declare it creates instances of type Organization with their sector feature equal to "Financial".
d. Modifies: An expression that identifies objects and/or slots that an analytic may modify.
e. Deletes: An expression that identifies objects that an analytic may delete.
3. Post-Condition: An analytic developer should be able to declare a post-condition that the developer asserts will be true of any CAS after having been processed by the analytic, assuming that the CAS satisfied the precondition when it was input to the analytic.
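To make the roles of the precondition and postcondition concrete, here is a minimal Java sketch of how a framework might use them. Since the spec has not settled on a representation language, everything here is assumed: a CAS is reduced to the set of type names it contains, the predicates are ordinary Java lambdas, and the "WorksFor" type and all class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Toy model of precondition / postcondition handling; not a UIMA API.
final class BehavioralMetadataSketch {

    // Toy analytic: a CAS is modeled as the set of type names that have instances in it.
    static final class Analytic {
        final Predicate<Set<String>> precondition;
        final UnaryOperator<Set<String>> process;
        final Predicate<Set<String>> postcondition;

        Analytic(Predicate<Set<String>> precondition, UnaryOperator<Set<String>> process,
                 Predicate<Set<String>> postcondition) {
            this.precondition = precondition;
            this.process = process;
            this.postcondition = postcondition;
        }
    }

    // Routes a CAS to the analytic only if the precondition holds; otherwise skips it.
    static Set<String> route(Analytic analytic, Set<String> cas) {
        if (!analytic.precondition.test(cas)) {
            return cas; // skipped: the analytic's behavior is unspecified for this input
        }
        Set<String> result = analytic.process.apply(cas);
        if (!analytic.postcondition.test(result)) {
            throw new IllegalStateException("analytic violated its declared postcondition");
        }
        return result;
    }

    // Hypothetical analytic that needs Person and Organization and creates WorksFor.
    static Analytic relationFinder() {
        return new Analytic(
                cas -> cas.contains("Person") && cas.contains("Organization"),
                cas -> {
                    Set<String> out = new HashSet<>(cas);
                    out.add("WorksFor");
                    return out;
                },
                cas -> cas.contains("WorksFor"));
    }

    public static void main(String[] args) {
        System.out.println(route(relationFinder(), Set.of("Person", "Organization")).contains("WorksFor"));
        System.out.println(route(relationFinder(), Set.of("Person")).contains("WorksFor"));
    }
}
```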
On Multiple Capability Sets
The UIMA Spec has this to say:
We have considered whether we should allow components to specify sets of (Precondition, Capability, Postcondition) declarations. That is, if the CAS satisfies Precondition 1, the component can perform Capability 1; if the CAS satisfies Precondition 2, the component can perform a different Capability 2.
The trouble with this approach is that we would need to specify what happens if a CAS satisfies more than one precondition. Are all operations performed? If so, in what order are they performed? Similarly, what if a CAS originally satisfied only Precondition 1, but after the analytic had performed Capability 1 the CAS now satisfied Precondition 2? Should the analytic now perform Capability 2?
These issues hinder composition because it is no longer clear what the precondition and post condition are for the analytic as a whole. Therefore we have proposed only a single (Precondition, Capability, Postcondition) declaration for each analytic.
Allowing multiple capability sets would require a more complex behavioral model, and that burden would be transmitted to the application or flow controller that wanted to consider the behavioral metadata.
If we wish to allow multiple sets of capabilities then we need to extend the specification so that it provides explicit answers to the questions in the second paragraph of this discussion point.
Comparison to Apache UIMA
Apache UIMA capability specifications are able to express the following kinds of conditions in each of the categories:
1. Preconditions: The only precondition that Apache UIMA supports is languagesSupported, which is a check against the language feature of the built-in type uima.tcas.DocumentAnnotation.
2. Analyzes: can specify multiple "input sofas" to a component. The names declared by the analytic are matched against the sofaID feature of the built-in type uima.cas.Sofa.
3. Inspects: can specify the names of types and features that the analytic will inspect.
4. Creates: can specify the names of types that the analytic may create.
5. Modifies: can specify the names of features that the analytic may modify.
6. Deletes: cannot be expressed
7. Postcondition: cannot be expressed
Apache UIMA analytics can declare multiple capabilities. The reason for this is to allow an analytic to declare different creates/modifies statements for different languagesSupported preconditions. This may be an issue if the UIMA spec decides not to allow multiple preconditions in a single analytic.
Further Issues with the Apache UIMA Capability Representation
In the following Apache UIMA capability specification:
It is an implicit assumption that the ex.Person and ex.Place objects will be members of the View named "SomeInputSofaName". (Or more precisely, the unique View associated with the Sofa that has that name.)
We can make the Apache UIMA semantics explicit by specifying the exact mapping from this capability representation to a set of OCL expressions.
Restricting the Subjects of Analysis (Sofa Mapping)
(See Section 5.4.4)
UIMA Spec's Definition
Note that the analyzes predicate of the Behavioral Specifications qualifies objects the analytic is capable of operating on. At runtime, an application or aggregate that calls the analytic may wish to direct the analytic to process only a particular set of objects that satisfy the analytic's analyzes predicate.
Handles declared in an Analytic's behavioral specification provide a hook whereby the caller of analytic may bind a specific set of objects to the handle. For example, consider an analytic that declares in its behavioral metadata:
<analyzes handle="ex1Analyzes"> select(s |
This analytic is declaring that it is capable of processing any instance of ex:TextDocument, and that it will use the handle ex1Analyzes to refer to the set of ex:TextDocument instances that it will analyze.
When we define the Analytic interface (see Section 5.6), we will provide a way for the caller of the analytic to specify that the handle ex1Analyzes should be bound to a particular set of ex:TextDocument instances that the caller wants the analytic to process.
For each analyzes predicate that the analytic defines, it may declare a different handle, which is a local name that identifies that analyzes predicate to this analytic. This allows the caller to specify a different set of objects to be bound to each analyzes predicate.
This binding of handles to objects by the caller serves two primary purposes:
1. It allows a framework or caller to provide a convenience to the analytic developer. Note that a framework may already evaluate OCL expressions in the analytic's behavioral spec in order to determine if the analytic's precondition is met. In that scenario it makes sense for the framework to make the results of that evaluation available to the analytic rather than force the analytic to recompute them.
2. It enables the caller to further restrict the collection bound to a handle. The need for this was discussed in the Requirements section above. For example in the XML above, this component declares that it can analyze any instance of ex::TextDocument. The caller may wish, however, to have only one particular instance of ex::TextDocument analyzed. The caller can indicate this by binding the ex1Analyzes handle to just that particular ex::TextDocument instance.
It only makes sense for the caller to bind input handles such as analyzes or inspects. It would not make sense for a caller to bind objects to a creates handle, since that handle refers to a set of objects produced by the analytic.
Declaration of handles is optional. If the analytic does not declare any handles then the caller cannot specify bindings. Also it is optional for a caller to provide bindings. If the bindings are not computed and sent along with the CAS, then the analytic must locate the required collections itself (using whatever APIs are provided to the CAS for example).
Candidate Compliance Point: A UIMA component/framework may be required to accept input CASes that do not include handle bindings. However, if handle bindings are provided, a UIMA compliant component/framework may be required to use them (e.g., to restrict its processing to only those objects that the caller has bound to the handle).
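The candidate compliance point can be paraphrased as a small Java sketch. The spec does not yet define how bindings are passed, so the representation is an assumption: documents are plain strings, bindings are a map from handle name to the objects chosen by the caller, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model of handle bindings; the binding format is not defined by the spec.
final class HandleBindingSketch {

    // Returns the objects the analytic should process for one analyzes handle.
    // casDocuments: every object in the CAS that satisfies the analyzes predicate.
    // bindings: handle name -> objects selected by the caller; the handle may be absent.
    static List<String> resolve(String handle, List<String> casDocuments,
                                Map<String, List<String>> bindings) {
        List<String> bound = bindings.get(handle);
        if (bound == null) {
            // No binding supplied: the analytic locates the collection itself.
            return casDocuments;
        }
        // Binding supplied: restrict processing to the objects the caller bound.
        List<String> restricted = new ArrayList<>(casDocuments);
        restricted.retainAll(bound);
        return restricted;
    }

    public static void main(String[] args) {
        List<String> documents = List.of("doc1", "doc2");
        System.out.println(resolve("ex1Analyzes", documents, Map.of()));
        System.out.println(resolve("ex1Analyzes", documents, Map.of("ex1Analyzes", List.of("doc2"))));
    }
}
```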
Comparison to Apache UIMA
Apache UIMA also provides Sofa Mapping as part of its aggregate specification, which is a kind of instance-level CAS data mapping that satisfies this "handle" requirement by allowing the aggregate assembler to map any Sofa in the CAS to the name expected by the analytic. Sofa mappings also allow an aggregate to guarantee unique Sofa names even if analytics create Sofa objects with the identical names.
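The effect of a Sofa mapping can be sketched as a lookup from the name a component uses to the name the aggregate uses. This is a toy model, not the Apache UIMA descriptor format or API: the component keys, the sofa names, the "key/name" encoding, and the pass-through behavior for unmapped names are all assumptions of the sketch.

```java
import java.util.Map;

// Toy model of aggregate-level Sofa mapping; not the Apache UIMA implementation.
final class SofaMappingSketch {

    // Looks up the aggregate-level sofa name for a name used inside one component.
    // In this sketch, names without a mapping pass through unchanged.
    static String resolve(Map<String, String> mappings, String componentKey, String componentSofaName) {
        return mappings.getOrDefault(componentKey + "/" + componentSofaName, componentSofaName);
    }

    public static void main(String[] args) {
        // Two hypothetical components both call their result "output"; the aggregate keeps them apart.
        Map<String, String> mappings = Map.of(
                "Translator/input", "EnglishText",
                "Translator/output", "GermanText",
                "Summarizer/input", "GermanText",
                "Summarizer/output", "GermanSummary");
        System.out.println(resolve(mappings, "Translator", "output"));
        System.out.println(resolve(mappings, "Summarizer", "output"));
    }
}
```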
Elements of Component Descriptor
(See Section 5.5.1)
Currently the Apache UIMA Component Metadata Descriptor includes the following elements that are not part of the proposed UIMA Specification.
1. Indexes: Defines the structure of indexes through which the analytic will access data. In some sense the actual indexing design is an Apache UIMA issue and so this may be an extension to the descriptor schema that is specific to Apache UIMA. However if we think of the index definitions as a component declaring the key features that it is going to use to query the data, we can make a case that this should be a UIMA standard, so that any framework could optimize based on this information.
2. Type Priorities: These are closely related to the index definitions and should probably be combined with them rather than represented as a separate element.
3. External Resources: The core concept of external resource dependencies is captured using the "ResourceURL" configuration parameter type, discussed above. Other details of Apache UIMA's external resource mechanism are framework-dependent and not covered in the UIMA spec.
4. Configuration Parameter Settings: Default values for parameters are becoming part of the configuration parameter declarations. Specifying non-default values should not be done as part of the descriptor.
5. Operational Properties: (modifiesCas, outputsNewCASes, multipleDeploymentAllowed): These should be covered by the Behavioral Specification. The first two are fairly straightforward. The "multipleDeploymentAllowed" property states whether the component is "parallelizable". Usually components that maintain state across input CASes are not parallelizable and can't be multiply deployed. Will that be covered by the behavioral spec? We make significant use of this property in the Apache UIMA
(See Section 220.127.116.11)
UIMA Spec's Definition
For each configuration parameter we should allow the PE developer to specify:
1. The name of the parameter
2. A description for the parameter
3. The type of value that the parameter may take
4. Whether the parameter accepts multiple values or only one
5. Whether the parameter is mandatory
6. A default value or values for the parameter
One common use of configuration parameters is to refer to external resource data, such as files containing patterns or statistical models. Frameworks such as Apache UIMA may wish to provide additional support for such parameters, such as resolution of relative URLs (using classpath/datapath) and/or caching of shared data. It is therefore important for the UIMA configuration parameter schema to be expressive enough to distinguish parameters that represent resource locations from parameters that are just arbitrary strings.
We propose that the type of a parameter must be one of the following:
- Integer (32-bit)
- Float (32-bit)
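The six properties of a parameter declaration map naturally onto a small data class. The following Java sketch is illustrative only: the class names, the effectiveValues helper, and its validation rules are our assumptions, and the Type enum lists only the types named on this page rather than the spec's full list.

```java
import java.util.List;

// Toy model of a configuration parameter declaration; not a UIMA API.
final class ParameterDeclarationSketch {

    // Only the types named on this page; the spec's full list is not reproduced here.
    enum Type { INTEGER, FLOAT, RESOURCE_URL }

    static final class Declaration {
        final String name;
        final String description;
        final Type type;
        final boolean multiValued;
        final boolean mandatory;
        final List<String> defaults; // empty when no default is declared

        Declaration(String name, String description, Type type,
                    boolean multiValued, boolean mandatory, List<String> defaults) {
            this.name = name;
            this.description = description;
            this.type = type;
            this.multiValued = multiValued;
            this.mandatory = mandatory;
            this.defaults = defaults;
        }
    }

    // Returns the supplied values, or the defaults if none were supplied.
    // Throws if the result contradicts the declaration.
    static List<String> effectiveValues(Declaration d, List<String> supplied) {
        List<String> values = supplied.isEmpty() ? d.defaults : supplied;
        if (values.isEmpty() && d.mandatory) {
            throw new IllegalArgumentException("missing mandatory parameter: " + d.name);
        }
        if (values.size() > 1 && !d.multiValued) {
            throw new IllegalArgumentException("parameter is single-valued: " + d.name);
        }
        return values;
    }

    public static void main(String[] args) {
        Declaration maxResults = new Declaration("MaxResults", "Upper bound on results.",
                Type.INTEGER, false, true, List.of("10"));
        System.out.println(effectiveValues(maxResults, List.of()));
        System.out.println(effectiveValues(maxResults, List.of("25")));
    }
}
```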
Comparison to Apache UIMA Implementation
Apache UIMA has a more extensive schema that allows for "configuration groups". For example this feature can be used to allow an annotator to use a different pattern file for English documents than for German documents. The annotator's descriptor would declare groups named "en" and "de" each containing a "PatternFile" parameter, like this:
<configurationGroup names="en de">
<description>Location of external file containing additional patterns to search for.</description>
The Apache UIMA API then allows an application to set a different value for this parameter in the "en" group than in the "de" group.
This feature does not get much use in Apache UIMA and adds a lot of complexity to framework implementations, so we have proposed leaving it out of the UIMA specification.
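The extra dimension that configuration groups add can be seen in a toy Java model, where a setting is addressed by group name and parameter name rather than by parameter name alone. This is not the Apache UIMA API: the class, the lookup method, and the file names are hypothetical, and the sketch deliberately has no fallback between groups.

```java
import java.util.Map;

// Toy model of per-group parameter settings; not the Apache UIMA API.
final class ConfigurationGroupSketch {

    // Returns the value of a parameter within one group, or null if it is not set there.
    static String lookup(Map<String, Map<String, String>> groups, String group, String parameter) {
        Map<String, String> settings = groups.get(group);
        return settings == null ? null : settings.get(parameter);
    }

    public static void main(String[] args) {
        // Hypothetical file names, one per language group.
        Map<String, Map<String, String>> groups = Map.of(
                "en", Map.of("PatternFile", "patterns_en.txt"),
                "de", Map.of("PatternFile", "patterns_de.txt"));
        System.out.println(lookup(groups, "en", "PatternFile"));
        System.out.println(lookup(groups, "de", "PatternFile"));
    }
}
```

Even in this reduced form every lookup needs a group argument, which gives some sense of the complexity the feature pushes onto framework implementations.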
Type System Reference
(See Section 18.104.22.168)
The UIMA Spec proposes that an analytic descriptor should reference its type system via a URI. This is different than the Apache UIMA implementation, as follows:
1. Apache UIMA allows type systems to be defined directly inside an analytic descriptor, as well as by reference.
2. For Apache UIMA remote services, references to type systems are resolved during service deployment and "included" directly into the descriptor. When the service sends its metadata to a client, it sends the descriptor that directly includes the entire type system definition. Therefore the client never needs to initiate a second request to obtain the type system.
3. Apache UIMA has an "import" construct that can be used not only for type systems but also many other parts of the descriptor that may be reusable. Imports can be "by location" or "by name". An import by location is a URL; if the URL is relative then it is resolved relative to the descriptor containing the import. An import by name is a dotted name (as in a Java classname) that is looked up in the Java classpath. Several users have found this classpath lookup very useful and a natural way to do things in Java. Are we now requiring URIs instead? Perhaps it is sufficient to use relative URLs in Apache UIMA descriptors (thus complying with the UIMA spec), and for Apache UIMA to resolve those relative URLs against the classpath or datapath.
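The two import styles in item 3 can be contrasted in a short Java sketch using only java.net.URI. It is a simplified illustration, not the Apache UIMA resolver: the class and method names and the example paths are hypothetical, and the rule that a dotted name becomes a slash-separated resource path ending in ".xml" is an assumption of the sketch.

```java
import java.net.URI;

// Simplified illustration of the two import styles; not the Apache UIMA resolver.
final class ImportResolutionSketch {

    // Import by location: a relative URL is resolved against the importing descriptor's URL.
    static URI resolveByLocation(URI importingDescriptor, String location) {
        return importingDescriptor.resolve(location);
    }

    // Import by name: the dotted name is turned into a classpath resource path.
    // The ".xml" suffix is an assumption of this sketch.
    static String resourcePathForName(String dottedName) {
        return dottedName.replace('.', '/') + ".xml";
    }

    public static void main(String[] args) {
        URI descriptor = URI.create("file:/opt/app/desc/MyAnnotator.xml");
        System.out.println(resolveByLocation(descriptor, "types/MyTypes.xml"));
        System.out.println(resourcePathForName("org.example.MyTypes"));
    }
}
```

A resource path produced this way could then be handed to a class loader lookup, which is what makes the by-name style independent of where the descriptor files are installed.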
(see section 5.6)
UIMA Spec's Definition
This diagram defines the abstract interfaces to the various types of UIMA components (collectively called Processing Elements). Refer to the spec for descriptions.
Apache UIMA binds the UIMA abstract interfaces to Java interfaces which may then be implemented by Apache UIMA component developers.
Apache UIMA uses the term "AnalysisEngine" where the UIMA Spec uses the term "Analytic". Apache UIMA further specializes the Analytic interfaces into different component types:
1. Analyzer is specialized to:
- Annotator, for analyzers that modify the CAS
- CasConsumer, for analyzers that do not modify the CAS
2. CasMultiplier is specialized to:
- CollectionReader, for CAS Multipliers that produce CASes that each represent an artifact from a collection.
The Apache UIMA FlowController interface also introduces a slightly different programming model. Apache UIMA defines a method FlowController.computeFlow(CAS), which is called when a new CAS first enters the aggregate. The computeFlow method returns an object of type Flow. The Flow object is dedicated to routing a particular CAS. The Flow interface defines a next() method which returns the next destination for this CAS (it can consult any information in the CAS to make this decision). With this programming model developers of Flow Controllers are insulated from the complexity of multiple CASes potentially flowing through an aggregate at the same time. However, we did not want to mandate the use of this programming model throughout all UIMA implementations, so a simpler FlowController interface is defined here. The Apache UIMA implementation can be easily adapted to the proposed standard UIMA interface.
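The programming model can be sketched with two small local interfaces. These are stand-ins that borrow the names used above, not the Apache UIMA interfaces: here a CAS is just a list of strings, next() returns a destination key or null when routing is finished, and the fixedFlow and route helpers are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Local stand-ins for the concepts in the text; not the Apache UIMA interfaces.
final class FlowSketch {

    interface Flow {
        String next(); // next destination for this CAS, or null when routing is done
    }

    interface FlowController {
        Flow computeFlow(List<String> cas); // called once when a new CAS enters the aggregate
    }

    // A fixed-order controller: each CAS gets its own Flow object holding its own position.
    static FlowController fixedFlow(List<String> analyticKeys) {
        return cas -> {
            Iterator<String> remaining = analyticKeys.iterator();
            return () -> remaining.hasNext() ? remaining.next() : null;
        };
    }

    // Drives one CAS through the aggregate and records the destinations it visits.
    static List<String> route(FlowController controller, List<String> cas) {
        List<String> visited = new ArrayList<>();
        Flow flow = controller.computeFlow(cas);
        for (String destination = flow.next(); destination != null; destination = flow.next()) {
            visited.add(destination);
        }
        return visited;
    }

    public static void main(String[] args) {
        FlowController controller = fixedFlow(List.of("Tokenizer", "Tagger"));
        System.out.println(route(controller, List.of("some document text")));
    }
}
```

Because the routing state lives in the per-CAS Flow object, the controller itself needs no bookkeeping when several CASes are in flight at once.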
Also, the Apache UIMA FlowController interface permits the FlowController to modify the CAS, whereas the Flow Controller interface in this specification does not.
Finally, note that all process methods take a CAS as input; nowhere in the UIMA spec does it say that a process method can take a single View as input instead.
WSDL Service Interfaces
(See Section 5.7)
The actual WSDL definitions are just a very early proposal so it does not make sense to try to implement to this exact specification at this time. However, we should be aware that this will be required once the spec matures.