Background
This is a collection of ideas, discussed on the mailing lists and elsewhere, about configuring UIMA pipelines using an external configuration specification. The specification is associated with the top level of a running UIMA pipeline, and it overrides the other, conventional ways of configuring pipelines within the various annotator engine descriptors.
What running scenarios are contemplated?
The normal run scenario is one where you start up a pipeline, it initializes, and processes CASes and eventually finishes.
Alternative: the pipeline is reconfigured during the run. Parameters can already be reconfigured by APIs; if a need arises, a similar API could be provided later to respecify the list of configuration settings files and reload them.
What is configuration?
Configuration is a collection of things, set for a particular UIMA pipeline run. It can include conventional UIMA parameter settings, as well as other kinds of settings such as "placeholder" values that are substituted into UIMA deployment descriptors, debug flags, CAS-dump flags, the logging specification for the run, etc.
Configuring simple and complex values
Configuration parameters in UIMA have types like integer, float, string, boolean, or arrays of these. UIMA provides for arbitrarily complex Java Objects as configuration values, using the External Resource specification. The UIMA External Resources design also allows "sharing" of these complex objects among multiple annotators. Similarly, the normal configuration parameters allow "sharing" - that is, the UIMA parameter override design lets one parameter setting be connected to multiple parameters, down the nested hierarchy of annotators.
The same use cases that motivate the setting of UIMA configuration parameters also motivate a (hopefully) similar mechanism for overriding external resource specifications.
Orthogonal issues and considerations in configuring UIMA pipelines
Where to put this information
There is a continuum of places, ranging from least dynamic (most training required) to most dynamic (least training required):
- code (least dynamic, need to understand the code and where to go to modify what you want, requires most training)
- UIMA descriptors (still quite complex; have some special GUI tools for editing them)
- Structured properties files (like Jar Manifests - multiple sections, each section having key-value pairs of strings)
- "properties" files (simple key-value pair strings)
- JVM command line "defines" parameters - individual key-value pairs
Experience shows that for doing "runs", people find code and descriptors complex and hard to comprehend or change, and prefer simpler, more focused ways to specify things for the run. More than one of these approaches can be provided; if several are supported, a conventional (no-surprise) rule is needed for which overrides which.
JMX
In addition to the above, JMX settings may be desired.
Using the JVM command line as a source of configuration information
Putting configuration into the command line ties a "run" of a pipeline to a "run" of a JVM. There are cases (e.g., running UIMA inside servlets inside a web application container) where multiple, independent instances of UIMA pipelines may independently start, run, and terminate, all without taking the JVM up and down. This argues for an approach not tied to the JVM command line.
On the other hand, the command line approach is very handy for quick augmentation / overriding of particular runs, where starting/stopping the JVM is an option.
Note that JMX settings could be arranged to be either global, or per "UIMA Context".
Encapsulation versus reaching down inside trees of nested aggregates
The original UIMA design attempts to support encapsulation in an aggregate, for parameter overrides. An aggregate may override parameters that its delegates declare. An aggregate can choose which of these it, in turn, is willing to allow a containing aggregate to be able to override; it can choose to "shield" some parameters, making them incapable of being overridden.
This has complicated the practice of designing large complex nested trees of annotators - in requiring aggregates to expose upwards parameters that the top level may want to override.
An alternative mechanism is wanted in these use cases, to allow "reaching down" from a top level into lower levels, without needing all the intervening levels of aggregation to expose individual parameters. However, some degree of control over this is also desired.
A suggested approach is to augment the configuration parameter definition with an additional property - a "global-name" which, if specified, would enable this reaching down, by having at the top level a key-value pair specification, where the key would be the global-name.
Non-path specification of the global name
A use case is to be able to use parameter specifications for different sub-parts of a big descriptor tree, or for the entire tree, without editing the key name. So - the key-name at the top should not include the path (down the nested hierarchy of aggregates); this allows its reuse even if the hierarchy changes.
Configuration settings - arrays
UIMA supports array-valued settings for configuration parameters. In key-value pair formats, some approach is needed for these.
- Multiple keys: having the same key name repeat, indicating multiple values.
- Multiple keys with conventional suffix (e.g., foo.1, foo.2, foo.3): this could be done, but introduces more opportunities for silly user errors (e.g., origin 0 or 1, etc).
- Single key with special syntax for multiple values - e.g., blank or comma separated, with escaping char ("\"?)
For this last alternative, we could introduce JSON-like notation, e.g., the square brackets surrounding the values list. This notation could obviate the need for continuation characters for multi-line value specifications - the closing square bracket would identify the end. Note that in our use case, the setting we're overriding already is declaring if the value is an array or not, so we don't require a special notation to indicate this in the value.
Configuration settings - Arbitrary Java Objects
UIMA's external resources support configuration settings whose values are arbitrary Java objects. To override these externally, it would be useful to be able to specify the implementation for the Java Object, together with a map of key - string value pairs representing initialization parameters. To support this, some kind of notation could be used for maps. For instance, we could adopt the JSON style - where curly braces would surround nested key - value string pairs, and a reserved key (e.g., com.apache.uima.externalResource.implementationClass) could indicate the implementation class name.
Another possibility is to use indenting: indented lines would be the sub-key-value strings of the map. This covers the main use case with a very simple scheme. On reflection, though, there seem to be possible ambiguities that would require knowing, at read time, the "type" of the value being read (that is, whether you are reading one of these map-like objects). This would be a problem if there is no "match" between the key and the externalOverrideNames.
Other kinds of configuration settings
Other kinds of settings for a UIMA pipeline have been attached to the JVM lifecycle by being specified as -D JVM parameters. Examples of these are the logging properties, UIMA-AS settings for controlling monitoring, UIMA-AS CAS logging, etc.
For consistency, these should have alternatives which are tied to a particular UIMA instance running (for example) as one of many within a container JVM (such as would be the case for multiple servlets, running in a web container).
To allow incorporating other global settings currently specified as -D parameters on the JVM command line, some keys are reserved. These correspond to the names currently used in the -D parameters.
This allows the -D to still be used, but also allows these values to be specified in other ways, i.e., within a top level descriptor.
Computed values via concatenation
A typical use case is to have some parameters be directory paths. In a particular use, several of these may need to have a common root.
These could be written:
param1 : /commonRootString/commonPart2/a1
param2 : /commonRootString/commonPart2/a2
...
or some concatenation could be used:
r : /commonRootString/commonPart2
param1 : ${r}/a1
param2 : ${r}/a2
This is a design trade-off - to support a concatenation-style factoring-out of common parts in the values part of the specification, or not.
Leaving it out in favor of simplicity may make sense, given that today's editors make it very easy to do global changes, and the human eye seems to be OK with seeing spelled-out patterns of repetition.
But if "correct" operation requires that some parts of the configuration specification have exactly the same value, then supporting this kind of thing allows expressing that constraint, and could reduce configuration errors.
Reusable multiple sets of settings
Users want to have settings for some subset of big pipelines, available as separate files, so that these can be reused in other contexts, for instance, when the subset is run separately, or inserted into another pipeline.
Inherited settings
Most systems with lots of configuration settings (e.g., Hadoop, most windowing systems) end up with a capability to have nested hierarchies of setting specifications. This allows putting in a set of defaults for all the settings, in one place, and then specifying an override for just a few settings, in another (often much smaller) file.
The Java Properties class supports this by supporting a chain of key-value maps, each one referring to another map to use if the key is not found in the map. We could use this to support this capability.
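As a minimal sketch of that chaining, the `java.util.Properties` constructor that takes a defaults object can be used directly. The class name `ChainedSettingsSketch` and the key names in the usage note are hypothetical and only serve as illustration.

```java
import java.util.Properties;

/** Sketch: a small override map chained to a larger map of defaults. */
final class ChainedSettingsSketch {

    /** Returns settings in which keys missing from the overrides fall back to the defaults. */
    static Properties chain(Properties defaults, Properties overrides) {
        // Properties(Properties) registers the map that getProperty consults
        // when a key is not found in this map.
        Properties chained = new Properties(defaults);
        for (String key : overrides.stringPropertyNames()) {
            chained.setProperty(key, overrides.getProperty(key));
        }
        return chained;
    }
}
```

With defaults containing `debug = false` and `language = en`, and an override file containing only `debug = true`, `getProperty("debug")` on the chained object yields `true` while `getProperty("language")` still yields `en`.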
Tooling
Tooling should support taking a UIMA pipeline spec and "resolving" what all the parameters and settings would be once all the overrides etc. are done. This should print out a specification, together with information, where useful, on where various settings came from (e.g., via what overrides).
Parts of the framework should log (under the CONFIG level) the actual parameter settings, with where they came from.
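One way to picture this kind of "resolved settings with provenance" report is sketched below. It is not a proposed framework API: the class `ProvenanceSketch`, its methods, and the source names are hypothetical, and `java.util.logging` is used only because it has a CONFIG level.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Sketch: resolve settings from ordered sources and remember where each value came from. */
final class ProvenanceSketch {

    /**
     * Sources are visited in iteration order, so later sources override earlier ones.
     * Each result entry maps a key to { value, name of the source that supplied it }.
     */
    static Map<String, String[]> resolve(LinkedHashMap<String, Map<String, String>> sources) {
        Map<String, String[]> resolved = new TreeMap<>();
        for (Map.Entry<String, Map<String, String>> source : sources.entrySet()) {
            for (Map.Entry<String, String> setting : source.getValue().entrySet()) {
                resolved.put(setting.getKey(),
                        new String[] { setting.getValue(), source.getKey() });
            }
        }
        return resolved;
    }

    /** Logs every resolved setting, with its origin, at the CONFIG level. */
    static void log(Map<String, String[]> resolved, Logger logger) {
        for (Map.Entry<String, String[]> e : resolved.entrySet()) {
            logger.log(Level.CONFIG, "{0} = {1} (from {2})",
                    new Object[] { e.getKey(), e.getValue()[0], e.getValue()[1] });
        }
    }
}
```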
Parameter Groups
The same externalOverrideName(s) can be supplied for individual specifications or for specifications contained within parameter groups. It is expected that nothing special needs to be done to support parameter groups.
External Resources overriding
The current design for external resources has 3 parts: the resource specification, the resource use, and a binding tying these two together. The binding provides an indirection which allows a name mapping between names used in Java code (the key name used in the External Resource Use declaration) and the actual, perhaps shared External Resource Specification.
Overriding an External Resource for a particular UIMA pipeline run should override the External Resource Specification, keeping the sharing structure that might be present in the pipeline description.
Design Specification
Configuration Parameter
Change the configuration parameter declaration to optionally have an externalOverrideName:

    <configurationParameter>
      <name>[String]</name>
      <externalOverrideName>[String]</externalOverrideName> <!-- <<<<< New -->
      <description>[String]</description>
      <type>String|Integer|Float|Boolean</type>
      <multiValued>true|false</multiValued>
      <mandatory>true|false</mandatory>
      <overrides>
        <parameter>[String]</parameter>
        <parameter>[String]</parameter>
        ...
      </overrides>
    </configurationParameter>

If an externalOverrideName is present, the parameter can be overridden from the top level by a setting (if one is specified there) that uses this global name as its key. The name must be suitable as a key in a Java Properties file.
The assumption would be that the publisher of the annotator would not include externalOverrideName specification, but that the assembler, who is putting together multiple annotators, would insert these wherever they needed, with whatever uniqueness in the name, to satisfy the need to expose parameters at the top level, and to share settings (by using the same externalOverrideName value in multiple places).
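The sharing behavior described here can be sketched as a simple lookup. This is an illustration under assumptions, not the framework's implementation: `ExternalOverrideSketch`, `Param`, and the location strings are hypothetical stand-ins for parameter declarations found anywhere in the aggregate tree.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: applying top-level settings to parameters that declare an externalOverrideName. */
final class ExternalOverrideSketch {

    /** Hypothetical stand-in for one parameter declaration in the descriptor tree. */
    static final class Param {
        final String location;             // where the parameter lives, for reporting only
        final String externalOverrideName; // null when the parameter is not exposed
        final String declaredValue;        // value from the descriptor

        Param(String location, String externalOverrideName, String declaredValue) {
            this.location = location;
            this.externalOverrideName = externalOverrideName;
            this.declaredValue = declaredValue;
        }
    }

    /** The key is the global name only; the nesting path plays no part in the lookup. */
    static Map<String, String> effectiveValues(List<Param> params, Map<String, String> settings) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Param p : params) {
            String override = (p.externalOverrideName == null)
                    ? null
                    : settings.get(p.externalOverrideName);
            result.put(p.location, override != null ? override : p.declaredValue);
        }
        return result;
    }
}
```

Two parameters that both declare the externalOverrideName `lang` receive the same value from a single top-level setting `lang`, while a parameter without an externalOverrideName keeps its declared value even if a setting happens to match its local name.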
External Resources Specification
Change this to optionally have an externalOverrideName, with the same behavior as above.

    <externalResource>
      <name>[String]</name>
      <externalOverrideName>[String]</externalOverrideName> <!-- <<<<< New -->
      <description>[String]</description>
      <fileResourceSpecifier>
        <fileUrl>[URL]</fileUrl>
      </fileResourceSpecifier>
      <implementationName>[String]</implementationName>
    </externalResource>

Support simple factoring and concatenation
To permit encoding knowledge that some (parts of) specifications must match others, allow values to include the form ${id-string} to be substituted by looking up id-string as a key, and then concatenating its value with any surrounding string value. For example, if rootDir had the value /a/b/c, then ${rootDir}/file would resolve to /a/b/c/file.
The id-string would need to be a name suitable as a key name. The value of id-string could be anything; however, its substituted value would not be re-scanned for recursive substitution (because the goal here is a simple, clear abbreviation, not a complex programming language construct). Note that the value of the id-string may itself require scanning and substitution, but the scanning of the result string continues after the substitution point. Loops may not be detected.
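A minimal sketch of this substitution follows, under one reading of the rule: the looked-up value is inserted verbatim and scanning resumes after the insertion point, so the inserted text is never re-scanned. Leaving an unknown `${id-string}` unchanged is an assumption of this sketch, as are the names `SubstitutionSketch` and `resolve`; escaping of `$` is not handled.

```java
import java.util.Map;

/** Sketch: single-pass ${id-string} substitution without re-scanning substituted text. */
final class SubstitutionSketch {

    static String resolve(String value, Map<String, String> settings) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < value.length()) {
            int start = value.indexOf("${", pos);
            int end = (start < 0) ? -1 : value.indexOf('}', start + 2);
            if (start < 0 || end < 0) {
                // No further complete reference: copy the remainder unchanged.
                out.append(value, pos, value.length());
                break;
            }
            out.append(value, pos, start);
            String id = value.substring(start + 2, end);
            String replacement = settings.get(id);
            if (replacement == null) {
                // Assumption: an unknown id is left in place rather than rejected.
                out.append(value, start, end + 1);
            } else {
                out.append(replacement);
            }
            // Continue after the reference; the inserted text is not scanned again.
            pos = end + 1;
        }
        return out.toString();
    }
}
```

With `rootDir` set to `/a/b/c`, resolving `${rootDir}/file` gives `/a/b/c/file`. Because inserted text is not re-scanned, a self-referencing setting cannot cause an endless loop in this variant.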
Syntax of key-value support
We follow a somewhat modified / augmented syntax of Java properties files; alternatively, JSON files can be used. JSON file usage is detected by reading an initial '{'.
Properties Files style
In our implementation, files are read as UTF-8 (unlike official Java Properties files, whose byte-stream format is specified as ISO 8859-1).
For simple key value pairs, we follow the spec in Java Properties files:
- blank lines are ignored
- comment lines start with '#' or '!'
- keys specified without any value get a value of the empty string (not null)
- keys and values may be separated by blank(s), and optionally '=' or ':'
- values end at the last character before a new line (unless the line is "continued")
- blanks before and after the key (and the optional '=' or ':') are ignored. Blanks after the start of the value are significant.
- the escape character is '\'. All characters can be escaped.
- continuation lines are indicated by an escaped new-line
- blanks on the continuation lines up to the first non-blank character are ignored. An escaped blank is not ignored.
- whitespace includes all characters passing Character.isWhitespace()
- key names must pass Character.isJavaIdentifierStart / Character.isJavaIdentifierPart
- Java supplementary characters are not supported
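The simple key-value rules above can be sketched for a single logical line as follows. This is a simplification, not the actual reader: it assumes continuation lines have already been joined, does not process escape characters, and uses the hypothetical names `KeyValueLineSketch` and `parseLine`.

```java
/** Sketch: splitting one logical line into a key and a value. */
final class KeyValueLineSketch {

    /** Returns { key, value }, or null for a blank or comment line. */
    static String[] parseLine(String line) {
        int n = line.length();
        int i = skipWhitespace(line, 0);
        if (i == n || line.charAt(i) == '#' || line.charAt(i) == '!') {
            return null;
        }
        int keyStart = i;
        while (i < n && !Character.isWhitespace(line.charAt(i))
                && line.charAt(i) != '=' && line.charAt(i) != ':') {
            i++;
        }
        String key = line.substring(keyStart, i);
        checkKey(key);
        i = skipWhitespace(line, i);
        if (i < n && (line.charAt(i) == '=' || line.charAt(i) == ':')) {
            i = skipWhitespace(line, i + 1);
        }
        // A key without a value gets the empty string; trailing blanks are kept.
        return new String[] { key, line.substring(i) };
    }

    private static int skipWhitespace(String s, int from) {
        int i = from;
        while (i < s.length() && Character.isWhitespace(s.charAt(i))) {
            i++;
        }
        return i;
    }

    private static void checkKey(String key) {
        if (key.isEmpty() || !Character.isJavaIdentifierStart(key.charAt(0))) {
            throw new IllegalArgumentException("invalid key: '" + key + "'");
        }
        for (int i = 1; i < key.length(); i++) {
            if (!Character.isJavaIdentifierPart(key.charAt(i))) {
                throw new IllegalArgumentException("invalid key: '" + key + "'");
            }
        }
    }
}
```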
Special extensions to the syntax support arrays and nested maps (1 level nested, only).
Arrays
These values are not self-describing; rather, the fact that this parameter is an array is inferred from the spec of the parameter being overridden.
If the value is required to be an array, it is specified as a blank or comma-separated list. Blank or comma as part of a value can be included using the escape character. We also support a JSON-like alternative notation: an initial character '[' followed by values separated by blanks or commas, possibly over multiple lines (line-ends in this case are ignored, as well as blanks on the following line up to the first non-blank character), followed by ']' with the rest of the line ignored (if not whitespace, a warning is given). Escaped new lines are treated as value continuation in the same manner as Properties files (e.g., initial blanks on the following line are ignored).
The main difference with JSON: strings do not need to be quoted, and spaces and newlines can serve as separators, in addition to commas.
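The array notation can be sketched with a small hand-written scanner. The sketch assumes that escaped line continuations were already joined by the line reader, treats a run of separators as a single separator (so empty elements cannot be expressed), and silently ignores text after the closing bracket instead of warning. `ArrayValueSketch` and `parse` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: reading an array value in plain or bracketed notation. */
final class ArrayValueSketch {

    static List<String> parse(String value) {
        List<String> items = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inItem = false; // true once the current item has at least one character
        int n = value.length();
        int i = 0;
        while (i < n && Character.isWhitespace(value.charAt(i))) {
            i++;
        }
        boolean bracketed = i < n && value.charAt(i) == '[';
        if (bracketed) {
            i++;
        }
        for (; i < n; i++) {
            char c = value.charAt(i);
            if (c == '\\' && i + 1 < n) {
                // Escaped character: taken literally, even a blank, comma or bracket.
                current.append(value.charAt(++i));
                inItem = true;
            } else if (bracketed && c == ']') {
                break; // the rest of the line is ignored
            } else if (c == ',' || Character.isWhitespace(c)) {
                if (inItem) {
                    items.add(current.toString());
                    current.setLength(0);
                    inItem = false;
                }
            } else {
                current.append(c);
                inItem = true;
            }
        }
        if (inItem) {
            items.add(current.toString());
        }
        return items;
    }
}
```

Under these assumptions, `a, b c` and a bracketed list spread over several lines such as `[ a` / `b, c ]` both produce the three values `a`, `b`, `c`, and `one\ two` stays a single value containing a blank.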
Maps (for External Resources overriding)
A key which is overriding an external resource is required to have a map of key-value strings as its value. This is represented by starting with the brace '{' character, followed by key-value strings in the normal syntax for these (except that nested ones are not supported, and cause an error to be signaled), followed by a closing brace '}'. Within the braces, unescaped new lines signal new key value pairs; escaped new lines allow continuation following the same style as in Properties files. The closing brace may be included in the value by escaping it.
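A rough sketch of reading such a map value is given below. Several simplifications are assumptions of the sketch rather than part of the design: lines end with `\n`, any unescaped `{` inside the map is reported as an unsupported nested map, key names are not validated, and the names `MapValueSketch` and `parse` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: reading a one-level map value of the form { key value ... }. */
final class MapValueSketch {

    static Map<String, String> parse(String text) {
        String t = text.trim();
        if (!t.startsWith("{")) {
            throw new IllegalArgumentException("map value must start with '{'");
        }
        Map<String, String> map = new LinkedHashMap<>();
        StringBuilder line = new StringBuilder();
        boolean closed = false;
        for (int i = 1; i < t.length() && !closed; i++) {
            char c = t.charAt(i);
            if (c == '\\' && i + 1 < t.length()) {
                char next = t.charAt(++i);
                if (next == '\n') {
                    // Escaped new line: continuation, skip leading blanks on the next line.
                    while (i + 1 < t.length()
                            && (t.charAt(i + 1) == ' ' || t.charAt(i + 1) == '\t')) {
                        i++;
                    }
                } else {
                    line.append(next); // escaped character, e.g. an escaped '}'
                }
            } else if (c == '{') {
                throw new IllegalArgumentException("nested maps are not supported");
            } else if (c == '\n' || c == '}') {
                addPair(line.toString(), map); // unescaped new line or brace ends a pair
                line.setLength(0);
                closed = (c == '}');
            } else {
                line.append(c);
            }
        }
        if (!closed) {
            throw new IllegalArgumentException("missing closing '}'");
        }
        return map;
    }

    private static void addPair(String line, Map<String, String> map) {
        int n = line.length();
        int i = 0;
        while (i < n && Character.isWhitespace(line.charAt(i))) i++;
        if (i == n) return; // blank line
        int keyStart = i;
        while (i < n && !Character.isWhitespace(line.charAt(i))
                && line.charAt(i) != '=' && line.charAt(i) != ':') i++;
        String key = line.substring(keyStart, i);
        while (i < n && Character.isWhitespace(line.charAt(i))) i++;
        if (i < n && (line.charAt(i) == '=' || line.charAt(i) == ':')) i++;
        while (i < n && Character.isWhitespace(line.charAt(i))) i++;
        map.put(key, line.substring(i));
    }
}
```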
Attaching key-value pair information to top level UIMA descriptors
Multiple methods are supported.
Within the top level descriptor
The top-level descriptor already has the operationalProperties XML element, to which an optional externalOverrideSettings element is added:

    <operationalProperties>
      <modifiesCas> true|false </modifiesCas>
      <multipleDeploymentAllowed> true|false </multipleDeploymentAllowed>
      <outputsNewCASes> true|false </outputsNewCASes>
      <externalOverrideSettings>  <!-- <<<< NEW (optional) element -->
        <import (by name or by value, like all other imports) />
        and/or
        <settings>  <!-- inline -->
          name value
          name value
          etc.
        </settings>
      </externalOverrideSettings>
    </operationalProperties>

The import identifies a file to use. Multiple imports indicate multiple files, processed in order: the first one supplies the defaults, and later ones override earlier ones.
The externalOverrideSettings element is ignored if it is not at the top level.
From the command line
There are 2 things that can be specified in the command line.
- A comma or blank separated list of paths, either in the file system or in the classpath, to properties files, where later paths in the list override the earlier ones.
- One or more -D specifications for the parameter "UIMAexternalOverrides", whose value is a key-value pair, using normal Java command line syntax for -D parameters.
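The first of these options, a list of paths in which later files override earlier ones, might be handled along the following lines. This is a sketch under assumptions: it uses the standard `java.util.Properties` reader rather than the extended syntax described earlier, it tries the file system before the classpath, and the names `SettingsFilesSketch`, `load`, and `open` are hypothetical.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

/** Sketch: loading a comma- or blank-separated list of settings files in order. */
final class SettingsFilesSketch {

    static Properties load(String pathList) throws IOException {
        Properties merged = new Properties();
        for (String path : pathList.split("[,\\s]+")) {
            if (path.isEmpty()) {
                continue;
            }
            try (Reader reader = open(path)) {
                // Loading into the same object lets later files replace earlier values.
                merged.load(reader);
            }
        }
        return merged;
    }

    /** Assumption: a path is tried in the file system first, then on the classpath. */
    private static Reader open(String path) throws IOException {
        Path file = Paths.get(path);
        InputStream in = Files.exists(file)
                ? Files.newInputStream(file)
                : Thread.currentThread().getContextClassLoader().getResourceAsStream(path);
        if (in == null) {
            throw new FileNotFoundException(path);
        }
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }
}
```

Paths that themselves contain blanks or commas would need an escaping rule, which this sketch does not attempt.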