Background
This is a collection of ideas that have been discussed on the mailing lists and elsewhere, regarding configuring UIMA pipelines using an external configuration specification, associated with the top level of running a UIMA pipeline, in a way that overrides the other conventional ways of configuring pipelines within the various annotator engine descriptors.
What running scenarios are contemplated?
The normal run scenario is one where you start up a pipeline, it initializes, and processes CASes and eventually finishes.
Alternatives: During the run, it is reconfigured. Parameters can be reconfigured by APIs; a similar kind of thing could be provided for later, if a need arises, to respecify the list of configuration settings files to reload.
What is configuration?
Configuration is a collection of things, set for a particular UIMA pipeline run. It can include things like conventional UIMA parameter settings, as well as other kinds of settings such as "placeholder" values that are substituted into UIMA deployment descriptors, debug flags, dump the CAS flags, the logging specification for the run, etc.
Configuring simple and complex values
Configuration parameters in UIMA have types like integer, float, string, boolean, or arrays of these. UIMA provides for arbitrarily complex Java Objects as configuration values, using the External Resource specification. The UIMA External Resources design also allows "sharing" of these complex objects among multiple annotators. Similarly, the normal configuration parameters allow "sharing" - that is, the UIMA parameter override design lets one parameter setting be connected to multiple parameters, down the nested hierarchy of annotators.
The same use cases motivating the setting of UIMA configuration parameters, also motivate a (hopefully) similar mechanism for overriding external resource specifications.
Orthogonal issues and considerations in configuring UIMA pipelines
Where to put this information
There is a continuum of places, ranging from least dynamic (most training required) to most dynamic (least training required):
- code (least dynamic, need to understand the code and where to go to modify what you want, requires most training)
- UIMA descriptors (still quite complex; have some special GUI tools for editing them)
- Structured properties files (like Jar Manifests - multiple sections, each section having key-value pairs of strings)
- "properties" files (simple key-value pair strings)
- JVM command line "defines" parameters - individual key-value pairs
Experience shows that for doing "runs", people feel that code and descriptors are complex and hard to comprehend / change, and prefer simpler, more focussed ways to specify things for the run. It is possible to provide more than one of these approaches; if multiples were supported, then some conventional (no surprise) rule for which overrides which, is needed.
JMX
In addition to the above, JMX settings may be desired.
Using the JVM command line as a source of configuration information
Putting configuration into the command line ties a "run" of a pipeline to a "run" of a JVM. There are cases (e.g., running UIMA inside servlets inside a web application container) where multiple, independent instances of UIMA pipelines may independently start, run, and terminate - all without taking the JVM up and down. This argues for an approach not tied to the JVM command line.
On the other hand, the command line approach is very handy for quick augmentation / overriding of particular runs, where starting/stopping the JVM is an option.
Note that JMX settings could be arranged to be either global, or per "UIMA Context".
Encapsulation versus reaching down inside trees of nested aggregates
The original UIMA design attempts to support encapsulation in an aggregate, for parameter overrides. An aggregate may override parameters that its delegates declare. An aggregate can choose which of these it, in turn, is willing to allow a containing aggregate to be able to override; it can choose to "shield" some parameters, making them incapable of being overridden.
This has complicated the practice of designing large complex nested trees of annotators - in requiring aggregates to expose upwards parameters that the top level may want to override.
An alternative mechanism is wanted in these use cases, to allow "reaching down" from a top level into lower levels, without needing all the intervening levels of aggregation to expose individual parameters. However, some degree of control over this is also desired.
A suggested approach is to augment the configuration parameter definition with an additional property - a "global-name" which, if specified, would enable this reaching down, by having at the top level a key-value pair specification, where the key would be the global-name.
Non-path specification of the global name
A use case is to be able to use parameter specifications for different sub-parts of a big descriptor tree, or for the entire tree, without editing the key name. So - the key-name at the top should not include the path (down the nested hierarchy of aggregates); this allows its reuse even if the hierarchy changes.
Configuration settings - arrays
UIMA supports array-valued settings for configuration parameters. In key-value pair formats, some approach is needed for these.
- Multiple keys: having the same key name repeat, indicating multiple values.
- Multiple keys with conventional suffix (e.g., foo.1, foo.2, foo.3): this could be done, but introduces more opportunities for silly user errors (e.g., origin 0 or 1, etc).
- Single key with special syntax for multiple values - e.g., blank or comma separated, with escaping char ("\"?)
For this last alternative, we could introduce JSON-like notation, e.g., the square brackets surrounding the values list. This notation could obviate the need for continuation characters for multi-line value specifications - the closing square bracket would identify the end. Note that in our use case, the setting we're overriding already is declaring if the value is an array or not, so we don't require a special notation to indicate this in the value.
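The single-key alternative can be sketched in a few lines of Java. This is only an illustration under assumptions: the class and method names (ArrayValueSketch, parseArray) are hypothetical, empty elements are silently dropped, and an escaped character at the very end of the value is not handled specially.

```java
import java.util.ArrayList;
import java.util.List;

final class ArrayValueSketch {

    /**
     * Splits a raw value into elements on unescaped blanks or commas.
     * A backslash escapes the next character. Surrounding [ ] are optional.
     * Hypothetical helper, not part of UIMA.
     */
    static List<String> parseArray(String raw) {
        String s = raw.trim();
        // Optional JSON-like brackets around the list.
        if (s.startsWith("[") && s.endsWith("]")) {
            s = s.substring(1, s.length() - 1);
        }
        List<String> elements = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inElement = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                // Escaped character: take the next character literally.
                current.append(s.charAt(++i));
                inElement = true;
            } else if (c == ',' || Character.isWhitespace(c)) {
                // Separator (line ends count as whitespace): close the current element, if any.
                if (inElement) {
                    elements.add(current.toString());
                    current.setLength(0);
                    inElement = false;
                }
            } else {
                current.append(c);
                inElement = true;
            }
        }
        if (inElement) {
            elements.add(current.toString());
        }
        return elements;
    }

    public static void main(String[] args) {
        System.out.println(parseArray("[alpha, beta gamma]")); // prints [alpha, beta, gamma]
    }
}
```

Because line ends are treated as ordinary whitespace, a bracketed list can span several lines without continuation characters, which is the property the paragraph above relies on.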
Configuration settings - Arbitrary Java Objects
UIMA's external resources support configuration settings whose values are arbitrary Java objects. To override these externally, it would be useful to be able to specify the implementation for the Java Object, together with a map of key - string value pairs representing initialization parameters. To support this, some kind of notation could be used for maps. For instance, we could adopt the JSON style - where curly braces would surround nested key - value string pairs, and a reserved key (e.g., com.apache.uima.externalResource.implementationClass) could indicate the implementation class name.
Another possibility is to use indenting: indented lines would be sub-key-value strings representing the map. This covers the main use case with a very simple scheme. On reflection, though, there seem to be possible ambiguities that would require knowing, at read time, the "type" of the value being read (that is, whether you are reading one of these map-like objects). This would be a problem if there is no "match" between the key and the externalOverrideNames.
Other kinds of configuration settings
Other kinds of settings for a UIMA pipeline have been attached to the JVM lifecycle by being specified as -D JVM parameters. Examples of these are the logging properties, UIMA-AS settings for controlling monitoring, UIMA-AS CAS logging, etc.
For consistency, these should have alternatives which are tied to a particular UIMA instance running (for example) as one of many within a container JVM (such as would be the case for multiple servlets, running in a web container).
To allow incorporating other global settings currently specified as -D parameters on the JVM command line, some keys are reserved. These correspond to the names currently used in the -D parameters.
This allows the -D to still be used, but also allows these values to be specified in other ways, i.e., within a top level descriptor.
Computed values via concatenation
A typical use case is to have some parameters be directory paths. In a particular use, several of these may need to have a common root.
These could be written:
param1 : /commonRootString/commonPart2/a1
param2 : /commonRootString/commonPart2/a2
...
or some concatenation could be used:
r : /commonRootString/commonPart2
param1 : ${r}/a1
param2 : ${r}/a2
This is a design trade-off - to support a concatenation-style factoring-out of common parts in the values part of the specification, or not.
Leaving it out in favor of simplicity may make sense, given that today's editors make it very easy to do global changes, and the human eye seems to be OK with seeing spelled-out patterns of repetition.
But if "correct" operation requires that some parts of the configuration specification have exactly the same value, then supporting this kind of thing allows expressing that constraint, and could reduce configuration errors.
Reusable multiple sets of settings
Users want to have settings for some subset of big pipelines, available as separate files, so that these can be reused in other contexts, for instance, when the subset is run separately, or inserted into another pipeline.
Inherited settings
Most systems with lots of configuration settings (e.g., Hadoop, most windowing systems) end up with a capability to have nested hierarchies of setting specifications. This allows putting in a set of defaults for all the settings, in one place, and then specifying an override for just a few settings, in another (often much smaller) file.
The Java Properties class supports this by supporting a chain of key-value maps, each one referring to another map to use if the key is not found in the map. We could use this to support this capability.
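As a small illustration of that chaining, the sketch below layers a handful of overrides on top of a defaults object using the standard java.util.Properties constructor that takes a defaults list. The class name, method name, and the keys used (logLevel, modelDir) are made up for the example.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

final class InheritedSettingsSketch {

    /** Returns a Properties whose lookups fall back to 'defaults' when a key is not overridden. */
    static Properties layer(Properties defaults, Map<String, String> overrides) {
        Properties layered = new Properties(defaults);
        for (Map.Entry<String, String> e : overrides.entrySet()) {
            layered.setProperty(e.getKey(), e.getValue());
        }
        return layered;
    }

    public static void main(String[] args) {
        Properties defaults = new Properties();
        defaults.setProperty("logLevel", "INFO");
        defaults.setProperty("modelDir", "/models");

        Properties run = layer(defaults, Collections.singletonMap("logLevel", "FINE"));
        System.out.println(run.getProperty("logLevel")); // prints FINE (overridden)
        System.out.println(run.getProperty("modelDir")); // prints /models (inherited)
    }
}
```

One detail to watch: only getProperty consults the defaults chain. The inherited Hashtable method get looks at the top layer only, so code that enumerates or reads settings through get will not see inherited values.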
Tooling
Tooling should support taking a UIMA pipeline spec and "resolving" what all the parameters and settings would be once all the overrides etc. are done. This should print out a specification, together with information where useful on where various settings came from (e.g. via what overrides).
Parts of the framework should log (under the CONFIG level) the actual parameter settings, with where they came from.
Parameter Groups
The same externalOverrideName(s) can be supplied for individual specifications or for specifications contained within parameter groups. It is expected that nothing special needs to be done to support parameter groups.
External Resources overriding
The current design for external resources has 3 parts: the resource specification, the resource use, and a binding tying these two together. The binding provides an indirection which allows a name mapping between names used in Java code (the key name used in the External Resource Use declaration) and the actual, perhaps shared External Resource Specification.
Overriding an External Resource for a particular UIMA pipeline run should override the External Resource Specification, keeping the sharing structure that might be present in the pipeline description.
Design Specification
Configuration Parameter
Change the configuration parameter declaration to optionally have an externalOverrideName:
<configurationParameter>
  <name>[String]</name>
  <externalOverrideName>[String]</externalOverrideName> <!-- <<<<< New -->
  <description>[String]</description>
  <type>String|Integer|Float|Boolean</type>
  <multiValued>true|false</multiValued>
  <mandatory>true|false</mandatory>
  <overrides>
    <parameter>[String]</parameter>
    <parameter>[String]</parameter>
    ...
  </overrides>
</configurationParameter>
If present, this parameter can be overridden from the top level (if a value is specified there), using the externalOverrideName value (the global name) as the key. The name must be a suitable key name for a Java Properties file key.
The assumption would be that the publisher of the annotator would not include externalOverrideName specification, but that the assembler, who is putting together multiple annotators, would insert these wherever they needed, with whatever uniqueness in the name, to satisfy the need to expose parameters at the top level, and to share settings (by using the same externalOverrideName value in multiple places).
External Resources Specification
Change this to optionally have an externalOverrideName, with the same behavior as above.
<externalResource>
  <name>[String]</name>
  <externalOverrideName>[String]</externalOverrideName> <!-- <<<<< New -->
  <description>[String]</description>
  <fileResourceSpecifier>
    <fileUrl>[URL]</fileUrl>
  </fileResourceSpecifier>
  <implementationName>[String]</implementationName>
</externalResource>
Support simple factoring and concatenation
To permit encoding knowledge that some (parts of) specifications must match others, allow values to include the form ${id-string} to be substituted by looking up id-string as a key, and then concatenating its value with any surrounding string value. For example, if rootDir had the value /a/b/c, then ${rootDir}/file would resolve to /a/b/c/file.
The id-string would need to be a name suitable as a key name. The value of id-string could be anything; however, its substituted value would not be re-scanned for recursive substitution (because the goal here is a simple, clear abbreviation, not a complex programming language construct). Note that the value of the id-string may itself require scanning and substitution, but the scanning of the result string continues after the substitution point. Loops may not be detected.
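A minimal sketch of this substitution, assuming a single left-to-right pass in which substituted text is never re-scanned, might look as follows. The class and method names are hypothetical, and leaving an unknown ${id-string} untouched is an assumption of the sketch; the text above does not say what should happen in that case.

```java
import java.util.Collections;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SubstitutionSketch {

    // Matches ${id-string}; the id-string is captured as group 1.
    private static final Pattern REF = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${key} with its value from 'settings'; inserted text is not re-scanned. */
    static String resolve(String value, Map<String, String> settings) {
        Matcher m = REF.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = settings.get(m.group(1));
            if (replacement == null) {
                replacement = m.group(); // assumption: unknown references are left as written
            }
            // quoteReplacement keeps '$' and '\' in the value from being treated as regex syntax.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> settings = Collections.singletonMap("rootDir", "/a/b/c");
        System.out.println(resolve("${rootDir}/file", settings)); // prints /a/b/c/file
    }
}
```

Since scanning resumes after each substitution point, a value that itself contains a ${...} reference comes through literally, and a self-referencing definition cannot loop in this version.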
Syntax of key-value support
We follow a somewhat modified / augmented syntax of Java properties files; alternatively, JSON files can be used. JSON file usage is detected by reading an initial '{'.
Properties Files style
In our implementation, we use the UTF-8 codepage (unlike the spec for official Java Properties files).
For simple key value pairs, we follow the spec in Java Properties files:
- blank lines are ignored
- comment lines start with '#' or '!'
- keys specified without any value get a value of the empty string (not null)
- keys and values may be separated by blank(s), and optionally '=' or ':'
- values end at the last character before a new line (unless the line is "continued")
- blanks before and after the key (and the optional '=' or ':') are ignored. Blanks after the start of the value are significant.
- the escape character is '\'. All characters can be escaped.
- continuation lines are indicated by an escaped new-line
- blanks on the continuation lines up to the first non-blank character are ignored. An escaped blank is not ignored.
- whitespace includes all characters passing Character.isWhitespace()
- keynames must pass the Character.isJavaIdentifierStart / isJavaIdentifierPart checks
- Java supplementary characters are not supported
Special extensions to the syntax support arrays and nested maps (1 level nested, only).
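Most of the simple key-value rules listed above match what the standard java.util.Properties class already does when it is given a Reader, so a first approximation can be tried without a custom parser. The sketch below is illustrative only: the class name, the sample text, and the isValidKey helper are assumptions, and the array and map extensions are not covered.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class PropertiesSyntaxSketch {

    // Sample settings text exercising comments, an empty value, separators and a continuation.
    static final String SAMPLE =
          "# a comment line\n"
        + "! another comment line\n"
        + "\n"
        + "emptyKey\n"
        + "name = value with blanks\n"
        + "path : /a/b\n"
        + "city=Z\u00fcrich\n"
        + "longValue = first \\\n"
        + "            second\n";

    /** Loads key-value pairs from UTF-8 bytes rather than the ISO 8859-1 used by load(InputStream). */
    static Properties loadUtf8(byte[] utf8Bytes) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Literal reading of the key-name rule: first char is an identifier start, the rest identifier parts. */
    static boolean isValidKey(String key) {
        if (key.isEmpty() || !Character.isJavaIdentifierStart(key.charAt(0))) {
            return false;
        }
        for (int i = 1; i < key.length(); i++) {
            if (!Character.isJavaIdentifierPart(key.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Properties p = loadUtf8(SAMPLE.getBytes(StandardCharsets.UTF_8));
        System.out.println("[" + p.getProperty("emptyKey") + "]"); // prints []
        System.out.println(p.getProperty("longValue"));            // prints first second
        System.out.println(isValidKey("foo.1"));                   // prints false
    }
}
```

Under a literal reading of the key-name rule, dotted names such as foo.1 are rejected, because '.' is not a Java identifier part.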
Arrays
These values are not self-describing; rather, the fact that this parameter is an array is inferred from the spec of the parameter being overridden.
If the value is required to be an array, it is specified as a blank or comma-separated list. Blank or comma as part of a value can be included using the escape character. We also support a JSON-like alternative notation: an initial character '[' followed by values separated by blanks or commas, possibly over multiple lines (line-ends in this case are ignored, as well as blanks on the following line up to the first non-blank character), followed by ']' with the rest of the line ignored (if not whitespace - a warning is given). Escaped new lines are treated as value continuation in the same manner as Properties files (e.g., initial blanks on the following line are ignored).
The main difference with JSON: strings do not need to be quoted, space and newlines can serve as separators, in addition to commas.
The main difference with Properties files: the value can be spread over multiple lines (up to the closing ']') without using escaped end of line characters.
Maps (for External Resources overriding)
A key which is overriding an external resource is required to have a map of key-value strings as its value. This is represented by starting with the brace '{' character, followed by key-value strings in the normal syntax for these (except that nested ones are not supported, and cause an error to be signaled), followed by a closing brace '}'. Within the braces, unescaped new lines signal new key value pairs; escaped new lines allow continuation following the same style as in Properties files. The closing brace may be included in the value by escaping it.
Attaching key-value pair information to top level UIMA descriptors
Multiple methods are supported.
Within the top level descriptor
The top level descriptor already has the XML below; the externalOverrideSettings element is the new part:
<operationalProperties>
  <modifiesCas> true|false </modifiesCas>
  <multipleDeploymentAllowed> true|false </multipleDeploymentAllowed>
  <outputsNewCASes> true|false </outputsNewCASes>
  <externalOverrideSettings> <!-- <<<< NEW (optional) element -->
    <import (by name or by value, like all other imports) />
    and/or
    <settings> <!-- inline -->
      name value
      name value
      etc.
    </settings>
  </externalOverrideSettings>
</operationalProperties>
The import identifies a file to use. Multiple imports indicate multiple files. The order is the first one is the default; later ones override earlier ones.
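The ordering rule (first import supplies defaults, later imports override) can be sketched by loading each source in turn into one java.util.Properties object, since load adds to the existing table and replaces entries with the same key. The class and method names are hypothetical, and the sources are passed as strings here purely to keep the example self-contained; resolving imports to files is not shown.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

final class ImportOrderSketch {

    /** Loads the sources in list order; a key in a later source replaces the same key from an earlier one. */
    static Properties loadInOrder(List<String> sourcesInImportOrder) {
        Properties merged = new Properties();
        for (String text : sourcesInImportOrder) {
            try {
                merged.load(new StringReader(text));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Properties p = loadInOrder(Arrays.asList(
                "threads = 2\nmodelDir = /models\n", // first import: defaults
                "threads = 8\n"));                    // later import: override
        System.out.println(p.getProperty("threads"));  // prints 8
        System.out.println(p.getProperty("modelDir")); // prints /models
    }
}
```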
The externalOverrideSettings element is ignored if it is not at the top level.
From the command line
There are 2 things that can be specified in the command line.
- A comma or blank separated list of paths, either in the file system or in the classpath, to properties files, where later paths in the list override the earlier ones.
- One or more -D specifications for the parameter "UIMAexternalOverrides", whose value is a key-value pair, using normal Java command line syntax for -D parameters.
14 Comments
Burn Lewis
External Overrides cannot be overridden
When an external override is attached to a parameter in an aggregate that overrides a parameter in one of its delegates, any value assigned to the external override will follow the standard UIMA rules and be assigned to the delegate's parameter ... UNLESS the delegate's parameter also has an external override with a value. The intent here is to ensure that defined external overrides at the level closest to the annotator are always honored. This bottom-up approach is the reverse of but complements the standard UIMA override mechanism, and fits the reaching-in, non-hierarchical intent of this mechanism.
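One way to read this rule is as a search that starts at the annotator and moves outward, stopping at the first external override name that actually has a value. The following sketch encodes that reading; the class, the method, and the representation of the hierarchy as a simple list are assumptions made for illustration and are not the framework's data structures.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class BottomUpOverrideSketch {

    /**
     * overrideNamesClosestFirst: the externalOverrideName declared for one parameter at each level,
     * ordered from the annotator outward; null where a level declares none.
     * Returns the value of the first name that has a setting, if any.
     */
    static Optional<String> resolve(List<String> overrideNamesClosestFirst,
                                    Map<String, String> settings) {
        for (String name : overrideNamesClosestFirst) {
            if (name == null) {
                continue; // this level declares no external override
            }
            String value = settings.get(name);
            if (value != null) {
                return Optional.of(value); // closest level with a value wins
            }
        }
        return Optional.empty(); // fall back to the ordinary UIMA parameter values
    }

    public static void main(String[] args) {
        Map<String, String> settings = new HashMap<>();
        settings.put("aggregateTimeout", "30");
        settings.put("annotatorTimeout", "5");
        // The delegate's own override has a value, so the aggregate's value is not applied.
        System.out.println(resolve(Arrays.asList("annotatorTimeout", "aggregateTimeout"), settings).get()); // prints 5
    }
}
```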
Burn Lewis
Initial Implementation
I have made a first cut at this with a few changes:
In the top-level Descriptor I've wrapped the multiple import elements in "imports" and accept no more than 1 of each of the imports and settings elements. I've also reversed the priority as it seems more consistent with other usages to put the top priority first and the "defaults" below. The imports and in-line settings may be in either order and the order is honored. One consequence of the imports element is that inline settings cannot be inserted between imported files, but I don't believe that this is a serious restriction since inlines are most likely to be used to override entries in all of the imported files.
On the command line -DUimaExternalOverrides=file_name will apply before any in the top-level descriptor.
Arrays are only comma-separated. No maps or ${key} support yet.
The java.util.Properties class is used to load the properties, which means that within a file (or the inline settings) the last of duplicate entries overrides any earlier ones, unlike the highest-priority-first rule for files. We could correct this when we add support for nested maps.
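The mismatch described here is easy to reproduce. The sketch below is not the uimaj code; it is a hypothetical illustration in which sources are given highest priority first and a key is only taken from a source if no earlier source supplied it, while java.util.Properties itself still lets the last duplicate within one source win.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

final class PriorityFirstSketch {

    /** Sources are listed highest priority first; earlier sources win across sources. */
    static Properties loadHighestPriorityFirst(List<String> sourcesInPriorityOrder) {
        Properties result = new Properties();
        for (String text : sourcesInPriorityOrder) {
            Properties one = new Properties();
            try {
                one.load(new StringReader(text)); // within one source, the LAST duplicate wins
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            for (String key : one.stringPropertyNames()) {
                result.putIfAbsent(key, one.getProperty(key)); // across sources, the FIRST wins
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Properties p = loadHighestPriorityFirst(Arrays.asList(
                "a = 1\na = 2\n",   // duplicate inside the top-priority source
                "a = 3\nb = 4\n")); // lower-priority source
        System.out.println(p.getProperty("a")); // prints 2
        System.out.println(p.getProperty("b")); // prints 4
    }
}
```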
I promise to test with the CDE and other tools before publishing!
Burn Lewis
Initial Implementation Published
Now in uimaj trunk UIMA-2378
CDE does not yet expose the new fields but will preserve them.
Arrays only comma-separated, no maps, priority as above.
One level of ${key} evaluation supported:
Marshall Schor
It seems much more direct to avoid evaluation of
by writing
Would that work?
Burn Lewis
Only if you doubled the \ as the Properties class handles escapes.
Should be fixed when we replace with our own class.
Burn Lewis
File Format - JSON-like
Re-implementing the Java Properties class with a format extended to support UTF8 & arrays & maps seems unnecessary when the JSON format is available. The addition of arrays with elements separated by blanks or commas means that blanks in values must be escaped which detracts from their legibility. Escaping non-printable characters such as blanks and line-ends can produce non-obvious errors. Rather than create a new additional syntax to support I suggest we use only JSON – with some of its quoting requirements relaxed for readability. In contrast with Java Properties, the typing provided by JSON would help prevent errors such as assigning the same variable to both a string and an array of strings.
We could simplify JSON by not requiring that the name be a quoted string, and that string values need only be quoted if they contain a reserved character ("[{,white-space), or look like a number or boolean or null, e.g.
We could also make the outer {} optional, and the comma at the end of a line optional.
Since {} define a map I suggest we use $(...) for variable evaluation; as discussed above, evaluation could be avoided by escaping the $.
Note that JSON does allow quoted strings to span multiple lines when the line-end is escaped!
File Format - Even simpler
On the other hand, implementing an extended Properties syntax would support a simpler syntax. If we treat all property values as strings or arrays of strings, and allow arrays and maps to span multiple lines and array elements to be delimited by commas or new-lines, we get:
Note the limited type information in the file, array or primitive, while the types of the elements are checked when used.
From the command line
About the two specifications above:
A comma or blank separated list of paths, either in the file system or in the classpath, to properties files, where later paths in the list override the earlier ones.
The initial implementation supports only a single file, e.g. -DUimaExternalOverrides=/home/burn/DQA/weekly.settings, and I could extend this to a comma-separated list of files, but how would we distinguish between a relative filesystem name and a classpath entry? I'd prefer to not use the classpath here ... we don't for the other -D options. Also as for system and class paths, I'd recommend that the first entry found overrides any in later files. (Note: we have 2 ways of naming -D options - starting with "uima." and with "Uima")
One or more -D specifications for the parameter "UIMAexternalOverrides", whose value is a key-value pair, using normal Java command line syntax for -D parameters.
This ability to specify individual values is similar to the existing envVarRef feature and circumvents the benefits of packaging all of an experiment's parameters in a small set of control files. Also we can process only one of these -D properties so its value would have to be a list of name-value pairs, or some subset of JSON ... perhaps we could defer this for now.
Burn Lewis
Imports by name
The UIMA XML element <import name="classfile"> converts periods to slashes and appends ".xml". Should we define an implicit file-type for our settings files? e.g. .settings .props .params .json .jon ... ?
Marshall Schor
I don't feel strongly, but slightly lean toward something that identifies the purpose of the file - like .uimasettings
Burn Lewis
Variable substitution
Since the value of a substituted variable may be a complex object such as an array or map, substitution should be applied before type coercion, e.g.
This means requiring the definition to occur before the use, or making a second pass over the map to evaluate the entries with forward references - tricky when the use is inside a nested map or array.
How should variables in nested maps be evaluated? i.e. what is the value of bar in:
Should we say variables must be defined at the top-level? Or that we follow a nearest-definition rule (which would produce a conversion error here)? I'd suggest top-level only with perhaps a warning about the inner foo.
Marshall Schor
I think this example mixes up two things. The first thing is the definition of overrides. These can be simple values, array values, or maps. The 2nd thing is the definition of substitutable $-style values, to be used within the first kind of thing (or perhaps, also within the 2nd thing).
Values for the 2nd kind of thing to me can only be simple values, not maps or arrays.
In the example above, the 2nd foo is not a simple definition, it is specifying a map entry whose key is foo and the value is two. So it doesn't qualify as something which defines a $-style substitutable value, in my opinion. For the above reasons, I agree with Burn's suggestion, but would define "top level" differently - in that it should be a "simple value" (in other words, array and map values are excluded). I don't think a warning is even needed in this case.
Burn Lewis
The 2nd issue arises if we use a JSON-like syntax, since the top-level things are also a map. The answer may be to separate the substitution variables from the ones used in external overrides, perhaps by declaring them in a separate section or by style, e.g. SUBVAR or _subvar_
Marshall Schor
In looking at the first part of this example, I think this is taking the substitution into a too-complex realm. I feel it would be simpler to say that only non-array, non-map "values" can be used as $-style substitutable values. If more complexity is required, perhaps some other mechanism (more closely aligned with standard programming paradigms) could be employed. I think the mechanisms envisioned here are meant to capture the broad need for collecting overriding parameters into one place, with one very simple $-style substitution.
So I would not allow PassageIndexPaths - an array value - to be used as a value in a $-style substitution.
Burn Lewis
Unfortunately this would remove one of the goals: sharing the same configuration parameter in multiple annotators.
Further discussions yielded: mark the property as an array, but let its contents be substituted. The substituted values would be strings with type coercion applied when assigned to a configuration parameter.
Burn Lewis
Top Level Descriptor Restriction
Restricting the specification of the override settings to only the top-level descriptor has proven to be sub-optimal. The problem shows up when a working aggregate with a settings file is combined with others and wrapped in an outer aggregate. It will no longer work unless its settings specifications are copied to the new top-level aggregate. Similarly for any of its peer aggregates ... and then the merged settings may have conflicting entries.
A proposed redesign is to support multiple sets of settings and let delegates inherit from their parents. A delegate's external overrides would be resolved by searching all settings found between the top-level descriptor and the delegate. Settings would still be shared and global with their ancestor chain, and aggregates could still override settings specified in their delegates. The -DUimaExternalOverrides entries would only apply to the top-level settings.
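To make the proposed lookup concrete, here is a rough sketch of searching the settings found along the path from the top-level descriptor to a delegate. The names are hypothetical, and the choice that the outermost settings win on conflict is an assumption drawn from the remark that aggregates can still override their delegates; the proposal does not spell out the precedence.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class SettingsChainSketch {

    /**
     * settingsTopDown: one map per descriptor that declares settings, ordered from the
     * top-level descriptor down to the delegate. The first map containing the key supplies the value.
     */
    static Optional<String> lookup(String key, List<Map<String, String>> settingsTopDown) {
        for (Map<String, String> settings : settingsTopDown) {
            String value = settings.get(key);
            if (value != null) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> topLevel = Collections.singletonMap("threads", "8");
        Map<String, String> innerAggregate = Collections.singletonMap("modelDir", "/models");
        List<Map<String, String>> chain = Arrays.asList(topLevel, innerAggregate);
        // A setting that only the wrapped aggregate declares is still found after wrapping.
        System.out.println(lookup("modelDir", chain).get()); // prints /models
    }
}
```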