The CasCopier supports deep copying of Feature Structures. Deep means that when a FS is copied, if it has references to other FSs, those are copied as well.
The CasCopier has multiple APIs, for different kinds of functions. There is support for copying all the FSs in one view to another view, in the same or a different CAS. There is also support for copying individual FSs (the copyFs method). The functions are:
- Copying an entire cas (all views) into another CAS
- Copying just one view into another CAS, or into the same CAS (but a different view)
- Copying just one FS from one CAS into another CAS
The general API involves creating an instance of the CasCopier class, specifying the source and destination CASs. This instance serves to remember when FSs are copied, to prevent them from being copied multiple times. So if a source CAS has multiple references to the same FS, the copies will all refer to the same copied version.
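The bookkeeping described above can be illustrated without the UIMA library. The following is a minimal sketch, not CasCopier's actual code: FsSketch and CopierSketch are hypothetical stand-ins for a Feature Structure and a copier instance, and the identity map is an assumption about how "remembering" could be done.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Feature Structure: a name plus references to other FSs.
final class FsSketch {
    final String name;
    final List<FsSketch> refs = new ArrayList<>();

    FsSketch(String name) {
        this.name = name;
    }
}

// Hypothetical copier: one instance remembers every FS it has already copied.
final class CopierSketch {
    // Keyed by object identity: source FS -> its single copy.
    private final Map<FsSketch, FsSketch> copied = new IdentityHashMap<>();

    FsSketch copy(FsSketch src) {
        if (src == null) {
            return null;
        }
        FsSketch existing = copied.get(src);
        if (existing != null) {
            return existing; // already copied: reuse the same copy
        }
        FsSketch dst = new FsSketch(src.name);
        copied.put(src, dst); // register before recursing so cyclic references terminate
        for (FsSketch ref : src.refs) {
            dst.refs.add(copy(ref)); // deep copy of referenced FSs
        }
        return dst;
    }
}
```

With one CopierSketch instance, two source FSs that both refer to the same third FS end up with copies that both refer to one shared copy. Using a fresh instance for each call would lose that property, which is why the instance is kept for the duration of the copy.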
Some FSs have references to the Subject of Analysis (Sofa) data; these FSs may have "pointers" into that data. A common example is a Sofa which is a string of text, and instances of the built-in type "Annotation" which contain begin and end integers which refer to a substring (the covered text). The types of these kinds of FSs are all subtypes of the built-in CAS type AnnotationBase.
When these FSs are copied, the information that refers into the sofa data (such as begin and end offsets) will be invalid unless it can be resolved against sofa data that is equal to that of the original source view. This can be guaranteed in some situations:
- if the copying is between two different views of the same CAS - in this case the sofa reference can remain pointing to the original sofa, even though the copy is in another view.
- if the copying is to a different CAS, but the sofa data is successfully copied as well.
In other cases, the sofa reference will be invalid, in the sense that access to things like covered text will produce invalid results.
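To make "invalid results" concrete, here is a small sketch under simplifying assumptions: SpanSketch is a hypothetical stand-in for an Annotation that holds only begin and end offsets, and the Sofa data is modeled as a plain Java String rather than a real Sofa.

```java
// Hypothetical stand-in for an Annotation: begin/end offsets into some Sofa text.
final class SpanSketch {
    final int begin;
    final int end;

    SpanSketch(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    // Resolves the offsets against whatever text is supplied. Nothing here checks
    // that the text is the one the offsets were originally computed for.
    String coveredText(String sofaText) {
        return sofaText.substring(begin, end);
    }
}
```

For offsets 4 and 9 computed against "The quick brown fox", the covered text is "quick". Resolved against "A slow green turtle", the same offsets yield "ow gr", and against a text shorter than 9 characters the call throws StringIndexOutOfBoundsException. Neither outcome is what the original annotation meant.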
The current (3/2016) design does not properly account for this.
Proposed fixes for the current design
The main idea is to have the sofa reference for FSs which are subtypes of AnnotationBase (abbreviated FSab) point to either a valid Sofa, or be "null" - not point to any Sofa.
- A code scan is needed to identify all uses of the AnnotationBase sofaRef field to ensure:
  - the right thing happens if it's null, or if it refers to a sofa not equal to the one for the current View
- A code scan is needed to identify all accesses to the Sofa in the current view, to ensure:
  - the sofa being obtained is the one wanted, and not by accident the one for a particular FSab.
- The cas copy code needs to be modified to adjust the sofa reference in the copy per this design
- The add-to-indexes check that complains if the sofa ref != view's sofa would need to allow for the different sofa ref in this case.
A secondary issue is which Sofa in the target is updated. The current design selects the one with the same sofa-number. This is a problem in general:
- the target may have differently numbered sofas, so this could pick an arbitrary sofa.
- For copying between two views of the same CAS, the target picked would be the same as the original, and if the sofa data was set already, an exception would be thrown, trying to set it to itself again.
A better idea may be to update the target view's sofa. This could require caller changes to ensure that the target is the view to use; it could also cause backwards compatibility issues for user designs based on the sofa numbers.
...
reference information may need to be updated, if the sofa data changes. Sometimes, the sofa data won't change; for example, it won't change if a view is copied, along with the sofa data for that view. The APIs do support copying FSs without copying the sofa data. In this case, it is up to the user to ensure that any sofa references are updated appropriately for their application.
Current UIMA design tries to guarantee that the sofa reference feature for a FSab is always the same as the sofa associated with the CAS view used to create the FSab, and that all FSabs in indexes in that view have that view's Sofa. This is done at FS create time - if a subtype of AnnotationBase is being created, its sofa feature is set according to the creating caller's CAS reference. This design fix will break that constraint.
The current CasCopier design sets the sofa ref of the FSab to refer to a copy of the sofa in the target view. This copy may or may not have the same sofa data. The current design gets (or creates, if not present) a sofa (not necessarily from the target view) which has the same "sofa number" integer ID as the original source sofa. For two arbitrary CASes, this is a poor design, since there could be no correspondence at all between the sofa numbers of the two CASes.
The new design:
...
Copying the Sofa Data and Mime Type
A view may or may not have an associated Sofa; the Sofa (if it exists) may or may not have its data "set".
The mime type is set if and only if the data is copied.
The source sofa can be copied in several situations:
...
- if it exists in the source
- create the target SofaFS if needed
- If the target SofaFS already exists and already has the sofa data set, throw an exception (can't set the sofa data once it's set)
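The steps visible above can be sketched as follows. This is an illustration under stated assumptions, not the CasCopier implementation: SofaSketch, ViewSketch and SofaCopySketch are hypothetical names, the Sofa data is modeled as a String, IllegalStateException stands in for whatever exception the real code throws, and the handling of a source Sofa whose data is not set (nothing is copied) is an assumption.

```java
// Hypothetical stand-in for a SofaFS.
final class SofaSketch {
    String data;     // null means the sofa data is not "set"
    String mimeType; // set only together with the data
}

// Hypothetical stand-in for a CAS view.
final class ViewSketch {
    SofaSketch sofa; // null means the view has no associated Sofa
}

final class SofaCopySketch {
    static void copySofa(ViewSketch source, ViewSketch target) {
        if (source.sofa == null) {
            return; // the Sofa does not exist in the source: nothing to do
        }
        if (target.sofa == null) {
            target.sofa = new SofaSketch(); // create the target SofaFS if needed
        }
        if (source.sofa.data == null) {
            return; // assumption: no data set in the source, so nothing is copied
        }
        if (target.sofa.data != null) {
            // The sofa data cannot be set again once it is set.
            throw new IllegalStateException("target sofa data is already set");
        }
        target.sofa.data = source.sofa.data;
        target.sofa.mimeType = source.sofa.mimeType; // mime type only when data is copied
    }
}
```

Copying into an empty target view creates the target Sofa and sets both data and mime type; repeating the same call then throws, because the target data is already set.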
...
view name (equal to sofaID) as the original source sofa