Ev2 Representation of Moves
Summary
Ev2 was designed with an intention to support “move” semantics, but the current design of its move() method does not satisfy the requirements. Here I explain why, and present a solution.
The following principles constrain or guide Ev2 design:
- An edit must be able to express any change that can be constructed in a WC and committed.
- An edit must be able to express any change that can be constructed in a repo across multiple revisions.
- Ev2 uses sequential description: a succession of incremental edits applied to an initial state gives a final state.
- Each node should be changed only once.
Given these constraints, not all combinations of moves can be expressed using a “move source to destination” operation, with or without a “rotate” operation, without using temporary paths.
Moves can be expressed, in general, using separate move-away and move-here operations. The shape of this design is:
- The “move away” end of a move is (or can be) a separate operation from the corresponding “move here”.
- While a subtree is “in limbo” after being “moved away” from its source location and before being “moved here” at its destination, the behaviour is as if it has been moved to a temporary path. The temporary path is in a different name-space so that it cannot conflict with any real path, even a real path outside the scope of the edit.
Why is This Necessary?
Express Any Change
Since the purpose of the editor is to communicate a change originating in a WC when it is committed, and a change originating in a repository when the WC is updated, then it must be able to express any such change. This includes updates across multiple revisions, and from a mixed-revision state to a single revision.
Through a series of simple steps of the form “move X to Y”, some quite funky overall changes can be created. For example, starting from this state*:
| +-- A | +-- B
the following sequence:
move /A/B /X move /A /X/B move /X /A
results in swapping the nodes at paths A and B:
| | +-- A mv--\ /--> +-- A | X | +-- B mv--/ \--> +-- B
If those steps are performed in a WC and the result committed all at once, the editor needs to be able to handle it. If those steps are committed separately, and then a working copy is updated, the editor needs to be able to handle it.
More examples are given later, some of which involve other operations such as “make directory” as well as moves. All of this emergent complexity results from the introduction of a simple “move” primitive, and there does not seem to be any acceptable way to further constrain the basic move that would simplify the derived complexity.
- Notation: In the diagrams, a capital letter represents the name of a node; the node's identity (its node-copy-id) is not shown. In sequential instructions of the form “move /A/B /X”, the “/A/B” represents the source path of a node in the tree state that is being incrementally modified, and “/X” represents the destination path that it will have in the new state after this instruction.
Temporary Paths
Paths in the Ev2 editor operations refer to the current state of the tree being edited, at that moment in the sequence of edit operations. (The sole exception is the copy-from path, which is relative to a given committed revision.)
A natural and compact way of expressing moves would be as a mapping from original-state paths to final-state paths. However, that paradigm does not fit well with the desire to express the edit as a sequence of incremental steps. If we are going to include move functionality as steps in a sequence of edit operations, it makes sense to use paths that are relative to the current state.
Ev2 should not require the driver to express a change as a sequence of operations that can include moving a node to a temporary path and then later moving it again to the final path. The end result of the edit is the important part, and there are unlimited potential sequences that lead to that result, and it does not make sense to require an edit driver to construct such a sequence arbitrarily, if there is an alternative method that specifies the result uniquely. The receiver may in fact need to employ temporary paths in its implementation, but then it knows this and it is in a better position to construct such paths when needed, and it will know that they are temporary which may be important.
There are advantages to placing a node in its final position exactly once. It was claimed that Google's BigTable implementation of Svn's back-end would have benefited from knowing that once a node has been edited then it is in its final state. Ev2 aims to provide that guarantee, where Ev1 could not.
Sequential Description
Ev2 (svn_editor_t) intends to express a new state in terms of an old state and a description of the parts of the new state that differ from the old state, or, in other words, a description of the changes against the old state. It uses a sequential, incremental description: for example, “add directory X, then copy file X/Y from somewhere, then edit the contents and properties of file X/Y”.
Only certain parts of the description are incremental. Ev2 aims to allow nodes to be visited in arbitrary order, subject to a small number of restrictions. The parts where ordering matters are:
- tree changes before content changes
- alter/add/copy a directory (if required) before touching its (immediate) children
A Move Event is Not Adequate
Moves, in general, cannot be expressed as “move from path X to path Y” events in such a sequential description without introducing temporary paths. This is because some cases require the source path of the move to be moved away, then some other steps, and then the destination path of the move can be populated. Some classic examples are:
Example 1: Insert a directory level
| | +--A mv--\ (add) +--A \ | \--> +--B
The add cannot happen before the move-away, but must happen before the move-here.
Example 2: Swap two siblings
| | +--A mv--\ /--> +--A | X | +--B mv--/ \--> +--B
Neither of the moves can be completed before doing the move-away part of the other one.
Example 3: Swap two directory levels
| | +--A mv--\ /--> +--A | X | +--B mv--/ \--> +--B
Neither of the moves can be completed before doing the move-away part of the other one.
A Rotate Event is Not Adequate
At one time there was an idea that the addition of a “rotate PATH1 PATH2 … PATHn” event would complete the semantics and allow arbitrary moves to be supported.
While this does enable Example 2 (swap two siblings) and Example 3 (swap two directory levels) and many other cases, it does not help with inserting a directory level (Example 1), and it has been shown [1] to be incapable of resolving other more involved cases involving swapping or rotation. One specific example is swapping A with A/B/C [2]:
| | +-- A mv--\ /--> +-- A | \ / | +-- B mv--- X ---> +-- B | / \ | +-- C mv--/ \--> +-- C
[1] Email thread on dev@, “Ev2 as a move-aware editor”, started on 2013-06-24, e.g. <http://svn.haxx.se/dev/archive-2013-06/0560.shtml> or <http://mail-archives.apache.org/mod_mbox/subversion-dev/201306.mbox/%3C1372087321.76009.YahooMailNeo%40web186101.mail.ir2.yahoo.com%3E>.
[2] An example problem in thread [1], of swapping A with A/B/C: <http://svn.haxx.se/dev/archive-2013-06/0684.shtml> or <http://mail-archives.apache.org/mod_mbox/subversion-dev/201306.mbox/%3C87bo6rewwp.fsf%40ntlworld.com%3E>.
Move-Away and Move-Here
One solution is to describe the two halves of each move separately:
move-away SOURCE-PATH … move-here DESTINATION-PATH
We can then solve Example 1 in the following way: issue the “move-away A”, then create a new directory at path A which replaces that source of the move, and then finally issue the “move-here A/B” which relies on that replacement directory A having been created.
The consumer must be able to put the node aside for an indeterminate amount of time until the “move-here” is received.
Of course there needs to be a way to link each move-away with the corresponding move-here. Remembering that each edit step refers to the current state in a sequence of states, we cannot simply specify the path corresponding to the other end of the move like this:
move-away SOURCE-PATH to DESTINATION-PATH … move-here DESTINATION-PATH from SOURCE-PATH
because the problem cases are when the destination path does not yet exist at the time of a move-away, or the source path no longer exists at the time of a move-here. What we can do is use some other unique reference that is unique within the edit, like this:
move-away SOURCE-PATH as identifier ID1 … move-here DESTINATION-PATH from identifier ID1
The reference could perhaps be the destination path as it will finally exist at the end of the edit, or just an arbitrary number or string. We will just specify the identifier as an “id” and not specify how it is generated.
Explicit Direct Moves
We could have distinct operations for direct and indirect moves:
move SOURCE-PATH DESTINATION-PATH move-away SOURCE-PATH as identifier ID1 move-here DESTINATION-PATH from identifier ID1
In cases where the driver can issue a move(a,b) instead of a (functionally equivalent) pair of move-away immediately followed by move-here, then the receiver is likely to be able to process that single move more efficiently. So having the three methods available is better than just having the latter two.
- Brane wrote: [The inclusion of an explicit direct move] also makes validating the drive easier. Move away without a matching moved here (or the converse) is clearly invalid. It must be trivial for the receiver to detect that. Making the temporary locations explicit makes that so much easier. Regarding direct move without intermediate state, IMO the driver should be required to to use that whenever it can. Driver always has enough info to know that receiver can process such a move. If it cannot, that indicates a bug in the driver.
We should probably just leave it as a "quality of implementation" issue for the editor driver to prefer single move(a,b) instructions over pairs of move-away, move-here. We could try to come up with a requirement that it must do so in certain cases (starting with collapsing adjacent pairs of move-away, move-here), but I think this may be rather difficult to define fully, and of limited benefit.
Ordering Restrictions
The ordering rules regarding move-away and move-here should include:
- mv-away must come before the matching mv-here
- The edit should provide a sequential procedure that the consumer must be able to follow without having to buffer an arbitrary amount of state.
- mv-here & cp & add must be in nesting order: create (or put in place) the parent before its children
- mv-away must come before deleting a parent
- Receiver needs to know that it must preserve this path when we delete its parent.
- mv-away must come before mv-away of a parent
- If we allowed “mv-away A; …; mv-away A/B” then the child path “A/B” would have to be specified not relative to the current state of the edit, as all other operative paths are, but in some other way, because the parent has gone into temporary namespace, and has perhaps been replaced so that “A/B” now refers to some other node.
- There is a general rule that all edits within a moved directory “A” must come after A is moved its destination, but a mv-away of a subtree of A is not considered an edit for this purpose.
- ### Reconsider this rule. s/must/may/ ?
Examples Solved
Example 1: Insert a directory level
| | +--A mv--\ add--> +--A \ | \--> +--B
- alter-dir / (children={A})
- mv-away A (id=“original A”)
- add-directory A (children={B})
- mv-here A/B (id=“original A”)
Example 2: Swapping two siblings
| | +--A --\ /--> +--A | X | +--B --/ \--> +--B
- alter-dir / (children={A,B})
- mv-away A (id=“original A”)
- mv-away B (id=“original B”)
- mv-here A (id=“original B”)
- mv-here B (id=“original A”)
Example 3 can also be solved in this way, except for some ordering restriction issues that are discussed below.
Some further examples related to example 1.
Example 1b: Remove a directory level (by deletion)
| | +--A (del) /--> +--A | / +--B mv--/
The move-away must (in principle) happen before the delete, while the move-here cannot (in principle) happen before the delete. However, the Ev2 “add” and “copy” operations are defined to perform a simultaneous “delete” when replacing an existing node, and so the “move” operation could do the same and thus make this case trivial. Compare with example 1c.
Example 1c: Remove a directory level (by move-away)
| | +--A mv-->? /--> +--A | / +--B mv--/
Here there are two approaches. Either we move-away B, then move-away A, then move-here B (to path A); or we move-away A, then move-away B (from its temporary path inside A, which might currently be in limbo), then move-here B (to path A).
The Need for Alter-directory before Tree Changes
Why do we have this ordering rule?
If any path is added (with add_*) or deleted/moved/rotated, then an svn_editor_alter_directory() call must be made for its parent directory with the target/eventual set of children.
The original reason was to simplify the handling of “incomplete” directories in the WC. The root of the problem is the WC DB requirement that, before updating the DB entry for directory "DIR" to revision REV, there must be child entries in the DB corresponding to (at least) all the children of DIR@REV in the repository. Quoting Philip Martin [1], with this requirement in place, “The Ev2 receiver only ever has to handle incomplete directories that are empty and never has to handle incomplete directories with children. This may simplify the implementation. For example it may make it simpler and/or more efficient to update following an interrupted update.”
From that requirement we extended and derived a requirement for Ev2 that an alter-directory(children=X) must be issued before any calls that change the list of children. At first sight, that rule seems to fit well with the principle of the Once Rule, if we assume that a typical receiver (WC or repository) will treat a directory node as an object that contains a list of children, and will want to update it in one shot. However, it may not actually be as good a fit as it seems. A repository will probably need more information than simply a list of child names: it will want to know the corresponding node ids, and it won't know them until any copies, adds come in. So we should be careful about assuming this alter-dir is necessarily useful.
First let's address some simple problems in the wording above. Inside an add-directory, all the children have to be added, and that would imply we need an alter-directory as well as the add-directory, violating the Once Rule. It omits “copied” – just an oversight? It does not specify when the call must occur – presumably it must be before any such modifications to the children. To remedy those three initial problems, I suppose something like this is intended:
Either alter_directory() or add_directory() must be called on a directory, declaring its final set of children, before calling delete(), move_away(), move_here(), copy(), or add_*() on any child. (For delete() or move_away(), it must be alter_directory(), as the children of add_directory() must not be deleted or moved away.)
But there is still a problem. If we require alter_directory() before a move_away(), it leads to a violation of the Once Rule as shown in the following example.
Example 3: Swapping two directory levels
| | +--A --\ /--> +--A | X | +--B --/ \--> +--B
- alter-dir A (children={B})
- mv-away A/B (id=”original A/B”)
- alter-dir / (children={A})
- mv-away A (id=”original A”)
- mv-here A (id=”original A/B”)
- mv-here A/B (id=”original A”)
There is a potential problem here:
- We make an edit within subtree A (the “move-away A/B”) before moving A.
This violates the assumed constraint that we should disallow edits within a subtree before moving the subtree.
[1] Email “Re: Ev2 alter_directory deltas” to dev@ from Philip Martin on 2013-08-14, e.g. <http://svn.haxx.se/dev/archive-2013-08/0222.shtml>.
What If We Allow Edit Before Move?
We were assuming that we should disallow edits within a subtree before moving the subtree. [Why?]
One solution might be to drop that requirement and let the subtree be moved (move-away) part way through editing it, allowing all editing operations. To accommodate such a change, some of the other rules that currently refer to “a path” probably need to be reformulated to refer to “a path relative to such a subtree” instead.
If we allow edits before and after moving, should we also allow edits after the move-away and before the move-here? Not sure. It seems like that may be more problematic for certain consumer architectures and so probably should not be allowed. But is there a better way to decide?
What if the Once Rule is Per Node?
The path-based Once Rule was written something like this:
A path should never be referenced more than once by the add_*, alter_*, and delete operations (the "Once Rule"). The source path of a copy (and its children, if a directory) may be copied many times, and are otherwise subject to the Once Rule. The destination path of a copy [or move_here] may have alter_* operations applied, but not add_* or delete. If the destination path of a copy or move is a directory, then its children are subject to the Once Rule. The source path of a move_away() (and its child paths) may be replaced using add_*() or copy() or move_here() (where these new or copied nodes are subject to the Once Rule).
More comprehensively, in tabular form, the sequences of operations allowed on a path are:
Exists? | Op sequences |
|
|
|
no/yes | add-* | [create children] |
|
|
no/yes | copy(dst) | alter-* [optional] | [edit children] |
|
no/yes | move-here | alter-* [optional] | [edit children] |
|
yes | alter-* |
|
|
|
yes | delete |
|
|
|
yes | move-away |
|
|
|
yes | move-away | add-* | [create children] |
|
yes | move-away | copy(dst) | alter-* [optional] | [edit children] |
yes | move-away | move-here | alter-* [optional] | [edit children] |
Perhaps the Once Rule should not apply per path as it was stated, but rather per node. If a directory is altered and then moved away, we should be able to create a replacement directory, being a different node at the same path, and then alter that.
The Once Rule, Per Node
Let's try to define the Once Rule as applying per “node”, and see how far we can get.
- The Once Rule applies per node, rather than per path. The definition of a “node” is such that “move” moves an existing node (with any children) to a new path and does not create a new node, while “add” creates a new node and “copy” creates a new node (with any children), each new node being different from all other nodes that are or were in the tree even if it replaces an existing node at the same path.
- One of the following actions can be applied, just Once, directly to each node:
- create (only if it does not exist in the initial tree)
- remove (only if it exists in the initial tree or is brought in as a child of a copy)
- modify (only if it exists in the initial tree or is brought in as a child of a copy)
- A node may be created by one of:
- add_*()
- copy(), optionally followed by alter_*()
- Its children (recursively) come with it, and are then subject to the Once Rule as “child of a copy” nodes.
- A node (with any children) may be removed by one of:
- delete()
- add_*() replacing this node
- copy() replacing this node
- move_here() replacing this node
When removing a directory, each child (recursively) is considered to be removed at this time as well, and, as such, must not have been touched by any other operation. However, a previous child node may have been moved away [or deleted?] before this deletion.
[? The driver should not delete a node and then delete an ancestor: instead, it should just delete the ancestor.]
- A node may be modified by either or both of the following, in either order:
- move_away() followed by move_here()
- If the node is a directory, its children (recusively) move with it.
- Only if it exists in the initial tree: not allowed for a child of a copy.
- No operation may touch this node or any children (recursively) between move-away and move-here.
- alter_*()
- move_away() followed by move_here()
- alter_directory() is required before “editing” any children
- ### Not sure exactly what we mean here.
- ### The rules would be simpler without this requirement.
- The source of a copy operation may be a node in the initial tree being edited. Such a node (and its children, if a directory) may be copied many times, in addition to being subject to the Once Rule as existing nodes.
In tabular form, the sequences of operations that would be allowed on a node in this scheme are:
Exists? | Op sequences |
|
|
|
|
no | add-* | [create children] |
|
|
|
no | copy(dst) | alter-* [optional] | [edit children] |
|
|
yes | alter-* | [edit children] |
|
|
|
yes | [move-away children] | delete |
|
|
|
yes | [move-away children] | replaced by add-* |
|
|
|
yes | [move-away children] | replaced by copy() |
|
|
|
yes | [move-away children] | replaced by move-here() |
|
|
|
yes | move-away | move-here | alter-* [optional] | [edit children] |
|
yes | alter-* | [edit children] | move-away | move-here | [edit children] |
But there is a problem. While this per-node definition works for commit editor, the update editor works on WC paths, not on nodes.
Pathwise Edit Drive in an Update
During an update, Subversion runs the editor over the WC paths. Every WC path is not guaranteed to have a distinct node-copy-id, because of (a) switched WC paths and (b) mixed-revision WC paths. Therefore we cannot require distinct node-copy-ids in the edit.
Conclusion
Further work is needed to clarify the meaning of the Once Rule, and the ordering restrictions, especially the requirement about calling alter-dir. Some other aspects of Ev2 rules are still unclear.
Further Work
Alternative Approaches
The following approaches are just theoretical possibilities that could be explored, noted here for completeness and for contrast. There is no suggestion that they are good or even possible alternatives.
1.
An alternative design could perhaps be based on the following: Have a single “move” instruction, and allow it to be issued before there is a parent path ready to accept the destination. The destination's parent directory might need to be created (or replaced), and the destination path itself might need to be freed by deleting or moving away a node that exists with the same name. The destination path might be expressed as a final path, or as a node id and a child name, and the receiver would have to figure out when the destination's parent directory was ready. In other words, rather than have the driver specify the timing of the move-away and move-here, the receiver would choose when to perform the move-here side of the move. How would the receiver know when the destination parent directory is “ready”?
2.
We could issue a complete mapping of initial paths to final paths (or equivalent) at the beginning of the edit. This is quite a natural way of expressing moves, but does not seem a good fit for an editor that is basically sequential.
Half-Moves at the Edge of the Subtree
If an edit is limited to a subtree of the repository, and if we assume this limit is pathwise, then it is possible for a node to move into or out of this subtree during the edit. There are two ways to handle this:
- Driver describes it as an add or delete.
- Driver describes it as a move (or rather as half of a move).
Advantages to describing it as a move include:
- The consumer can then choose whether to warn the user, raise a tree conflict, or even perhaps commit or update another subtree in order to complete the move.
The consumer can convert a move-out to a delete trivially.
The consumer can convert a move-in to an add (of sorts) by requesting the node's content from the server, which is trivial during a commit (when the consumer is the server).