Preface

This page is not meant as a fundamental critique of Subversion's merge implementation but rather tries to approach the problem from a radically different angle. Many of the algorithms used today may turn out to be quite reasonable albeit derived from a simplified model.

Moreover, the current content of this page is more of an unsorted brain dump that will need further structuring. It may attempt to address individual issues but will not present a coherent, formal model. More than anything, it aims to create deeper insight into the nature of the problem.

Core Issue: Impedance Mismatch

Content vs. container based operations.

User's view on merge

The user changes some text and wants to merge that change to some other development line.

To be precise, the user changes the content of some document and wants the tool to make an equivalent change to some similar or related document.
[
Footnote: Subversion's model of "a branch is just a copy" is fully consistent with that view. If there has been a reason to modify some text section, it is likely that the same reasoning will apply to any copy of that section. If Subversion supported additional semantics on copies (split, join, tag, one-way, ...), copies could be a more powerful concept than just branches.
] The problem here is that most tools have a very limited understanding of each of the five key terms in that sentence: change, content, document, equivalent, and similar or related.

Change. More than just a diff, a change has a scope and an intent. That intent will then result in a diff. Most tools don't understand intents like "replace all occurrences of X with Y" or "move block B to position P" etc. In many cases, it should be possible to deduce the intent and to represent the change as a set of operations of pre-defined types. This is also linked to the next item:

Content. As opposed to structure (e.g. textual order, file mapping) and formatting (e.g. whitespace), only this subset of a document's byte stream is relevant to the document's formal validity. The other two aspects are usually of lesser importance. A generic tool may or may not be able to differentiate between these three aspects. When it can, several classes of conflicts can be resolved automatically.

Document. A conceptual unit defined by the user's application. It is often not identical with a file or set of files. Files and folders are containers that documents get mapped to. For instance, the source code of a C library may be a document. Its mapping to .c and .h files is mainly convention and may change over time but that does not affect the document's content. A merge tool must operate on documents, not individual files.

Equivalent. Having the same effect on the target document as on the source. Some changes may not be necessary (e.g. the affected code section no longer exists). Others may be extended (e.g. replacing a member identifier within a function that got many lines added on the target side). In any case, the changes must be translated to the target.

Similar or Related. Merging between related documents is simpler because the full change history of both sides is known. An ideal tool that understood the intent of a change should be able to apply it just as well to any sufficiently similar document without requiring any relationship with the source.
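
To make the "Change" and "Equivalent" items more concrete, here is a minimal sketch of a change represented as a list of typed intents rather than as a diff. The class names ReplaceAll and MoveBlock and the function apply_change are hypothetical and exist only for this illustration; they are not part of Subversion or any other tool. The string replacement is deliberately naive (plain substring matching, no notion of identifiers).

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class ReplaceAll:
    """Hypothetical intent: replace every occurrence of `old` with `new`."""
    old: str
    new: str

    def apply(self, text: str) -> str:
        # Naive substring replacement; a real tool would be token-aware.
        return text.replace(self.old, self.new)


@dataclass(frozen=True)
class MoveBlock:
    """Hypothetical intent: move the lines `block` to directly after the line `after`."""
    block: Tuple[str, ...]
    after: str

    def apply(self, text: str) -> str:
        lines = text.split("\n")
        n = len(self.block)
        for i in range(len(lines) - n + 1):
            if tuple(lines[i:i + n]) == self.block:
                rest = lines[:i] + lines[i + n:]
                if self.after not in rest:
                    return text  # anchor missing: leave the text untouched
                pos = rest.index(self.after) + 1
                return "\n".join(rest[:pos] + list(self.block) + rest[pos:])
        return text  # block no longer exists: the change is not necessary


Intent = Union[ReplaceAll, MoveBlock]


def apply_change(text: str, change: List[Intent]) -> str:
    """Apply each intent of a change, in order, to an arbitrary document."""
    for op in change:
        text = op.apply(text)
    return text


# The change was made on the source; the target gained an extra line since.
change = [ReplaceAll("m_count", "m_size")]
target = "int m_count;\nm_count = 0;\nlog(m_count);  // line added on the target side"
merged = apply_change(target, change)
```

Because the intent is replayed instead of a diff, the occurrence in the line that only exists on the target side is renamed as well, which a line-based patch from the source would miss.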

Issues with Subversion's approach

Subversion is an excellent version control system and a solid basis for configuration management. However, it only manages the versions of data containers (files and folders) with virtually no understanding of their content. That limits its merge capabilities to container-level operations. This is where the impedance mismatch lies: using container-level operations to implement content-level use-cases.

Moreover, merge tracking information is stored with the (resulting) data. Instead, it should be an attribute of the change, i.e. should be stored with the revision as detail information to the various changes in that revision.

Other tools

[based on hear-say, details may be wrong]

Git accidentally got the document vs. file part less wrong by loosely identifying files via content rather than name, i.e. the actual file change history is of lesser importance.

ClearCase (and to some degree Git) will merge complete branch histories at branch level, i.e. it always merges the "whole document" and can resolve structural changes more easily, with a lower risk of creating conflicts in the future. E.g. moves cannot be merged partially.

Practical conclusions

Even without an in-depth analysis and attempt on modeling a perfect merge scheme,

A use-case

Subversion should allow for large-scale refactorings to be performed on some branch and then be merged successfully to other branches and the main development line. None of that should unduly disrupt anybody's development.

Annotated copies and deletions (document structure)

Introduce the concept of split and join for files and folders. In the first case, changes must be promoted to exactly one of the copy targets. Similarly, changes to any of the sources of a join will be applied to its target.

These are typical operations when refactoring a data model (classes, modules etc.) and generalize the concept of rename tracking.

It may also be useful to combine split and join in a single operation, e.g. 3 -> 2 files.

By extension, text blocks moved from one file to another should be detected.
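
The routing rule for split and join can be sketched as follows. This is only an illustration under stated assumptions: Split, Join, AmbiguousSplit and route_change are hypothetical names, and "the one target that still contains the changed text" is an assumed heuristic for picking the split target, not something Subversion implements.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Split:
    """Hypothetical annotation: `source` was split into several `targets`."""
    source: str
    targets: List[str]


@dataclass(frozen=True)
class Join:
    """Hypothetical annotation: several `sources` were joined into `target`."""
    sources: List[str]
    target: str


class AmbiguousSplit(Exception):
    """The change cannot be promoted to exactly one split target."""


def route_change(path: str, old_text: str, op: Union[Split, Join],
                 tree: Dict[str, str]) -> str:
    """Return the path on the refactored side that a change to `path` applies to.

    `old_text` is the text the incoming change modifies; `tree` maps paths on
    the refactored side to their current content.
    """
    if isinstance(op, Join):
        return op.target if path in op.sources else path
    if path != op.source:
        return path
    # Assumed heuristic: the change goes to the one target holding the old text.
    hits = [t for t in op.targets if old_text in tree.get(t, "")]
    if len(hits) != 1:
        raise AmbiguousSplit(f"{path}: {len(hits)} candidate targets")
    return hits[0]


split = Split("model.c", ["shape.c", "color.c"])
join = Join(["a.c", "b.c"], "ab.c")
tree = {"shape.c": "int area();\n", "color.c": "int hue();\n"}

split_target = route_change("model.c", "int hue();", split, tree)
join_target = route_change("b.c", "anything", join, tree)
```

When zero or several targets match, the sketch raises instead of guessing, which corresponds to reporting a conflict to the user.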

Branch and merge directions

Copies should have two boolean flags: merge-from-source and merge-to-source, both being set by default. They specify the default change flow. For instance, a stabilization / release branch would only have "merge-from-source" set while tags would set neither flag.

With that, users can "pull in" any outstanding changes from all branches (or push changes to branches). That is an interesting feature for GUI clients.

Attempts to merge without the respective flag being set will require a "--force" parameter.
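
A minimal sketch of how such flags might gate a merge. CopyFlags, MergeRefused and check_merge are hypothetical names chosen for this example, and the direction strings are an assumption; only the two flags, their defaults and the "--force" override come from the proposal above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CopyFlags:
    """Proposed per-copy flags; both are set by default."""
    merge_from_source: bool = True
    merge_to_source: bool = True


# Examples from the text: a release branch only pulls from its source,
# a tag takes part in no merges at all.
RELEASE_BRANCH = CopyFlags(merge_from_source=True, merge_to_source=False)
TAG = CopyFlags(merge_from_source=False, merge_to_source=False)


class MergeRefused(Exception):
    """The requested merge direction is not enabled for this copy."""


def check_merge(flags: CopyFlags, direction: str, force: bool = False) -> None:
    """Raise MergeRefused unless the direction is enabled or `force` is given."""
    if direction == "from-source":
        allowed = flags.merge_from_source
    elif direction == "to-source":
        allowed = flags.merge_to_source
    else:
        raise ValueError(f"unknown merge direction: {direction!r}")
    if not allowed and not force:
        raise MergeRefused(f"merge {direction} is disabled; use --force to override")


def is_allowed(flags: CopyFlags, direction: str, force: bool = False) -> bool:
    """Convenience wrapper returning a boolean instead of raising."""
    try:
        check_merge(flags, direction, force)
        return True
    except MergeRefused:
        return False
```

A GUI client could use the same flags to list every copy with merge-from-source set when offering to pull in outstanding changes.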

Change hierarchy

Separate the change information into

  • structural changes (text moves, splits, joins etc.)
  • textual changes
  • whitespace changes (indentation, line breaks)

Use the first to translate change positions, then apply textual changes and finally whitespace changes.

Conflicts will be resolved in the same order with the respective next step being adjusted to the output of the previous one. E.g. if the indentation of 4 lines on the target side got changed but the incoming text change replaces them with 3 lines, the result will change the indentation of those 3 lines - without creating a conflict.
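
The indentation example can be sketched as follows. The sketch covers only the textual and whitespace layers, and it assumes the incoming change is given as an old block and a new block of lines; the function names are hypothetical. Locating the block by its whitespace-free text stands in, very roughly, for the position translation step.

```python
from typing import List, Tuple


def split_indent(line: str) -> Tuple[str, str]:
    """Split a line into (leading whitespace, remaining text)."""
    text = line.lstrip(" \t")
    return line[:len(line) - len(text)], text


def merge_text_then_whitespace(target: List[str], old_block: List[str],
                               new_block: List[str]) -> List[str]:
    """Apply a textual change to `target`, then re-apply the target's indentation.

    The old block is located by comparing text with indentation stripped, so a
    pure indentation change on the target side does not cause a conflict.
    """
    if not old_block:
        raise ValueError("old_block must not be empty")
    texts = [split_indent(line)[1] for line in target]
    old = [split_indent(line)[1] for line in old_block]
    n = len(old)
    for i in range(len(texts) - n + 1):
        if texts[i:i + n] == old:
            base_indent = split_indent(old_block[0])[0]
            target_indent = split_indent(target[i])[0]
            replaced = []
            for line in new_block:
                indent, text = split_indent(line)
                # Keep indentation relative to the block start where possible.
                extra = indent[len(base_indent):] if indent.startswith(base_indent) else ""
                replaced.append(target_indent + extra + text)
            return target[:i] + replaced + target[i + n:]
    raise LookupError("textual conflict: old block not found on target")


# Source side: four lines at indentation 4 are replaced by three lines.
old_block = ["    a = 1", "    b = 2", "    c = 3", "    d = 4"]
new_block = ["    a = 1", "    bc = 5", "    d = 4"]
# Target side: the same four lines were re-indented to 8 spaces.
target = ["begin", "        a = 1", "        b = 2", "        c = 3", "        d = 4", "end"]

result = merge_text_then_whitespace(target, old_block, new_block)
```

The result contains the three new lines at the target's indentation of 8 spaces, matching the outcome described above.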

On the non-importance of perfect automatic merges

Status quo

Despite the advertisement, SVN in its default configuration does not guarantee the consistency of a set of files after a commit. Disjoint sub-sets of files may be modified and committed concurrently without any consistency check on the whole file set.

This is a reasonable trade-off between workflow restrictions imposed by the tool itself and those defined by the development team / process. Most organizations will use automatic builds and tests to verify that the repository content is consistent. As long as build breakage is transparent and infrequent, the overall productivity is much higher than with an enforced, fully serialized update - build & test - commit cycle.

Potential for improved usability

Subversion should be able to reconcile more changes on the server, i.e. without forcing the user to update.

Directory property changes, for instance, should be accepted without a full tree update as long as there was no other change to those props. That will improve the merge tracking user experience in larger projects.
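
The acceptance rule for directory property changes could look roughly like this. It is a hypothetical server-side check written for illustration, not Subversion's actual commit logic; the function name and the shape of the property log are assumptions.

```python
from typing import List, Set, Tuple


def can_accept_prop_commit(base_rev: int, changed_props: Set[str],
                           prop_log: List[Tuple[int, str]]) -> bool:
    """Decide whether a property commit needs a client-side update first.

    `base_rev` is the revision of the directory in the working copy,
    `changed_props` are the property names the commit modifies, and
    `prop_log` lists (revision, property name) for every property change
    recorded on that directory in the repository.
    """
    return not any(rev > base_rev and name in changed_props
                   for rev, name in prop_log)


# The directory's svn:ignore changed in r10 and its svn:mergeinfo in r12.
prop_log = [(10, "svn:ignore"), (12, "svn:mergeinfo")]

ok_ignore = can_accept_prop_commit(11, {"svn:ignore"}, prop_log)
ok_mergeinfo = can_accept_prop_commit(11, {"svn:mergeinfo"}, prop_log)
```

With a working copy at r11, a change to svn:ignore would be accepted because that property has not changed since, while a change to svn:mergeinfo would still require an update.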

Impact on merge

Automatic merge may be more aggressive in resolving conflicts as long as "questionable" decisions are documented, e.g. by warnings. Most failed merges / merge artifacts will manifest in either build or test failures.

In case of an undetected merge-induced problem, it will be hard to distinguish it from similar problems caused by concurrent changes on the same development line. So even aggressive merge conflict resolution strategies don't create a need for extra QA, because the same QA is already required by concurrent development and its integration needs.

Semantic ambiguities

TBD. Keywords:

  • Merge order
  • re-applying changes
  • effect on future merges from other sources
  • Maybe solution: Content = sum of its changes.