If you use a distributed revision control(DRC) such as Mercurial or Git, you are probably familiar with the term changeset:
Changesets are a way to group a number of modifications that are relevant to each other in one atomic package, that can be canceled or propagated as needed

The function of a revision control is to store the history of objects. Basically, the managed objects are files (and some of them consider directory as objects too). Those objects can be represented using two values:
- the structure: the path relative to the root of the project
- the content
So, when we move a file to another directory, we modify the first value. When we delete one object, we move it to “null” (similar to /dev/null on Unix systems). When we add/modify/remove a line in a text file, we modify the second state. And, when you add a new file to the revision control, you create a new object to track.
When one of this value changes (or both), the revision control stores a new revision of the object. Each system manages these changes in its own way but, it is globally how it works. Between two “commits”, the RC creates a changeset which is the group of modifications done on managed objects.
Merges: the nightmare of developers
The first kind of changes are not hard to manage but the second one is the nightmare of developers, especially if your team doesn’t commit frequently. It’s the main reason why we advise developers to split up their work in small changesets and to push/pull the changes many times a day.
But sometimes, we can’t avoid it: we have to merge content modifications. In order to help us, tools have two types of merge:
There are two primary types of merges performed by automated merge tools: 2-way merge, and 3-way merge. A 3-way merge is a more powerful and reliable method of merging than is afforded by the 2-way merge.
Two-way merge
A two-way merge performs an automated difference analysis between a file ‘A’ and a file ‘B’. This method considers the differences between the two files alone to conduct the merge and makes a “best-guess” analysis to generate the resulting merge. Consequently, this type of merge is usually the most error prone and requires user intervention to verify and sometimes correct the result of the merge prior to completing the merge event.
Three-way merge
A three-way merge is performed after an automated difference analysis between a file ‘A’ and a file ‘B’ while also considering the origin, or parent, of both files (usually the parent is the same for both). This type of merge is generally only available through the use of supporting revision control systems where such a parent would normally exist. The merge tool examines the differences and patterns appearing in the changes between both files as well as the parent, building a relationship model to generate a merge of files ‘A’, ‘B’, and the parent, ‘C’ to produce a new revision, ‘D’.
This merge is the most reliable and has performed well in practice. It has also required the least amount of user intervention, and in many cases, requiring no intervention at all (depending upon the complexity of the merge) making the process available for task automation.
It is what we use today to merge text files(binary files need a specific tool to be merged). The three-way solution is used by default with Mercurial and it’s true: in many cases, it doesn’t require an intervention. The developer has to make a manual merge only when a conflict appears. As it is described in the documentation of diff3, the definition of a conflict is:
diff3 mine older yours
You can remember the order of the arguments by noting that they are in alphabetical order.You can think of this as subtracting older from yours and adding the result to mine, or as merging into mine the changes that would turn older into yours. This merging is well-defined as long as mine and older match in the neighborhood of each such change. This fails to be true when all three input files differ or when only older differs; we call this a conflict. When all three input files differ, we call the conflict an overlap.
So, this solution is good and mature to merge content changes. But does diff really detect modifications of code? To be more concrete, these two samples are the same:
public class C{ public void method(){}; }
public class C { public void method(){}; }
But when we compare them with diff tools, we detect a change (even if the “Ignore White space where applicable” option is enabled):

Eclipse compare
Yes, it’s a simple case but it shows what I want to highlight here: the content of the file changes but the code don’t! And with current tools, it is detected as a modification.
Understanding the structural changes in the source code
As you know, a builder creates a tree of your code named an abstract syntax tree (AST). If you are an Eclipse user, you can view this tree with the AST Viewer. I just give you a simplified sample of a graphical representation.

Credit: Institut für Informatik III Rheinische Friedrich-Wilhelms-Universität Bonn
So, why don’t use the AST to compare files? If the tree doesn’t change, no source code modification is detected: my previous sample with class C is solved. But, when it changes, it can be hard to detect what the modifications on the AST are (comparisons of trees are complex by default
). I’ve found a paper from the Software Engineering Group of the Department of Informatics (IfI) at the University of Zurich. They currently develop a tool named ChangeDistiller.
The ChangeDistiller is an Eclipse Plugin to extract source code changes based on tree differencing. Subsequent abstract syntax tree revisions of Java files are converted into generic tree data structures and compared to compute the basic tree edit operations that transform an original into a modified tree. The tree edit operations and the source code information from the AST are used to classify the source code changes according to our taxonomy of source code changes.
I recommend you to read some of their papers, particularly this one: Change Distilling: Tree Differencing for Fine-Grained Source Code Change Extraction published in “IEEE Transactions in Software Engineering, 2007“. They explain how their algorithm works (I don’t want to copy/paste their research here). It is an improvement of a previous work of Chawathe et al. They have benchmarked their tool and the results are encouraging:
To validate our improvements, we established an extensive benchmark comprised of 1,064 manually classified changes. Compared to the original algorithm of Chawathe et al., we approximate the minimum conforming edit script with a mean absolute error of 1.64 and a mean absolute percentage error of 34 percent per version pair, that is, an improvement of 45 percent.
Translate it to: our algorithm is fallible, we still have errors so, it needs further improvements. And it is only on Java code at this time… So, we won’t replace our current diff tomorrow, but we can imagine that, someday, we will use this kind of tool.
I like this approach, and you, what is your opinion on AST comparisons?
Comments 4
To clarify, I assume by “error”, they mean a “suboptimal” difference rather than “incorrect? That is, they show a larger difference between the files than necessary?
There’s known approaches for visualising text diffs; I wonder how well AST diffs could be visualised?
Posted 30 Apr 2009 at 7:52 am ¶Hello Matt,
Posted 02 May 2009 at 1:47 pm ¶If I have correctly understood their experiment, they previously have compared files and they have identified the changes manually. Then, they have launched their algorithm and make a diff between what was expected and what their solution detects.
AST diffs is hard and, in my opinion, it is still a research project. If you are aware of MDA, you maybe know the Eclipse EMF project. One of its subprojects is EMF compare led by Obeo: it enables to manage the merge two metamodels. You can get an overview of the UI here (the presentation is available).
IEEE TSE Papers are great, but they aren’t real tools. See a Smart Diff tool that does AST compares for multiple languages at
http://www.semanticdesigns.com/Products/SmartDifferencer
Posted 14 Aug 2009 at 7:22 am ¶Hello Ira and thank you for your comment.
I wasn’t aware of this tool. I hope I’ll get time to test it soon. It seems pretty good! Thank you. I’ll try a short review on it.
I hope I see you again on my blog.
Coderfriendly
Posted 14 Aug 2009 at 5:01 pm ¶Trackbacks & Pingbacks 2
[...] bookmarks tagged friendly What are changes in the source code? | Coder-frien… saved by 7 others hihuli bookmarked on 04/27/09 | [...]
[...] What are changes in the source code? | Coder-friendly [...]
Post a Comment