iDiff: Interaction-based Program Differencing Tool

Hoan Anh Nguyen, Tung Thanh Nguyen, Hung Viet Nguyen, and Tien N. Nguyen
Electrical and Computer Engineering Department
Iowa State University
{hoan,tung,hungnv,tien}@iastate.edu

Abstract: When a software system evolves, its program entities, such as classes and methods, also change. System comprehension, maintenance, and other tasks require detecting the entities that changed between two versions. However, existing differencing tools are file-based and do not handle well the common cases in which methods/classes are reordered, moved, renamed, or modified. Moreover, many tools show program changes at the level of text lines. In this demo, we present iDiff, a program differencing tool that displays the changes to classes/methods between two versions and tracks the corresponding classes/methods even if they were reordered, moved, renamed, and/or modified. The key idea is that during software evolution, an entity could change its location, name, order, and even its internal implementation, whereas its interactions with other entities tend to be more stable. iDiff represents a system at a version as an attributed graph in which the nodes represent program entities and the edges represent the interactions between them. Entities in two versions are matched via an incremental matching algorithm that uses the similarity of interactions as the matching criterion. The differences between two versions of the entire system, including its program entities, are then detected based on the matched entities.

I. INTRODUCTION

A program consists of logical entities, e.g., classes, methods, and data structures, which are designed to provide system functionality. As the program evolves, its entities also change (being renamed, moved, reordered, and/or modified). Those changes across the entire system are important for comprehending system evolution and for other maintenance tasks. However, the lack of documentation and the large number of
program entities in a large-scale system make the manual recovery of system changes tedious and time-consuming. Thus, it is desirable to have an automatic differencing tool that detects the entities changed between two versions.

Unfortunately, many existing program differencing tools are file-based and display program changes at the level of text lines, disregarding the program entities in source files. This creates a mental gap and a burden for developers: during development their mental process revolves around program entities, while program differencing tools present textual changes in individual files. Thus, a program differencing tool is more beneficial if it hides the details of the differences in terms of text lines and files/directories and shows the logical changes to program entities.

Achieving that goal is challenging because during software evolution, methods/classes are often renamed, reordered, moved, and even changed in their internal implementation. Existing tools find the corresponding entities based on their names and/or internal implementations. However, in practice there are cases, as shown next, in which an entity changes in both name and implementation, making such matching a challenge.

II. MOTIVATING EXAMPLE

Figure 1 shows an example in jHotDraw, a library for graphical drawing. It has a feature to handle drag and drop between Components, implemented via class DragNDropTool. In version 5.4b1, this class has a method setDragSourceActive. This method is used to turn on/off the drag-and-drop states of the components stored in comps. However, in the newer version 5.4b2, due to a change in the system's design, a DragGestureListener is added to support monitoring of dragging and dropping, replacing the role of the list comps and part of the implementation of setDragSourceActive. Thus, comps is removed and setDragSourceActive is re-implemented. Its functionality is now only for switching a
boolean flag dragOn; thus, it is renamed to setDragOn (Figure 1).

The two methods setDragSourceActive and setDragOn have different names and bodies in the two versions. However, they have the same role in the system, as they are used in the same way in two other methods, activate and deactivate. In general, in software evolution, an entity could change its location, name, order, and even its internal implementation, whereas its usage interactions with other entities tend to be more stable. Existing tools cannot match setDragSourceActive and setDragOn across the two versions.

Traditional program differencing tools (e.g., Unix and CVS's diff tools [17], Ldiff [2]) focus on textual differences between program files. Other advanced differencing and refactoring recovery tools [1], [15], [5], [19], [4] look at semantic differences between two entities; however, they compare the entities' internal implementations and do not consider their usage interactions, i.e., how they are used by others. Origin analysis tools [18], [11], [9], [10], which aim to determine the original versions of entities before program changes, rely on name similarity during matching.

III. IDIFF APPROACH

We propose iDiff, an interaction-centric approach that matches entities mainly based on their interactions. Our philosophy is that an entity is generally associated with a role/functionality in the system, and this role is quite stable as the system evolves. The role/functionality of an entity is reflected in its interactions with other entities (e.g., how it uses or is used by others). Thus, we can match entities between versions based on the similarity of their interactions.

978-1-4577-1639-3/11/$26.00 © 2011 IEEE. ASE 2011, Lawrence, KS, USA.

Version 5.4b1:

    public class DragNDropTool extends AbstractTool {
        private List comps;

        public void activate() {
            super.activate();
            setDragSourceActive(true);
        }

        public void deactivate() {
            setDragSourceActive(false);
            super.deactivate();
        }

        private void setDragSourceActive(boolean newState) {
            Iterator it = comps.iterator();
            while (it.hasNext()) {
                DNDInterface dndi = (DNDInterface) it.next();
                dndi.setDragSourceActive(newState);
            }
        }
    }

Version 5.4b2:

    public class DragNDropTool extends AbstractTool {
        private DragGestureListener dragGestureListener;
        private boolean dragOn;

        public void activate() {
            super.activate();
            //setDragSourceActive(true);
            setDragOn(true);
        }

        public void deactivate() {
            setDragOn(false);
            //setDragSourceActive(false);
            super.deactivate();
        }

        protected void setDragOn(boolean isNewDragOn) {
            this.dragOn = isNewDragOn;
        }
    }

Fig. 1. setDragSourceActive to setDragOn: method changes in both name and implementation.

iDiff represents a system at a version as an attributed, directed multigraph in which nodes represent program entities, edges represent different types of interactions among nodes, and attributes carry the information of the corresponding entities, such as their names/signatures. iDiff is interested in two main groups of entities: classes and methods. The attributes and relations/interactions of the entities reflect three main types of interactions in an object-oriented system: method invocation, data access, and inheritance.

For each entity, there are three types of attributes and relations/interactions: signature, inheritance relations, and collaboration relations. The signature of an entity e contains its name and interface information. The attributes of inheritance relations refer to the other entities in the system that inherit from e or are inherited by e. The attributes of collaboration relations refer to the entities that use e in their bodies or are used by e. Thus, for each entity e, each set $I_i(e)$ is an interaction set of e, corresponding to one kind of interaction associated with e.

Differencing of program entities is then formulated as matching between the entities of two versions, with the similarity of the entities' interactions as the matching criterion. Unmatched nodes are considered deleted/added entities.
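As a rough illustration of the representation described above, the following Java sketch models a version as a directed multigraph with typed edges and derives deleted/added entities from a given matching. It is only a sketch under stated assumptions: every name in it (InteractionGraphSketch, Kind, Edge, entity, interaction, sources, deleted, added, and the example* helpers) is hypothetical and not taken from iDiff's implementation, the graphs encode only part of the DragNDropTool example of Figure 1, and the matching is supplied by hand rather than computed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch only: all names here are assumptions, not iDiff's code. */
final class InteractionGraphSketch {

    /** The three interaction types named in the text. */
    enum Kind { INVOCATION, DATA_ACCESS, INHERITANCE }

    /** A directed, typed edge; parallel edges between two nodes are allowed. */
    static final class Edge {
        final String from;
        final String to;
        final Kind kind;

        Edge(String from, String to, Kind kind) {
            this.from = from;
            this.to = to;
            this.kind = kind;
        }
    }

    // Node id -> signature attribute (name and interface information).
    private final Map<String, String> signatures = new LinkedHashMap<>();
    private final List<Edge> edges = new ArrayList<>();

    InteractionGraphSketch entity(String id, String signature) {
        signatures.put(id, signature);
        return this;
    }

    InteractionGraphSketch interaction(String from, String to, Kind kind) {
        if (!signatures.containsKey(from) || !signatures.containsKey(to)) {
            throw new IllegalArgumentException("unknown entity in " + from + " -> " + to);
        }
        edges.add(new Edge(from, to, kind));
        return this;
    }

    /** Entities with an edge of the given kind pointing at id, e.g. its callers. */
    Set<String> sources(String id, Kind kind) {
        Set<String> result = new TreeSet<>();
        for (Edge e : edges) {
            if (e.to.equals(id) && e.kind == kind) {
                result.add(e.from);
            }
        }
        return result;
    }

    /** Old-version entities without a match are reported as deleted. */
    static Set<String> deleted(InteractionGraphSketch oldG, Map<String, String> matching) {
        Set<String> result = new TreeSet<>(oldG.signatures.keySet());
        result.removeAll(matching.keySet());
        return result;
    }

    /** New-version entities that no old entity maps to are reported as added. */
    static Set<String> added(InteractionGraphSketch newG, Map<String, String> matching) {
        Set<String> result = new TreeSet<>(newG.signatures.keySet());
        result.removeAll(matching.values());
        return result;
    }

    /** Part of DragNDropTool in version 5.4b1 (Figure 1). */
    static InteractionGraphSketch exampleOld() {
        return new InteractionGraphSketch()
            .entity("AbstractTool", "class AbstractTool")
            .entity("DragNDropTool", "class DragNDropTool")
            .entity("activate", "void activate()")
            .entity("deactivate", "void deactivate()")
            .entity("setDragSourceActive", "void setDragSourceActive(boolean)")
            .entity("comps", "List comps")
            .interaction("DragNDropTool", "AbstractTool", Kind.INHERITANCE)
            .interaction("activate", "setDragSourceActive", Kind.INVOCATION)
            .interaction("deactivate", "setDragSourceActive", Kind.INVOCATION)
            .interaction("setDragSourceActive", "comps", Kind.DATA_ACCESS);
    }

    /** Part of DragNDropTool in version 5.4b2 (Figure 1). */
    static InteractionGraphSketch exampleNew() {
        return new InteractionGraphSketch()
            .entity("AbstractTool", "class AbstractTool")
            .entity("DragNDropTool", "class DragNDropTool")
            .entity("activate", "void activate()")
            .entity("deactivate", "void deactivate()")
            .entity("setDragOn", "void setDragOn(boolean)")
            .entity("dragOn", "boolean dragOn")
            .entity("dragGestureListener", "DragGestureListener dragGestureListener")
            .interaction("DragNDropTool", "AbstractTool", Kind.INHERITANCE)
            .interaction("activate", "setDragOn", Kind.INVOCATION)
            .interaction("deactivate", "setDragOn", Kind.INVOCATION)
            .interaction("setDragOn", "dragOn", Kind.DATA_ACCESS);
    }

    /** Hand-written matching (old id -> new id), assumed rather than computed. */
    static Map<String, String> exampleMatching() {
        return Map.of(
            "AbstractTool", "AbstractTool",
            "DragNDropTool", "DragNDropTool",
            "activate", "activate",
            "deactivate", "deactivate",
            "setDragSourceActive", "setDragOn");
    }

    public static void main(String[] args) {
        Map<String, String> matching = exampleMatching();
        System.out.println("deleted: " + deleted(exampleOld(), matching));
        System.out.println("added:   " + added(exampleNew(), matching));
    }
}
```

With the hand-written matching, the sketch reports comps as deleted and dragGestureListener and dragOn as added, while setDragSourceActive and setDragOn share the same caller set (activate and deactivate). Storing edges as a list of typed triples is one simple way to permit several edges of different kinds between the same two nodes; other encodings would serve equally well.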
Matched nodes with different attributes and/or neighborhoods are considered modified entities.

Definition 1 (Matched entities): An entity $u$ in version $M$ is matched to an entity $u'$ in version $N$ of the system, denoted by $u \equiv u'$, if their interaction similarity, denoted by $sim(u, u')$, is maximal and sufficiently large, i.e., $sim(u, u') \to \max$ and $sim(u, u') \ge \sigma$.

Definition 2 (Interaction similarity): The interaction similarity between two entities in two versions is the weighted sum of the similarities of all of their interaction sets and is computed as

$$sim(u, u') = \sum_i \alpha_i \cdot ssim(I_i(u), I_i(u'))$$

where the $\alpha_i$ are non-negative weights such that $\sum_i \alpha_i = 1$.

In this formula, each set $I_i(u)$ is an interaction set of $u$: a set of entities corresponding to one kind of interaction associated with $u$. An entity $u$ can have multiple interaction sets. For example, if $u$ is a class, its interaction sets are SuperClass(u), SubClass(u), Containee(u), Container(u), and Users(u). If $m$ is a method, its interaction sets include Caller(m), Callee(m), Local(m), Overrider(m), and Overridee(m). Each interaction set of $u$ is compared with the corresponding interaction set of $u'$ in the new version. To compare two sets $P$ and $P'$, we define $ssim(P, P')$ as follows.

Definition 3 (Set similarity): The similarity between two sets of entities $P$ and $P'$ is the ratio between the number of their matched entities and the total number of their entities:

$$ssim(P, P') = \frac{2 \cdot |P \otimes P'|}{|P| + |P'|}$$

$P \otimes P'$ denotes the matched subset of $P$ and $P'$, covering all matched entities in $P$ and $P'$. Definition 3 means that $P$ and $P'$ are more similar if they contain more matched entities. For instance, in Figure 1, Caller(setDragSourceActive) and Caller(setDragOn) both consist of activate and deactivate; once those two callers are matched across versions, the caller sets contain two matched pairs, so $ssim = (2 \cdot 2)/(2 + 2) = 1$. Formally, the matched subset is defined as follows.

Definition 4 (Matched subset): The matched subset of two sets of entities $P$ and $P'$ is the list of pairs of their matched entities, i.e., $P \otimes P' = \{(u, u') \mid u \in P,\ u' \in P',\ u \equiv u'\}$.

To efficiently measure the similarity and match the entities, we use two divide-and-conquer
heuristics. First, the candidates are discovered based on interactions, i.e., when two entities are matched, the entities related to them, such as their containing classes, superclasses, callers, and callees, become candidates for further matching. Second, the entities are matched incrementally, i.e., the already-matched entities are used to calculate/update the interaction similarities of the remaining matching candidates. We also use the similarity measures of traditional origin analysis (similarity in name, code, and calling structure) to reduce the number of comparisons.

IV. TOOL IMPLEMENTATION

iDiff is implemented as an Eclipse plug-in. The two versions of the system to be compared are imported into the Eclipse workspace. iDiff uses JavaModel and ASTParser in the JDT plug-ins to parse a project and obtain all type and method binding information. When two versions are selected, iDiff compares them and displays the changes to classes/methods. The icons next to classes/methods signify modifications, additions, or deletions. When an original method is selected, the corresponding one in the new version is highlighted.

V. EVALUATION

This section discusses the empirical evaluation of iDiff on real-world subject systems. All experiments were carried out on a computer with an Intel Core 2 Duo T6500 2.10 GHz CPU, 4GB RAM, and Windows Vista.

A. Accuracy

To evaluate the quality of change detection in iDiff, we conducted two experiments. First, we manually checked the results. Second, we compared our results with those of Kim et al.'s API matching tool [11], which has been verified on a number of real-world systems.

In the first experiment, we executed iDiff on different version pairs of jHotDraw and SVNKit (see Table I). They were chosen due to the availability of their source code and their rich sets of documentation. Columns Matched and Checked show the numbers of method-level matches that were returned by iDiff and that were manually checked, respectively. Columns √ and X display the numbers of correctly and incorrectly detected matches after checking. TP (True Positive) shows the precision, which is the ratio of the number of correctly detected pairs to the total number of checked, detected pairs.

TABLE I
PRECISION OF ENTITY MATCHING

Version pair    Matched   Checked   √     X    TP
jHotDraw
5.2-5.3         77        77        77    0    100%
5.3-5.4b1       81        81        75    6    93%
5.4b1-5.4b2     11        11        10    1    91%
5.4b2-6.0b1     3,062     100       100   0    100%
SVNKit
1.1.8-1.2.3     516       100       100   0    100%
1.2.3-1.3.3     125       100       99    1    99%
1.3.3-1.3.4     13        13        13    0    100%

iDiff's precision is very high, with only a few incorrect pairs. For two pairs of versions (jHotDraw 5.4b1-5.4b2 and SVNKit 1.3.3-1.3.4), there are only 11 and 13 matches, respectively. For the pairs with many changes, we checked only 100 randomly chosen matches. In all cases, iDiff's precision is higher than 90%, and in most cases it reaches 99% or higher.

In the second experiment, we compared the performance of iDiff with that of the origin analysis tool of Kim et al. (Kim's tool [11]). In this experiment, we used our tool to detect only the mapped methods and compared the results with those of Kim's tool, which reports only at the method level. We executed both iDiff and Kim's tool on several consecutive version pairs of jFreeChart, jHotDraw, and SVNKit. From the outputs of the two tools, all method-level matches were compared to find the common set of matches (column ∩), the set of matches that were returned by iDiff but not by Kim's (column iDiff-Kim), and the set of matches that were found by Kim's but not by iDiff (column Kim-iDiff). Those differences were further manually checked to see whether they are correct (column √), incorrect (column X), or undecidable (column ?). For each group, we also computed the ratios of correct pairs and of incorrect pairs to the total number of pairs: columns TP (True Positive) and FP (False Positive), respectively.

Table II shows the comparison results. In most of the cases, iDiff produces many more matches than Kim's tool. The number of pairs that iDiff can detect but Kim's cannot is larger than the number of pairs that Kim's can detect but our tool cannot. iDiff also achieves high precision (more than 95% in most of the cases). Importantly, in many cases, iDiff's results cover Kim's.

B. Time Efficiency

Column Time in Table II shows the total running time of iDiff compared to Kim's tool. The results show that iDiff is more scalable than Kim's tool. It takes less than 30s to match two versions of jHotDraw (70 KLOCs). For the larger systems, SVNKit (150 KLOCs) and JFreeChart (200 KLOCs), running time is less than ten minutes.

VI. RELATED WORK

Program Differencing. Traditional program differencing techniques (CVS's diff [17], Ldiff [2]) are text-based: they compare lines of code without considering the program's structure/semantics. Reiss' diff tool works on program tokens [16]. Other advanced diff tools operate on programs' Abstract Syntax Trees [6], [13], [3], [14]. iDiff also compares the interactions of classes/methods and is thus able to detect complex changes such as renamed, moved, and modified methods. JDiff [1] builds enhanced control flow graphs and matches their nodes, but it is still name/signature-based for entity matching. Horwitz's [8] and Raghavan et al.'s [15] approaches compare the corresponding program dependence graphs in two versions. Compared to iDiff, those tools compare internal implementations without considering how the methods/classes are used. Thus, they cannot detect largely modified and renamed methods. UMLDiff [20] reports design-level structural changes for class models reverse-engineered from Java code. Mehra et al. [12] develop a visual differencing algorithm for design diagrams in the Pounamu tool. In comparison, iDiff focuses on fine-grained code changes and interactions rather than design changes.

Origin Analysis. Other related tools are origin analysis ones [18], [7], [11]. The origin analysis tool from Kim et al. [11] uses the similarity
factors including methods' names and signatures, caller/callee sets, textual contents, and complexity metrics. Compared to Kim's tool, iDiff has several crucial advances. First, while comparing the caller/callee sets of two methods in two versions, Kim's tool relies on the name similarity of the methods in those sets. Thus, when both caller and callee are renamed, their matching would fail. iDiff still works because it is based on the usages of callers and callees and on previously matched entities. Second, iDiff is also more advanced in incremental matching: it uses the knowledge from already-matched code entities to incrementally update the similarities of not-yet-matched entities. Finally, to compare methods' bodies, Kim's tool uses textual clone detection, while iDiff uses internal interactions and is thus better at detecting renamed/moved methods that are also largely modified.

To map two program entities across two versions, BEAGLE [18] first computes internal similarity based on metrics such as Cyclomatic, S-, and D-Complexity. Then, it computes the similarity of the calling structures of two methods, in which it considers the methods with the same names in both versions as the matched methods. From such name-matching methods, it examines callee methods and detects other matching pairs. Compared to BEAGLE, iDiff further analyzes the usage sets of those methods, regardless of whether their names match. Thus, iDiff still works when methods are renamed/refactored.

LSdiff by Kim and Notkin [10] uses a rule inference algorithm to capture code changes at method headers. Their later work [9] infers complex rules to describe changes to classes/fields/methods, their containment relationships, and structural dependencies. Vdiff [5] outputs syntactic changes in a Verilog program and provides position-independent differencing.

TABLE II
COMPARISON OF ENTITY MATCHING
("..." marks a value missing from this copy; the X and ? columns and the remaining Kim - iDiff columns are not recoverable and are omitted.)

                                       iDiff - Kim              Kim - iDiff   Time
Version pair     iDiff   Kim   ∩      ∑      √     TP     FP    ∑             iDiff    Kim
jFreeChart
0.9.17-0.9.18    52      34    34     18     18    100%   0%    0             97s      13s
0.9.18-0.9.19    145     48    48     97     94    97%    1%    0             87s      202s
0.9.19-0.9.20    ...     ...   ...    ...    ...   ...    ...   ...           78s      4s
0.9.20-0.9.21    1504    n/a   ...    1504   n/a   n/a    n/a   n/a           99s      >1h
jHotDraw
5.2-5.3          77      n/a   ...    77     76    99%    1%    n/a           24s      >1h
5.3-5.4b1        81      60    59     22     18    82%    36%   ...           29s      201
5.4b1-5.4b2      11      ...   ...    ...    ...   100%   0%    ...           24s      17s
5.4b2-6.0b1      3062    n/a   ...    3062   n/a   n/a    n/a   n/a           23s      >1h
SVNKit
1.1.8-1.2.3      516     291   288    228    204   90%    10%   3             10m24s   35m37s
1.2.3-1.3.3      125     108   105    20     15    75%    25%   3             2m15s    2m24
1.3.3-1.3.4      13      13    13     ...    ...   ...    ...   ...           2m6s     5s

Refactoring Recovery. These approaches aim to recover refactoring operations that were performed on a program [19], [4]. Dig et al. [4] recover refactorings, including renamed entities, by treating methods with similar Shingles codes as renamed/moved ones. Shingles is an integer encoding scheme for tokens. Among other refactorings, Weissgerber and Diehl [19] detect renamed/moved methods by using clone detection. Compared to iDiff, those approaches do not take into account the external interactions of methods (i.e., how the methods are used). Thus, they cannot detect the case in which the implementation of a method is largely changed but the method is still used in the same contexts as before.

VII. CONCLUSIONS

In this paper, we present iDiff, a program differencing tool that displays the changes to classes/methods between two versions and tracks the corresponding classes/methods even if they were reordered, moved, renamed, and/or modified. The key idea is that during software evolution, an entity could change its location, name, order, and even its internal implementation, whereas its interactions with other entities tend to be more stable. Entities in two versions are matched via an incremental matching algorithm that uses the similarity of interactions as the matching criterion. Our empirical evaluation shows that iDiff achieves higher accuracy and efficiency than a state-of-the-art tool.

Acknowledgment. This project is funded in part by NSF grant CCF-1018600. The third author was funded in part by a fellowship from the Vietnamese Education Foundation (VEF).

REFERENCES

[1] T. Apiwattanapong, A. Orso, and M. J. Harrold. A differencing algorithm for object-oriented programs. In ASE'04. IEEE CS.
[2] G. Canfora, L. Cerulo, and M. Di Penta. Ldiff: An enhanced line differencing tool. In ICSE'09, pp. 595–598. IEEE CS, 2009.
[3] R. Cottrell, J. J. C. Chang, R. J. Walker, and J. Denzinger. Determining detailed structural correspondence for generalization tasks. In ESEC/FSE'07, pp. 165–174. ACM.
[4] D. Dig and R. Johnson. Automated detection of refactorings in evolving components. In ECOOP'06. Springer, 2006.
[5] A. Duley, C. Spandikow, and M. Kim. A program differencing algorithm for Verilog HDL. In ASE'10. IEEE CS, 2010.
[6] B. Fluri, M. Würsch, M. Pinzger, and H. C. Gall. Change distilling: Tree differencing for fine-grained source code change extraction. IEEE TSE, 33(11):18, Nov. 2007.
[7] M. W. Godfrey and L. Zou. Using origin analysis to detect merging and splitting of source code entities. IEEE TSE, 31(2):166–181, 2005.
[8] S. Horwitz. Identifying the semantic and textual differences between two versions of a program. In PLDI'90. ACM Press.
[9] M. Kim and D. Notkin. Discovering and representing systematic code changes. In ICSE'09. IEEE CS, 2009.
[10] M. Kim, D. Notkin, and D. Grossman. Automatic inference of structural changes for matching across program versions. In ICSE'07. IEEE CS.
[11] S. Kim, K. Pan, and E. J. Whitehead, Jr. When functions change their
names: Automatic detection of origin relationships. In WCRE'05.
[12] A. Mehra, J. Grundy, and J. Hosking. A generic approach to supporting diagram differencing and merging for collaborative design. In ASE'05, pp. 204–213, 2005.
[13] I. Neamtiu, J. Foster, and M. Hicks. Understanding source code evolution using abstract syntax tree matching. In MSR'05.
[14] J. I. Maletic and M. L. Collard. Supporting source code difference analysis. In ICSM'04, pp. 210–219. IEEE CS, 2004.
[15] S. Raghavan, R. Rohana, D. Leon, A. Podgurski, and V. Augustine. Dex: A semantic-graph differencing tool for studying changes in large code bases. In ICSM'04. IEEE CS.
[16] S. P. Reiss. Tracking source locations. In ICSE'08. ACM Press.
[17] W. F. Tichy. The string-to-string correction problem with block moves. ACM Trans. on Computer Systems, 2(4):309–321, 1984.
[18] Q. Tu and M. W. Godfrey. An integrated approach for studying architectural evolution. In IWPC'02, pp. 127–136. IEEE CS.
[19] P. Weissgerber and S. Diehl. Identifying refactorings from source-code changes. In ASE'06, pp. 231–240. IEEE CS, 2006.
[20] Z. Xing and E. Stroulia. UMLDiff: An algorithm for object-oriented design differencing. In ASE'05. ACM, 2005.