Merge Conflicts and Resolutions in Git-based Open Source Projects
In this chapter, we present our study of merge conflicts and their resolution in Git-based open source projects.
Version control systems make it easy for users to work in parallel on a shared project. As a consequence, the project can easily diverge: two users hold two different versions of the project and need to ‘synchronize’ or ‘integrate’ them to obtain a common result. Version control systems provide a ‘merge’ function that combines parallel changes made by different users. Allowing concurrent changes is very important for supporting collaborative work; however, merging parallel changes may take a significant amount of time, which can offset the productivity gain. The most basic and effective merging technique is ‘textual merging’ [20]. Textual merging considers software artifacts (i.e., text files, source files, configuration files) as flat text files whose indivisible unit (i.e., atom) is the line of text. This is also called the line-based merging approach.
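To illustrate what it means for the line to be the atom, the following minimal sketch uses Python's standard `difflib` module to report which lines of a base version an edit touches. The function name `changed_line_ranges` and the sample strings are illustrative assumptions; this is not the merge algorithm of any particular version control system.

```python
from difflib import SequenceMatcher


def changed_line_ranges(base: str, edited: str) -> list[tuple[int, int]]:
    """Return half-open (start, end) ranges of base lines touched by an edit.

    Files are compared as lists of lines, so changing a single character
    marks the whole line as changed.
    """
    base_lines = base.splitlines()
    edited_lines = edited.splitlines()
    matcher = SequenceMatcher(a=base_lines, b=edited_lines, autojunk=False)
    return [
        (i1, i2)
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical sample: one character is appended to the second line.
base = "alpha\nbeta\ngamma\n"
edited = "alpha\nbeta!\ngamma\n"
print(changed_line_ranges(base, edited))  # [(1, 2)]
```

Although only one character differs, the whole second line (index 1) is reported as changed, which is exactly the granularity at which a line-based merge reasons.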
Studies showed that in large projects the partitioning of software modules among developers is limited and developers can contribute to any part of the code [73]. This means that two users can edit the same file concurrently. If they edit different parts of the file, their changes are ‘conflicting’ and need to be integrated. Most version control systems can merge concurrent changes to the same file automatically. However, if the users edit the same part of the file, which usually means the same line or two adjacent lines, the system cannot merge their changes successfully. This case is denoted as an ‘unresolved conflict’ and the users have to resolve it manually. Unresolved conflicts also occur if a file is renamed and modified/deleted concurrently, if it is modified and deleted concurrently, if it is renamed concurrently by two users, if a user renames a file to the same name that another user gives to a concurrently created file, and if two users concurrently add two files with the same name. Note that by the name of a file we mean the whole path identifying that file.
Conflicts are costly as they delay the development process [67]. Between the time conflicts occur and the time they are discovered and understood, they may grow and become difficult to resolve. Developers may postpone integrating parallel work because they fear that conflicts will be hard to resolve. This concern about potential conflicts makes parallel work diverge further, so conflicts become more likely to happen and to grow.
In this work, we studied conflicts between concurrent editing activities in Git, the most popular decentralized version control system [17]. In Git, users can synchronize their changes with those of other users working in parallel. During this process, a merge is performed between local changes and remote changes, and conflicts can occur. Understanding how often and when conflicts are more likely to happen during the development process, and how users resolve them, can help in proposing awareness mechanisms that prevent conflicts from happening. This study could also help in proposing better merging approaches that minimize the conflicts users have to resolve manually. We analysed traces of projects developed with Git in order to quantify the different types of textual conflicts at file level that arise in the different phases of the development cycle. One particular type of unresolved conflict that we study concerns adjacent lines. If concurrent changes occur on two adjacent lines, Git signals an unresolved conflict, but it does not do so when the changed lines are separated by two or more lines. As there is no reason why these cases should be treated differently, we aim to study whether developers resolve them differently. We also aim to quantitatively measure user satisfaction with a merge after a conflict resolution, in terms of how often users roll back to a previous version.
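The distinction between edits on the same lines, on adjacent lines, and on lines separated by unchanged text can be made concrete with a small sketch. It assumes each edit is described by a non-empty, half-open range of line indices in the common base version; the function name `classify_edits` and the three labels are illustrative assumptions, and the code does not reproduce Git's actual merge logic.

```python
def classify_edits(a: tuple[int, int], b: tuple[int, int]) -> str:
    """Classify two concurrent edits given as half-open base-line ranges.

    Assumes both ranges are non-empty (pure insertions are not handled).
    """
    (a_start, a_end), (b_start, b_end) = a, b
    if a_start < b_end and b_start < a_end:
        return "overlapping"  # both edits touch at least one common line
    # Number of untouched base lines lying between the two edits.
    gap = max(a_start, b_start) - min(a_end, b_end)
    return "adjacent" if gap == 0 else "separated"


# Hypothetical ranges, for illustration only.
print(classify_edits((1, 3), (2, 4)))  # overlapping
print(classify_edits((1, 2), (2, 3)))  # adjacent
print(classify_edits((1, 2), (4, 5)))  # separated
```

A study of adjacent-line conflicts would count the cases labelled `adjacent` separately from those labelled `overlapping`, since only the latter involve two users changing the very same line.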
Although several existing studies focused on parallel changes and conflicts in Git-based projects [15, 16], they did not analyse fine-grained conflicts at file level or their resolution mechanisms.
Rather than using textual merging, several tools use syntactic or semantic merging [20]. Syntactic merging takes the syntax of software artifacts into account, while semantic merging considers semantic information. Studies such as [15] and [16] considered both textual conflicts, which result from textual merging, and higher-order conflicts, which are conflicts at the semantic level that cause compilation errors or test failures. Other studies such as [49] examined indirect conflicts, where changes to one software artefact affect concurrent changes to another artefact. In [49] the authors proposed the social call graph, which describes dependencies between software developers for a piece of code. The social call graph combines the call graph data structure, which contains all the dependency relationships of a software application, with authorship information. Our study does not investigate indirect and higher-level conflicts and focuses solely on conflicts related to the same file, with particular attention to textual conflicts as handled by the main DVCSs such as Git.
The rest of this chapter is organized as follows. Section 3.2 presents our measurements of conflicts during the merge process. Section 3.3 discusses the design implications of our study and some limitations of analysing Git repositories. Section 3.4 gives some concluding remarks.
Measurements
In order to measure the level of parallelism and the proportion of conflicting modifications in DVCSs, we adopted an experimental methodology in which we analysed the corpus of four large open-source projects developed using Git:
— Ruby on Rails [32] is a web framework, with integrated support for unit, functional, and integration testing. We analysed version 5.0.0.alpha of this project.
— IkiWiki [33] is a wiki software system that compiles wiki pages into HTML pages for publication. We analysed IkiWiki version 3.0.
— Samba [34] is an implementation of networking protocols to share files and printers between Unix computers and Windows computers. We analysed Samba 3.0.x.
— Linux Kernel [35] is an implementation of a Unix-like computer operating system kernel. We analysed version 4.x of the Linux kernel.
Besides their large size and popularity, these projects are representative of the different pull-based [46] software development models that they adopt. In practice, the core development team maintains at least one repository as the primary repository, where the latest approved changes can be found. Contributors can clone this official repository. However, only the core development team has write access to commit directly. Other contributors need to use the pull-based development model, in which a contributor creates a pull-request for their changes. A core team member then inspects the changes and decides whether to pull and merge them into the main repository. In some cases, contributors are asked to update or add changes before their pull-request is accepted. Nowadays, the pull-based model is naturally supported by web-based hosting services such as [47] and [48].
The Rails project uses the pull-request model, which is naturally supported by [47]. Contributors can fork (clone) the official GitHub repository and contribute via GitHub’s pull-requests. In a pull-request, reviewers and the contributor communicate directly through pull-request comments. These comments are visible to other users, who can join the conversation. Afterwards, a pull-request is either merged into the main line or declined.
IkiWiki is run like a private repository: contributors send their patches to Joey Hess, the main developer of the project.
Samba uses a repository shared among registered contributors. It uses an auto-build system in its code-review process. Contributors need to join a technical mailing list before contributing.
The Linux Kernel uses a pull-based model via mailing lists. Contributors need to send their patches to the mailing list of the appropriate subsystem maintainer, who is in charge of that part of the project.
Table 3.2 presents some details about these projects: the period of their development (until 05-October-2015), the number of commits, the number of contributors (authors), the number of files created during the lifetime of the project and the number of files existing on 05-October-2015. Note that if a file was moved from one place to another during the lifetime of the project, we counted it as a newly created file.
In contrast to CVCSs, Git does not provide centralized logging of all user activities. The best overview of user activities is provided by the commit history (including merges) of the primary repository. To identify concurrency and conflicts in each project, we created a shadow repository and recursively re-integrated developers’ changes into this repository. In other words, by means of Python scripts [80] we replayed all merges that were performed during the development period of each project.
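The replay procedure could be approximated along the following lines. This is a minimal sketch, not the scripts referenced in [80]: it assumes a scratch clone of the repository that may be freely checked out, handles only two-parent merges, and relies on the standard `git rev-list`, `git checkout`, `git merge`, and `git diff` commands. All function names are hypothetical.

```python
import subprocess


def git(repo: str, *args: str) -> subprocess.CompletedProcess:
    """Run a git command inside `repo`, capturing output without raising."""
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True)


def parse_merge_line(line: str) -> tuple[str, list[str]]:
    """Split one line of `git rev-list --merges --parents` output."""
    commit, *parents = line.split()
    return commit, parents


def list_merges(repo: str) -> list[tuple[str, list[str]]]:
    """Return (merge commit, parent commits) for every merge reachable from HEAD."""
    out = git(repo, "rev-list", "--merges", "--parents", "HEAD").stdout
    return [parse_merge_line(line) for line in out.splitlines() if line.strip()]


def replay_merge(repo: str, parents: list[str]) -> list[str]:
    """Redo a two-parent merge in the scratch clone; return unmerged paths."""
    first, second = parents[:2]
    git(repo, "checkout", "--quiet", "--detach", first)
    result = git(repo, "merge", "--no-commit", "--no-ff", second)
    conflicted: list[str] = []
    if result.returncode != 0:
        # Paths still unmerged after git's automatic merge attempt.
        listing = git(repo, "diff", "--name-only", "--diff-filter=U")
        conflicted = listing.stdout.splitlines()
    git(repo, "merge", "--abort")  # leave the working tree clean for the next replay
    return conflicted


# Parsing alone needs no repository (the hashes here are placeholders).
print(parse_merge_line("m1 p1 p2"))  # ('m1', ['p1', 'p2'])
```

Iterating `replay_merge` over the result of `list_merges` on a scratch clone would yield, for each historical merge, the files that Git could not merge automatically. Merges with more than two parents would need separate handling.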
Integrations and conflicts on files
We first determined the number of concurrent updates to the same file and then the number of concurrent updates to the same file that resulted in unresolved conflicts. Similarly to [14], we computed the integration rate and conflict rate, as provided in Table 3.2. File updates is the total number of updates to files; a file can be updated several times throughout the development cycle. The integration rate is the proportion of concurrent updates to the same file over all updates to files. The conflict rate is the proportion of concurrent updates to the same file that resulted in unresolved conflicts over all concurrent updates to files. The file updates were collected from all commits of the project. By re-integrating all developers’ changes, we computed the concurrent updates to the same file and those that resulted in unresolved conflicts.
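The two rates follow directly from the three counts defined above. The sketch below states them as code; the function names are assumptions and the counts in the example are invented for illustration, not taken from Table 3.2.

```python
def integration_rate(concurrent_updates: int, file_updates: int) -> float:
    """Share of all file updates that were concurrent updates to the same file."""
    return concurrent_updates / file_updates


def conflict_rate(unresolved_conflicts: int, concurrent_updates: int) -> float:
    """Share of concurrent updates that ended in an unresolved conflict."""
    return unresolved_conflicts / concurrent_updates


# Hypothetical counts, for illustration only.
file_updates = 1000
concurrent_updates = 200
unresolved_conflicts = 10

print(f"{integration_rate(concurrent_updates, file_updates):.2%}")      # 20.00%
print(f"{conflict_rate(unresolved_conflicts, concurrent_updates):.2%}")  # 5.00%
```

Note that the two rates use different denominators: the integration rate is relative to all file updates, whereas the conflict rate is relative to concurrent updates only.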
We can notice that the Kernel and Rails projects have larger integration rates than IkiWiki and Samba. For instance, the integration rate of the Kernel project is 10 times larger than that of IkiWiki and 16 times larger than that of Samba. This can be explained by the large size of the Kernel project in terms of the number of files. In contrast to the integration rate, Rails and Kernel have smaller conflict rates than IkiWiki (50.50%) and Samba (87.84%). Rails is a large project that takes advantage of GitHub, which supports the pull-based model naturally. The GitHub interface allows not only the author of a pull-request and the reviewer but also other contributors and core team members to discuss that pull-request and its issues. This brings the major advantage of sharing collaborators’ knowledge to solve problems during integration. The Linux Kernel uses a pull-based model via mailing lists with a list of subsystem maintainers. It also has a list of delegated servers, such as linux-next, where commits are tested before they are pushed to the primary repository [61]. On the other hand, Samba uses repositories shared among contributors and IkiWiki is maintained as a private repository by Joey Hess. Nowadays, all of them provide a list of to-do tasks and a list of bugs on which contributors can focus their work to avoid conflicting integrations.
The lack of a central server that holds a reference copy of the project introduces more parallelism between user versions, allowing them to diverge more in DVCSs than in CVCSs. For instance, the Kernel, Rails, IkiWiki and Samba projects developed in Git have significantly (99% confidence level) higher integration rates (22, 8, 2 and 1.5 times respectively) than the projects in CVS analysed in [14]. However, a higher integration rate does not result in a higher conflict rate. For instance, Kernel and Rails have conflict rates 5 and 1.5 times lower, respectively, than the projects in CVS, whereas Samba and IkiWiki have conflict rates almost 2 times higher. The conflict rate in Git projects depends on how the collaboration process is managed.
We also measured the proportion of the different conflict types: content conflicts, referring to conflicts inside a file; remove/update conflicts, referring to concurrent removal and update of a file; and naming conflicts, referring to concurrent renaming of the same file or of two files with the same name. Table 3.3 presents the proportion of conflict types in the four projects. We found that content conflicts are by far the most common type, accounting for 46% to 90% of all conflicts.
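One possible way to bucket conflicted files into these three types is to read the two-letter codes that `git status --porcelain` prints for unmerged paths (`UU`, `UD`, `DU`, `AA`, `AU`, `UA`, `DD`). The mapping from codes to the categories used in this chapter is an assumption made for this sketch, as are the names `CONFLICT_TYPE_BY_STATUS` and `classify_porcelain`; it is not necessarily how the study classified conflicts.

```python
# Assumed mapping from git's unmerged status codes to the three conflict types.
CONFLICT_TYPE_BY_STATUS = {
    "UU": "content",        # both sides modified the file
    "UD": "remove/update",  # modified by us, deleted by them
    "DU": "remove/update",  # deleted by us, modified by them
    "AA": "naming",         # both sides added a file at this path
    "AU": "naming",         # path added by us only (typically after a rename)
    "UA": "naming",         # path added by them only (typically after a rename)
    "DD": "naming",         # both sides removed this path (typically after renames)
}


def classify_porcelain(output: str) -> dict[str, str]:
    """Map each unmerged path in `git status --porcelain` output to a conflict type."""
    result: dict[str, str] = {}
    for line in output.splitlines():
        code, path = line[:2], line[3:]
        if code in CONFLICT_TYPE_BY_STATUS:
            result[path] = CONFLICT_TYPE_BY_STATUS[code]
    return result


# Hypothetical status output: two unmerged paths and one ordinary modification.
sample = "UU src/a.c\nUD docs/b.txt\n M src/c.c\n"
print(classify_porcelain(sample))
# {'src/a.c': 'content', 'docs/b.txt': 'remove/update'}
```

Paths that git quotes because they contain special characters would need unquoting, which this sketch omits.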
Integrations and conflicts based on release dates
In the previous section we presented the integration rate and conflict rate of the four projects over their whole development period. However, activities are not equally distributed over the development life-cycle. Our hypothesis is that collaborative activities peak around project release dates, for example in the one or two weeks before a release date. To gain a better understanding of collaboration during those active periods, we analysed integrations and conflicts relative to project release dates.
Figure 3.1 illustrates four active periods, each one week long, before and after the release date (RD). We denote these periods as follows: B2W (from two weeks before RD to one week before RD), B1W (from one week before RD to RD), A1W (from RD to one week after RD) and A2W (from one week after RD to two weeks after RD). We also analysed B4W, B3W, A3W and A4W.
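Assigning an event such as a merge to one of these periods amounts to comparing its timestamp with the release date. The sketch below assumes half-open one-week intervals, so that an event exactly at RD falls into A1W; this boundary convention, the function name `release_period`, and the dates in the example are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def release_period(event: datetime, release: datetime) -> Optional[str]:
    """Return B4W..B1W or A1W..A4W for `event`, or None if outside four weeks."""
    week = timedelta(weeks=1)
    offset = event - release
    for n in range(1, 5):
        # BnW covers [RD - n weeks, RD - (n - 1) weeks).
        if -n * week <= offset < -(n - 1) * week:
            return f"B{n}W"
        # AnW covers [RD + (n - 1) weeks, RD + n weeks).
        if (n - 1) * week <= offset < n * week:
            return f"A{n}W"
    return None


# Hypothetical release date and events.
rd = datetime(2015, 6, 15)
print(release_period(datetime(2015, 6, 12), rd))  # B1W
print(release_period(datetime(2015, 6, 5), rd))   # B2W
print(release_period(datetime(2015, 6, 23), rd))  # A2W
```

Counting integrations and conflicts per label then gives the per-period rates around each release.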
Table of contents:
Chapter 1
Introduction
1.1 Study Context
1.2 Research Questions
1.3 Contributions
1.3.1 The first part: Conflicts and resolutions in Git-based open-source projects
1.3.2 The second part: Time-position characterization of CE using ShareLaTeX
1.4 Structure of the Thesis
Chapter 2
State of the art
2.1 Basic notions
2.1.1 Version control system
2.1.2 Git: A decentralized version control system
2.1.3 Real-time collaborative editing
2.2 Studies on Collaborative work based on version control systems
2.3 Studies on Collaborative editing using real-time web-based collaborative editors
2.4 Awareness in collaborative editing
2.5 Chapter conclusion
Chapter 3
Merge Conflicts and Resolutions in Git-based Open Source Projects
3.1 Introduction
3.2 Measurements
3.2.1 Integrations and conflicts on files
3.2.2 Integrations and conflicts based on release dates
3.2.3 Conflict resolution
3.2.4 Adjacent-line conflicts
3.3 Discussion
3.4 Chapter conclusion
Chapter 4
Time-position characterization of Conflicts in Collaborative Editing
4.1 Introduction
4.2 Related work
4.3 Measurements
4.3.1 Time dimension
4.3.2 Time-position analysis
4.4 Algorithms
4.4.1 The calculation of ‘time-distance’ and ‘position-distance’
4.4.2 Time, Position and Time-Position collaborative edits classifying
4.4.3 ‘Potential border conflict’ and ‘Potential insertion conflict’ detection
4.5 Chapter conclusion
Chapter 5
Conclusion and Perspectives
5.1 Conclusion
5.2 Publications
5.3 Perspectives
5.3.1 Analysis of CE’s traces of Git-based projects
5.3.2 Higher-order conflicts and roll-back action
5.3.3 Time-position analysis of CE’s logs collected from real-time web-based collaborative editors
5.3.4 Potential conflicts in real-time collaborative editing