Machine learning – Project topics materials


CHAPTER 2 BACKGROUND

Introduction

This chapter reviews the literature and introduces topics that are fundamental to software defect prediction, a subject that has received substantial attention in the software industry. Software defect data sources, metrics and types of machine learning are explored. Previous research has analysed the effect that metrics have on fault proneness, and a selection of the defect prediction papers published since 2007 is reviewed. The defect data used in previous research emanates from different sources, which include open source and industrial projects (Madeyski & Jureczko 2015: 393-422; Muthukumaran, Choudhary & Murthy 2015: 15-20; Bowes, Hall, Harman, Jia, Sarro & Wu 2016: 330-341; Fukushima, Kamei, McIntosh, Yamashita & Ubayashi 2014: 172-181).

Data sources

In general, most of the data used in software defect prediction is obtained from the freely available open source repositories, which include the source code management systems and bug tracking systems. Other data is sourced from industrial projects.

Company/Industrial data

Software defect data is also sourced from company software development operations. Differences between the development processes of industrial software and open source software (OSS) may affect defect prediction results (Madeyski & Jureczko 2015: 393-422). In company environments, formal, centralised methods are applied in software development, including formal software verification techniques. The software is developed by co-located, well-structured teams, and responsibilities may be divided between the members of a team. Functional teams, in which developers with similar skills are grouped together, are generally used in software development organisations. One team in a company may design the interface, another may focus on database design, and a third may do implementation and testing.
Product teams working on industrial projects are more formally organised than those working on open source projects. In a study conducted by Madeyski & Jureczko (2015: 393-422), at least one process metric improved the prediction models in all versions of the industrial software, but this was not true for nearly half of the analysed versions of the open source projects. This could be due to the more organised development processes of industrial software.

Open source code repository

Task allocations and relationships between users and developers in open source development are less formal. Development processes are more decentralised in open source environments (Madeyski & Jureczko 2015: 393-422). Open source projects are developed as global collaborations of skilled developers, who apply different skills from those used in industrial projects. The authors who commit software modifications in open source development are active in development, while their counterparts are less active. Therefore, less training, support and technical skills are required to develop OSS.

Bug life cycle participants

According to Ullah & Khan (2011: 98-108), there are many contributors in the bug life cycle. Some of their roles and responsibilities are as follows:
Bug reporter
This is the participant who reveals the bug and creates a report for it, by entering the bug data in the bug tracking tool. The reporter inputs the bug details that include the title, bug priority, severity, dependencies and the component where the bug is located.
Bug group
This consists of people who regularly receive updates concerning the bug in a bug report. They include the bug reporter, the developer, tester and the quality assurance manager.
Bug owner
The bug owner ensures that information about the bug in the bug tracking system is adequate. The owner manages bugs and guarantees that, for example, high priority bugs in the system are fixed within the shortest possible time.

Bug tracking system

This is a software system that tracks the progress of a bug. A reported bug is analysed, allocated to a developer, fixed and resolved (Babar, Vierimaa & Oivo 2010: 1-407). The bug tracking system records the characteristics of each bug, such as the date on which the defect was reported, the section in which the bug was located, the commit date and other properties of the bug (Shihab, Ihara, Kamei, Ibrahim, Ohira, Adams, Hassan & Matsumoto 2013: 1005-1042) (see Table 2.1).
The Bugzilla life cycle is displayed in Figure 2.1, according to the manner in which Bugzilla users check and modify the bug status in the database (Sunindyo, Moser, Winkler & Dhungana 2012: 84-102). This life cycle is regarded as the model for developing software projects, particularly OSS projects. The stages in the cycle demonstrate the procedures followed by OSS developers when modifying bugs. The processes employed when modifying the bug status are regarded as engineering processes, and the phases are similar to those in the software development life cycle.
Initially, a bug is presented by users or contributors as an unconfirmed bug. The existence of the bug is then verified and the bug state is changed to ‘new’.
The new bug is allocated to other contributors or fixed immediately. Bugs that are verified as fixed are closed. Some bugs may be wrongly labelled ‘RESOLVED’ and may need to be reopened. The Bugzilla bug states are meant to help contributors specify the status of a bug. Contributors may also define their own state names.
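The life cycle described above can be read as a small state machine. The following Python sketch is illustrative only: the state names follow the classic Bugzilla workflow, but the exact transition table and the function name is_valid_history are assumptions made for this example, and real installations may customise both.

```python
# Simplified, assumed transition table for a Bugzilla-style life cycle.
# Real installations can add or rename states, so this is illustrative only.
ALLOWED_TRANSITIONS = {
    "UNCONFIRMED": {"NEW"},
    "NEW": {"ASSIGNED", "RESOLVED"},      # allocated, or fixed immediately
    "ASSIGNED": {"RESOLVED"},
    "RESOLVED": {"VERIFIED", "REOPENED"},  # a wrong resolution is reopened
    "VERIFIED": {"CLOSED", "REOPENED"},
    "REOPENED": {"ASSIGNED", "RESOLVED"},
    "CLOSED": {"REOPENED"},
}


def is_valid_history(states):
    """Return True if a bug's status history follows the assumed workflow.

    The history must start at UNCONFIRMED and every consecutive pair of
    states must be an allowed transition.
    """
    if not states or states[0] != "UNCONFIRMED":
        return False
    return all(
        nxt in ALLOWED_TRANSITIONS.get(cur, set())
        for cur, nxt in zip(states, states[1:])
    )


# Hypothetical status histories used only to exercise the check.
normal_path = ["UNCONFIRMED", "NEW", "ASSIGNED", "RESOLVED", "VERIFIED", "CLOSED"]
reopened_path = ["UNCONFIRMED", "NEW", "RESOLVED", "REOPENED", "RESOLVED"]
skipped_path = ["UNCONFIRMED", "CLOSED"]
```

Under these assumptions, normal_path and reopened_path are accepted, while skipped_path is rejected because an unconfirmed bug cannot be closed without first being verified.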
The defect data used in this research was extracted from Bugzilla and Jira repositories (Ambros, Lanza & Robbes 2010: 31-41). Bug fixes made to the Mylyn, JDT, Lucene, Equinox and PDE open source projects were saved and used to create the defect files. In general, some of the open source systems are used in business to save time and costs. Contributors write modules and correct reported bugs. The contributors’ updates are saved in defect files that are used in software defect prediction.
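As a rough illustration of how saved bug fixes can be turned into defect data, the Python sketch below counts the bug-fix commits that touched each file and derives a defective/clean label. The record layout, the file paths and the function name count_defects_per_file are hypothetical and are not taken from the data set used in this research.

```python
from collections import Counter


def count_defects_per_file(fix_commits):
    """Count, for each file, how many bug-fix commits touched it."""
    counts = Counter()
    for commit in fix_commits:
        # A set avoids counting a file twice if it is listed twice in one commit.
        for path in set(commit["files"]):
            counts[path] += 1
    return counts


# Hypothetical bug-fix commits, each linked to a bug report identifier.
fix_commits = [
    {"bug_id": 101, "files": ["core/Parser.java", "core/Lexer.java"]},
    {"bug_id": 102, "files": ["core/Parser.java"]},
]

# Hypothetical list of all files in the analysed release.
all_files = ["core/Parser.java", "core/Lexer.java", "ui/Window.java"]

defects = count_defects_per_file(fix_commits)
# A file is labelled defective if at least one bug fix touched it.
labels = {path: defects[path] > 0 for path in all_files}
```

A table of this kind, with one row per file, a defect count and a label, is the typical input to a defect prediction model once the software metrics for each file are added.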


Defect prediction approaches

Software quality is an extensively researched area in the software engineering domain (Seliya, Khoshgoftaar & Van Hulse 2010: 26-34). Techniques such as unit testing, code inspections and defect prediction are applied to reduce defects in quality assurance activities (Seliya et al. 2010: 26; Tan, Peng, Pan & Zhao 2011: 244-248; Ahmed, Mahmood & Aslam 2014: 65-69). Software developers may predict and remove defects in new versions of software (Kastro & Bener 2008: 543-562).

Single version software

Only one version of the software is considered. The assumption is that the present state of the code determines the existence of future defects. Single version approaches do not depend on the software’s historical data, but instead examine its present structure using different metrics.

Versioning systems

Process metrics are derived from the versioning system. These approaches assume that newly or frequently modified files are the most likely source of future bugs. Hassan introduced the concept of entropy to evaluate code modifications (Hassan 2009: 78-88). The FreeBSD, NetBSD, OpenBSD, KDE, KOffice and PostgreSQL applications were used to assess the entropy metrics. The results indicated that the number of previous bugs is a better predictor than the number of previous file modifications.
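To make the entropy idea concrete, the following Python sketch computes Shannon's entropy over the distribution of modifications across files in one period. It is a minimal illustration of the general concept, not a reproduction of Hassan's full set of metrics; the function name change_entropy and the sample counts are assumptions made for this example.

```python
import math


def change_entropy(modifications_per_file):
    """Shannon entropy (in bits) of how modifications are spread over files.

    modifications_per_file maps a file name to the number of modifications
    it received in the period under study. The entropy is 0 when all
    changes hit a single file and is highest when changes are spread evenly.
    """
    total = sum(modifications_per_file.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in modifications_per_file.values():
        if count > 0:
            p = count / total  # share of all modifications made to this file
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical modification counts for two periods.
scattered = {"a.c": 1, "b.c": 1, "c.c": 1, "d.c": 1}  # changes spread evenly
focused = {"a.c": 4, "b.c": 0, "c.c": 0, "d.c": 0}    # changes in one file

scattered_entropy = change_entropy(scattered)
focused_entropy = change_entropy(focused)
```

In this example the evenly scattered period has an entropy of 2 bits, the maximum for four files, while the focused period has an entropy of 0. The intuition is that changes scattered across many files are harder for developers to keep track of.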
A version control system (VCS) is a repository of files that supports the revision of software and the management of application changes. Revisions result from software modifications. VCSs facilitate distributed and collaborative software development. Modifications to software are tracked and references are generated for commits that alter the application (Thongtanunam, McIntosh, Hassan & Iida 2016: 1039-1050). Sites such as GitHub, SourceForge and Google Code support version control (Yu, Mishra & Mishra 2014: 457-466). The services provided by these sites include archiving, online code browsing, bug trackers, version downloads and web hosting. Companies that do not have the resources to manage their own servers use the version control services provided by these hosting sites. The Source Code Control System (SCCS) and the Revision Control System (RCS) were created in the 1970s and 1980s respectively. The SCCS and RCS tools store file versions, while subsequent systems also allowed remote, mainly centralised, storage of file releases (Cochez, Isomottonen, Tirronen & Itkonen 2013: 210).
A multi-sited version control system is a distributed VCS that is administered at different locations to coordinate the development work of the many people who collaborate to build a single piece of software. CVS used to be the most popular open source version control system, but has been surpassed by Git and Subversion. The Concurrent Versions System (CVS) and Subversion (SVN) are common centralised systems. In distributed version control systems (DVCS), individual users have local copies of the repository, which can be synchronised with other copies. Git and Mercurial use this kind of decentralised system.

CHAPTER 1 INTRODUCTION
1.1 Background
1.2 Software defects
1.3 Software quality management
1.4 Software testing
1.5 Software fault tolerance
1.6 Software product line and versioning
1.7 Testing software product lines
1.8 Problem statement
1.9 Research questions
1.10 Research objectives
1.11 Research methodology
1.12 Limitations of the study
1.13 Thesis outline
1.14 Chapter summary
CHAPTER 2 BACKGROUND
2.1 Introduction
2.2 Data sources
2.3 Defect prediction approaches
2.4 Software metrics
2.5 Machine learning
2.6 Literature review
2.7 Chapter summary
CHAPTER 3 EXPERIMENTAL DESIGN AND METHODOLOGY
3.1 Introduction
3.2 Research
3.3 Research experiment
3.4 Chapter Summary
CHAPTER 4 INFORMATION THEORY
4.1 Introduction
4.2 Shannon’s entropy and information theory
4.3 Information theory measures
4.4 Chapter summary
CHAPTER 5 FEATURE SELECTION
5.1 Introduction
5.2 Feature selection
5.3 Feature relevance and redundancy
5.4 Feature weighting
5.5 Feature ranking
5.6 Discretisation of attributes
5.7 Feature selection processes
5.8 Feature extraction methods
5.9 Feature selection methods
5.10 Chapter summary
CHAPTER 6 PREDICTION MODEL EVALUATION
6.1 Introduction
6.2 Statistical comparison of classification algorithms
6.3 Classification results
6.4 Threats to validity
6.5 Chapter summary
CHAPTER 7 DISCUSSION AND CONCLUSION
7.1 Introduction
7.2 Discussion
7.3 Contribution to knowledge
7.4 Limitations of the study
7.5 Conclusion
7.6 Future work
REFERENCES