Test Amplification For Behavioral Changes Detection Of Commits 


Amplification by Adding New Tests as Variants of Existing Ones

The most intuitive form of test amplification is to consider an existing test suite, then generate variants of the existing test cases and add those new variants to the original test suite. This kind of test amplification is denoted AMP_add. Definition: a test amplification technique AMP_add consists of creating new tests from existing ones to achieve a given engineering goal. The most commonly targeted engineering goal is to improve coverage according to a coverage criterion. The works listed in this section fall into this category and are grouped according to their main engineering goal.
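To make this concrete, the following JUnit snippet sketches what an AMP_add technique could add to a test suite: an existing, manually written test and a generated variant of it in which an input literal has been changed and an extra assertion added. The class under test (java.util.Stack) and the variant itself are purely illustrative and not taken from any of the cited works.

```java
import static org.junit.Assert.assertEquals;

import java.util.Stack;
import org.junit.Test;

public class StackAmplificationExample {

    // Existing, manually written test.
    @Test
    public void pushThenPeek() {
        Stack<Integer> stack = new Stack<>();
        stack.push(42);
        assertEquals(Integer.valueOf(42), stack.peek());
    }

    // Variant added by an AMP_add technique: the input literal is changed
    // and an additional observation (the size of the stack) is asserted.
    @Test
    public void pushThenPeek_amplified() {
        Stack<Integer> stack = new Stack<>();
        stack.push(-1);
        assertEquals(Integer.valueOf(-1), stack.peek());
        assertEquals(1, stack.size());
    }
}
```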

Coverage or Mutation Score Improvement

Baudry et al. [Baudry 2005b] [Baudry 2005a] improve the mutation score of an existing test suite by generating variants of existing tests through the application of specific transformations of the test cases. They iteratively run these transformations and propose an adaptation of genetic algorithms (GA), called a bacteriological algorithm (BA), to guide the search for test cases that kill more mutants. The results demonstrate the ability of search-based amplification to significantly increase the mutation score of a test suite. They evaluated their approach on two case studies, which are .NET classes. The evaluation shows promising results; however, the results have little external validity since only two classes are considered.

Tillmann and Schulte [Tillmann 2006] describe a technique that generalizes existing unit tests into parameterized unit tests. The basic idea behind this technique is to refactor the unit test by replacing the concrete values that appear in the body of the test with parameters, which is achieved through symbolic execution. The evaluation of their technique has been conducted on 5 .NET classes.

The problem of generalizing unit tests into parameterized unit tests is also studied by Thummalapenta et al. [Marri 2010]. Their empirical study shows that unit test generalization can be achieved with feasible effort and can bring the benefit of additional code coverage. They evaluated their approach on 3 applications ranging from 1,600 to 6,200 lines of code. The results show an increase in branch coverage and a slight increase in the bug-detection capability of the test suite.
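As an illustration of this generalization, the snippet below shows a concrete unit test next to the parameterized unit test that could be derived from it: the concrete literal becomes a parameter and the expected value is expressed relative to that parameter, so that a generator can later instantiate it with many inputs. The example is written in plain Java for readability and is not taken from the cited works, which target .NET.

```java
import static org.junit.Assert.assertEquals;

import java.util.ArrayList;

public class ParameterizedUnitTestExample {

    // Original concrete test: the input is a fixed literal.
    public void addThenGet_concrete() {
        ArrayList<Integer> list = new ArrayList<>();
        list.add(7);
        assertEquals(Integer.valueOf(7), list.get(0));
    }

    // Generalized, parameterized version: the literal 7 becomes a
    // parameter and the oracle is stated relative to that parameter.
    // A test generator is expected to instantiate 'value' with many inputs.
    public void addThenGet_parameterized(int value) {
        ArrayList<Integer> list = new ArrayList<>();
        list.add(value);
        assertEquals(Integer.valueOf(value), list.get(0));
    }
}
```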
To improve the cost efficiency of the test generation process, Yoo and Harman [Yoo 2012] propose a technique for augmenting the input space coverage of the existing tests with new tests. The technique is based on four transformations of the numerical values in test cases: shifting (x → x + 1 and x → x − 1) and data scaling (multiplying or dividing the value by 2). In addition, they employ a hill-climbing algorithm whose budget is expressed as a number of fitness function evaluations, where the fitness is the Euclidean distance between two input points in a numerical space. The empirical evaluation shows that the technique can achieve better coverage than some test generation methods that generate tests from scratch. The approach has been evaluated on the triangle problem, as well as on two specific methods from two large and complex libraries.
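A minimal sketch of these four transformations and of the distance used as fitness is given below. Neighbour generation of this kind is the step a hill climber would iterate over, keeping the variant that improves the fitness; class and method names are illustrative and do not come from the cited work.

```java
import java.util.ArrayList;
import java.util.List;

public class NumericNeighbours {

    // Apply the four transformations (shift by +/-1, scale by *2 and /2)
    // to every numerical value of a test input, producing neighbour inputs.
    public static List<int[]> neighbours(int[] input) {
        List<int[]> result = new ArrayList<>();
        for (int i = 0; i < input.length; i++) {
            for (int variant = 0; variant < 4; variant++) {
                int[] copy = input.clone();
                switch (variant) {
                    case 0: copy[i] = copy[i] + 1; break; // shift up
                    case 1: copy[i] = copy[i] - 1; break; // shift down
                    case 2: copy[i] = copy[i] * 2; break; // scale up
                    case 3: copy[i] = copy[i] / 2; break; // scale down
                }
                result.add(copy);
            }
        }
        return result;
    }

    // Euclidean distance between two input points, the basis of the
    // fitness function mentioned above.
    public static double distance(int[] a, int[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```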
To maximize code coverage, Bloem et al. [Bloem 2014] propose an approach that alters existing tests to obtain new tests that enter new terrain, i.e., uncovered parts of the program. The approach first analyzes the coverage of the existing tests, and then selects all test cases that pass a yet-uncovered branch in the target function. Finally, the approach investigates the path conditions of the selected test cases one by one to derive a new test that covers a previously uncovered branch. To vary the path conditions of existing tests, the approach uses symbolic execution and model checking techniques. A case study has shown that the approach can achieve 100% branch coverage fully automatically. They first evaluate their prototype implementation on two open-source examples and then present a case study on a real industrial program, a Java Card applet firewall. For the real program, they applied their tool to 211 test cases and produced 37 new test cases that increase code coverage. The diversity of this benchmark allows a first generalization of the results.
Rojas et al. [Rojas 2016] have investigated several seeding strategies for the test generation tool EvoSuite. Traditionally, EvoSuite generates unit test cases from scratch. In this context, seeding consists in feeding EvoSuite with initial material from which the automatic generation process can start. The authors evaluate different sources for the seeds: constants in the program, dynamic values, concrete types, and existing test cases. In the latter case, seeding is analogous to amplification. The experiments with 28 projects from the Apache Commons repository show a 2% improvement in code coverage, on average, compared to generation from scratch. The evaluation based on Apache artifacts is stronger than most related work, because Apache artifacts are known to be complex and well tested.
Patrick and Jia [Patrick 2017] propose Kernel Density Adaptive Random Testing (KD-ART) to improve the effectiveness of random testing. This technique takes advantage of run-time test execution information to generate new test inputs. It first applies Adaptive Random Testing (ART) to generate diverse values uniformly distributed over the input space. Then, it uses Kernel Density Estimation to estimate the distribution of the values found to be useful, i.e., in this case, values that increase the mutation score of the test suite. KD-ART can intensify the existing values by generating inputs close to the ones observed to be most useful, or diversify the current inputs by using the ART approach. The authors explore the trade-off between diversification and intensification on a benchmark of eight C programs. They achieve an 8.5% higher mutation score than ART for programs that have simple numeric input parameters, but their approach does not show a significant increase for programs with composite inputs. The technique detects mutants 15.4 times faster than ART on average.

Instead of operating at the granularity of complete test cases, Yoshida et al. [Yoshida 2016] propose a novel technique for automated and fine-grained incremental generation of unit tests through minimal augmentation of an existing test suite. Their tool, FSX, treats each part of the existing tests, including the test driver, test input data, and oracles, as "test intelligence", and attempts to create tests for uncovered test targets by copying and minimally modifying existing tests wherever possible. To achieve this, the technique uses iterative, incremental refinement of test drivers and symbolic execution. They evaluated FSX on four benchmarks, from 5K to 40K lines of code. This evaluation is adequate and suggests that FSX's results can be generalized.
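The intensification/diversification choice at the heart of KD-ART can be sketched as follows for a one-dimensional numeric input. Diversification follows the ART idea of picking, among random candidates, the one farthest from the inputs already executed, while intensification samples close to values previously observed to be useful; a Gaussian perturbation is used here as a simple stand-in for the kernel density estimate. All names are illustrative, not the authors' implementation.

```java
import java.util.List;
import java.util.Random;

public class KdArtSketch {

    private static final Random RNG = new Random();

    // Diversify: among random candidates, choose the one with the largest
    // minimum distance to the inputs already executed (ART-style).
    public static double diversify(List<Double> executed, int candidates, double lo, double hi) {
        double best = lo;
        double bestScore = -1;
        for (int i = 0; i < candidates; i++) {
            double candidate = lo + RNG.nextDouble() * (hi - lo);
            double minDist = Double.MAX_VALUE;
            for (double e : executed) {
                minDist = Math.min(minDist, Math.abs(candidate - e));
            }
            if (minDist > bestScore) {
                bestScore = minDist;
                best = candidate;
            }
        }
        return best;
    }

    // Intensify: sample near a value that was previously useful
    // (e.g. that killed a mutant), using a Gaussian kernel.
    public static double intensify(List<Double> usefulValues, double bandwidth) {
        double center = usefulValues.get(RNG.nextInt(usefulValues.size()));
        return center + RNG.nextGaussian() * bandwidth;
    }
}
```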

Fault Detection Capability Improvement

Starting from the source code of test cases, Harder et al. [Harder 2003] propose an approach that dynamically generates new test cases with good fault detection ability. A generated test case is kept only if it adds new information to the specification. They define "new information" as providing new data for mining invariants with Daikon, hence producing new or modified invariants. What is unique in this work is the augmentation criterion: helping an invariant inference technique. They evaluated the approach, which relies on Daikon, on a benchmark of 8 C programs that vary from 200 to 10K lines of code. It is left to future work to evaluate the approach on a real and large software application.

Pezzè et al. [Pezze 2013] observe that method calls are used as the atoms to construct test cases for both unit and integration testing, and that most of the code in integration test cases appears in the same or similar form in unit test cases. Based on this observation, they propose an approach that uses the information provided in unit test cases about object creation and initialization to build composite test cases that focus on testing the interactions between objects. The evaluation results show that the approach can reveal new interaction faults even in well-tested applications.
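The augmentation criterion of Harder et al. can be sketched as follows: a candidate test is kept only if re-running invariant inference on the extended suite yields a different operational abstraction. The invariant detector interface below is purely hypothetical and stands in for a tool such as Daikon; tests are represented as opaque strings for brevity.

```java
import java.util.List;
import java.util.Set;

public class OperationalDifference {

    // Hypothetical stand-in for an invariant detector such as Daikon.
    interface InvariantDetector {
        Set<String> inferInvariants(List<String> tests);
    }

    // Keep a candidate only if it changes the inferred invariants,
    // i.e. if it adds new information to the specification.
    public static List<String> augment(List<String> suite,
                                       List<String> candidates,
                                       InvariantDetector detector) {
        Set<String> current = detector.inferInvariants(suite);
        for (String candidate : candidates) {
            List<String> extended = new java.util.ArrayList<>(suite);
            extended.add(candidate);
            Set<String> updated = detector.inferInvariants(extended);
            if (!updated.equals(current)) {
                suite.add(candidate);
                current = updated;
            }
        }
        return suite;
    }
}
```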
Writing web tests manually is time-consuming, but it gives developers the advantage of gaining domain knowledge. In contrast, most web test generation techniques are automated and systematic, but lack the domain knowledge required to be as effective. In light of this, Milani Fard et al. [Milani Fard 2014] propose an approach that combines the advantages of the two. The approach first extracts knowledge, such as event sequences and assertions, from the human-written tests, and then combines this knowledge with the power of automated crawling. It has been shown that the approach can effectively improve the fault detection rate of the original test suite. They conducted an empirical evaluation on four large open-source JavaScript systems.


Oracle Improvement

Pacheco and Ernst implement a tool called Eclat [Pacheco 2005b], which aims to help the tester with the difficult task of creating effective new test inputs with constructed oracles. Eclat first uses the execution of some available correct runs to infer an operational model of the software. Using this operational model, Eclat then employs a classification-guided technique to generate new test inputs. Next, Eclat reduces the number of generated inputs by selecting only those that are most likely to reveal faults. Finally, Eclat automatically adds an oracle, derived from the operational model, for each remaining test input. They evaluated their approach on 6 small programs and compared Eclat's results to those of JCrasher, a state-of-the-art tool with the same goal as Eclat. In their experiments, they report that Eclat performs better than JCrasher: Eclat reveals 1.1 faults on average against 0.02 for JCrasher.
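The classification step that guides Eclat's input selection can be roughly sketched as follows, assuming the inferred operational model is represented as simple predicates over an input: inputs violating precondition-like properties are classified as illegal uses, while legal inputs that crash or violate properties of correct runs become candidate fault-revealing tests. This representation and all names below are illustrative, not Eclat's actual API.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

public class OperationalModelClassifier {

    enum Verdict { ILLEGAL, NORMAL, CANDIDATE_FAULT_REVEALING }

    // Classify a candidate input against an operational model inferred
    // from correct runs, modeled here as two lists of predicates.
    static <T> Verdict classify(T input,
                                List<Predicate<T>> preconditionLikeProperties,
                                List<Predicate<T>> correctRunProperties,
                                Function<T, Boolean> runsWithoutCrash) {
        // Violating a precondition-like property suggests an improper use
        // of the software rather than a fault.
        for (Predicate<T> pre : preconditionLikeProperties) {
            if (!pre.test(input)) {
                return Verdict.ILLEGAL;
            }
        }
        // A legal input that crashes, or violates a property that held on
        // all correct runs, is a candidate fault-revealing test input.
        if (!runsWithoutCrash.apply(input)) {
            return Verdict.CANDIDATE_FAULT_REVEALING;
        }
        for (Predicate<T> property : correctRunProperties) {
            if (!property.test(input)) {
                return Verdict.CANDIDATE_FAULT_REVEALING;
            }
        }
        // Otherwise the input only exercises already-modeled behavior.
        return Verdict.NORMAL;
    }
}
```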
Given that some test generation techniques only produce sequences of method calls but no oracles for these calls, Fraser and Zeller [Fraser 2011c] propose an approach to generate parameterized unit tests containing symbolic pre- and post-conditions. Starting from concrete inputs and results, the technique uses test generation and mutation to systematically generalize pre- and post-conditions. Evaluation results on five large open-source libraries show that the approach can successfully generalize a concrete test to a parameterized unit test, which is more general and expressive, needs fewer computation steps, and achieves a higher code coverage than the original concrete test. According to their observations, this technique is more expensive than simply generating unit test cases.

Debugging Effectiveness Improvement

Baudry et al. [Baudry 2006] propose the test-for-diagnosis criterion (TfD) to evaluate the fault localization power of a test suite, and identify an attribute called Dynamic Basic Block (DBB) to characterize this criterion. A Dynamic Basic Block contains the set of statements that are executed by exactly the same test cases, which implies that all statements in the same DBB are indistinguishable for fault localization. Using an existing test suite as a starting point, they apply a search-based algorithm to optimize the test suite with new tests so that the test-for-diagnosis criterion is satisfied. They evaluated their approach on two programs: a toy program and a server that simulates business meetings over the network. These two programs are less than 2K lines of code each, which can be considered small.

Rößler et al. [Rößler 2012] propose BugEx, which leverages test case generation to systematically isolate failure causes. The approach takes a single failing test as input and generates additional passing or failing tests that are similar to the failing test. Then, the approach runs these tests and captures the differences between the runs in terms of observed facts that are likely related to the pass/fail outcome. Finally, these differences are statistically ranked and a ranked list of facts is produced. In addition, more test cases are generated to confirm or refute the relevance of a fact. It has been shown that for six out of seven real-life bugs, the approach can accurately pinpoint important failure-explaining facts. To evaluate BugEx, they use 7 real-life case studies from 68 to 62K lines of code. The small number of considered bugs, 7, calls for more research to improve external validity.

Yu et al. [Yu 2013] aim at enhancing fault localization in the scenario where no appropriate test suite is available to localize the encountered fault. They propose a mutation-oriented test case augmentation technique that is capable of generating test suites with better fault localization capabilities. The technique uses mutation operators to iteratively mutate some existing failing tests and derive new test cases that are potentially useful to localize the specific encountered fault.

Similarly, to increase the chance of executing a specific path during crash reproduction, Xuan et al. [Xuan 2015] propose an approach based on test case mutation. The approach first selects relevant test cases based on the stack trace in the crash, then eliminates the assertions in the selected test cases, and finally uses a set of predefined mutation operators to produce new test cases that can help to reproduce the crash. They evaluated MuCrash on 12 bugs of Apache Commons Collections, which comprises 26 KLoC of source code and 29 KLoC of test code. The subject program is quite large and open-source, which increases confidence, but using a single subject is a threat to generalizability.
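To make the Dynamic Basic Block notion defined above concrete, the small sketch below groups statements by the exact set of test cases that execute them, given a hypothetical boolean coverage matrix; each resulting group is one DBB, i.e., a set of statements that fault localization cannot tell apart. The names are illustrative. Intuitively, adding a new test that splits a large DBB into smaller ones is what improves the diagnosis power targeted by the test-for-diagnosis criterion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DynamicBasicBlocks {

    // coverage[s][t] is true when test t executes statement s.
    // Returns one list of statement indexes per Dynamic Basic Block.
    public static List<List<Integer>> computeDbbs(boolean[][] coverage) {
        Map<String, List<Integer>> groups = new HashMap<>();
        for (int statement = 0; statement < coverage.length; statement++) {
            // Statements sharing the same signature of covering tests
            // belong to the same DBB.
            String signature = Arrays.toString(coverage[statement]);
            groups.computeIfAbsent(signature, k -> new ArrayList<>()).add(statement);
        }
        return new ArrayList<>(groups.values());
    }
}
```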

Table of Contents

List of Figures
List of Tables
1 Introduction 
1.1 Scientific Problem
1.2 Thesis Contributions
1.3 STAMP project
1.4 Publications
1.5 Software Developed During This Thesis
2 State of the Art 
2.1 Approach
2.1.1 Definition
2.1.2 Methodology
2.2 Amplification by Adding New Tests as Variants of Existing Ones
2.2.1 Coverage or Mutation Score Improvement
2.2.2 Fault Detection Capability Improvement
2.2.3 Oracle Improvement
2.2.4 Debugging Effectiveness Improvement
2.2.5 Summary
2.3 Amplification by Synthesizing New Tests with Respect to Changes
2.3.1 Search-based vs. Concolic Approaches
2.3.2 Finding Test Conditions in the Presence of Changes
2.3.3 Other Approaches
2.3.4 Summary
2.4 Amplification by Modifying Test Execution
2.4.1 Exception Handling Validation
2.4.2 Other Approaches
2.4.3 Summary
2.5 Amplification by Modifying Existing Test Code
2.5.1 Input Space Exploration
2.5.2 Oracle Improvement
2.5.3 Purification
2.5.4 Summary
2.6 Analysis
2.6.1 Aggregated View
2.6.2 Technical Aspects
2.6.3 Tools for Test Amplification
2.7 Conclusion
3 DSpot: A Test Amplification Technique 
3.1 Definitions
3.2 Overview
3.2.1 Principle
3.2.2 Input & Output
3.2.3 Workflow
3.2.4 Test Method Example
3.3 Algorithm
3.3.1 Input Space Exploration Algorithm
3.3.2 Assertion Improvement Algorithm
3.3.3 Pseudo-algorithm
3.3.4 Flaky tests elimination
3.4 Implementation
3.5 Conclusion
4 Test Amplification For Artificial Behavioral Changes Detection Improvement 
4.1 Mutation score as test-criterion
4.2 Experimental Protocol
4.2.1 Research Questions
4.2.2 Dataset
4.2.3 Test Case Selection Process
4.2.4 Metrics
4.2.5 Methodology
4.3 Experimental Results
4.3.1 Answer to RQ1
4.3.2 Answer to RQ2
4.3.3 Answer to RQ3
4.3.4 Answer to RQ4
4.4 Threats to Validity
4.5 Conclusion
5 Test Amplification For Behavioral Changes Detection Of Commits 
5.1 Motivation & Background
5.1.1 Motivating Example
5.1.2 Applicability
5.1.3 Behavioral Change
5.1.4 Behavioral Change Detection
5.2 Behavioral Change Detection Approach
5.2.1 Overview of DCI
5.2.2 Test Selection and Diff Coverage
5.2.3 Test Amplification
5.2.4 Execution and Change Detection
5.2.5 Implementation
5.3 Evaluation
5.3.1 Research Questions
5.3.2 Benchmark
5.3.3 Protocol
5.3.4 Results
5.4 Discussion about the scope of DCI
5.5 Threats to validity
5.6 Conclusion
6 Transversal Contributions 
6.1 Study of Program Correctness
6.1.1 Problem Statement
6.1.2 ATTRACT Protocol
6.1.3 Evaluation
6.1.4 Discussion
6.2 Study of Pseudo-tested Methods
6.2.1 Problem Statement
6.2.2 Definition and Implementation
6.2.3 Evaluation
6.2.4 Discussion
6.3 Study of Test Generation for Repair
6.3.1 Problem Statement
6.3.2 UnsatGuided Technique
6.3.3 Evaluation
6.3.4 Discussion
6.4 Conclusion
7 Conclusion 
7.1 Contribution Summary
7.1.1 DSpot
7.1.2 Automatic Test Amplification For Mutation Score
7.1.3 Automatic Test Amplification For Behavioral Changes Detection
7.1.4 Transversal Contributions
7.2 Short-term Perspectives
7.2.1 Prettifying Amplified Test Methods
7.2.2 Collecting Developer Feedback
7.3 Long-term Perspectives
Bibliography
