An integer linear programming formulation for the Parsimonious Loss of Heterozygosity Problem (PLOHP)

Get Complete Project Material File(s) Now! »

Accounting for perfect phylogeny constraints

In some practical situations it may turn out useful to know not only the haplotype set that resolves a given set of genotypes, but also how the familial relationships among members of a population of indi-viduals reflect on the estimated haplotypes. This version of PPH is known in the literature as the Minimum Perfect Phylogeny Haplotyping problem (MPPH) problem and was first introduced by Bafna et al. [2004]. MPPH requires to find a minimum set of haplotypes resolving the in-put genotypes and forming a perfect phylogeny, i.e., a haplotype set H that does not contain the following matrix [Brown et al. 2006; Pe’er et al. 2004]:

Implementation

The results have been obtained by implementing the CRM for PPH-UD in Mosel 2.0 of Xpress-MP, Optimizer version 18, running on a Pen-tium 4, 3.2 GHz, equipped with 2 GByte RAM and operating system Gentoo release 7 (kernel linux 2.6.17). In our experiments we acti-vated the Xpress-MP Optimizer automatic cuts, the Xpress-MP pre-solving strategy, and used the Xpress-MP primal heuristic to generate the first upper bound.

Benchmark instances

As in Catanzaro et al. [2010b], we used the standard Brown et al. [2006]’s datasets for testing the performances of our model. Specifically, through Hudson’s MS program [Hudson 2002], Brown and Harrower created two families of datasets (called the uniform and nonuniform datasets) by randomly pairing the resulting haplotypes. The distinction in the two simulated methods comes in how the random pairing is per-formed. In the uniform datasets the haplotypes are randomly paired by sampling uniformly from the set of distinct haplotypes. In the nonuniform datasets the haplotypes are sampled uniformly from the collection of haplotypes generated by the coalescent process. In this collection, haplotypes may not be unique, so some haplotypes are sampled with higher frequency than others. Both the uniform and non-uniform datasets consist of collections of 30 or 50 genotypes hav-ing 10, 30, 50, 75 or 100 SNPs each. Each dataset contains a number of instances variable between 15 and 50. The authors also considered biological data from chromosomes 10 and 21, over all four Hap-Map [Consortium 2003] populations. For each input the authors selected sequences having 30, 50, and 75 SNPs, respectively, giving a total of 8 datasets consisting of 3 instances each. Brown and Harrower’s datasets are not subjected to uncertainty, for this reason we considered a three sets of random generated error mask matrices having an error ratio (i.e., the number of entries equal to 1 divided by mp) equal to 5%, 10%, and 15% respectively. Brown et al. [2006]’s datasets and the error mask matrices used in our experiments can be downloaded at the address: http://homepages.ulb.ac.be/~lporrett/PPHUD/PPHerr. zip.
In Tables 1-3 we show the performances of the CRM for PPH-UD under different error ratios. Specifically, the columns of Tables 1-3 ev-idence the average, the maximum, and the minimum of: the solution time, the gap (i.e., the difference between the optimal value found and the value of linear relaxation at the root node of the search tree, divided by the optimal value), and the number of nodes expanded in each group of instances belonging to a given dataset.

READ Concepts of coloniality of knowledge, being and power

Computational performances

In order to obtain a qualitative measure of the running time perfor-mances of the CRM for PPH-UD, we compared the numerical results of the model with the corresponding ones of the CRM for PPH (RM version, see Catanzaro et al. [2010b] running on the same datasets in absence of uncertainty. The performances of the CRM for PPH are shown in Table 4. Moreover, in order to obtain a qualitative measure of the accuracy of the CRM for PPH-UD, we compared the optimal so-lutions provided by CRM for PPH in absence of uncertainty with the corresponding ones provided by CRM for PPH-UD. Specifically, fixed a generic instance of PPH-UD, we computed the accuracy of the CRM for PPH as the ratio between the number of matching haplotypes in the solutions provided by both the CRM for PPH-UD and the CRM for PPH divided the overall number of haplotypes in the solution provided by CRM for PPH. The accuracy of the CRM for PPH-UD under increasing er-ror ratios it is shown in Table 5. For sake of notation, in the following subsections we shall denote CRM1 and CRM2 as the CRM for PPH and the CRM for PPH-UD, respectively.

Table of contents :

1 introduction
1.1 Motivation
1.2 Genome-Wide Association Studies
1.3 Combinatorial Optimization
1.4 Outline of the Thesis
2 a class representative model for pph-ud
2.1 Notation And Problem Statement
2.2 Design and Implementation
2.2.1 Reducing model size
2.2.2 Accounting for perfect phylogeny constraints
2.3 Experimental results
2.3.1 Implementation
2.3.2 Benchmark instances
2.3.3 Computational performances
3 a branch-and-price algorithm for the plohp
3.1 Notation and Problem Formulation
3.2 A Branch-and-Price algorithm for the PLOHP
3.2.1 An integer linear programming formulation for the Parsimonious Loss of Heterozygosity Problem (PLOHP)
3.2.2 Finding a maximum node-weighted clique in a Max-Point Tolerance Graph (MPTG)
3.2.3 Initializing the columns of the RMP
3.2.4 Branching rules
3.2.5 Preprocessing
3.3 Experimental Results
3.3.1 Implementation
3.3.2 Calibrating the pricing oracle
3.3.3 Benchmark instances
3.3.4 Computational performances
3.3.5 Computational performances on larger datasets of the PLOHP
4 an ip formulation of the mipairp
4.1 Notation and Problem Statement
4.2 The ALU Graph
4.3 The complexity of PAIRP and MIPAIRP
4.4 An Integer Programming Model for the MIPAIRP
4.5 Experimental Results
4.5.1 Implementation
4.5.2 Benchmark Instances
4.5.3 Computational performances
5 conclusion
5.1 Contribution
5.2 Perspectives
bibliography