Experimental approaches for disease gene identification

Regulatory network inference

Chapter 3 is dedicated to the problem of regulatory network inference. As explained in section 1.1.1, living cells are the product of gene expression programs that involve the regulated transcription of thousands of genes. Elucidating transcriptional regulatory networks is thus needed to understand how the cell works, and can for example be useful for the discovery of novel therapeutic targets. Although several methods have been proposed to infer gene regulatory networks from gene expression data, a recent comparison on a large-scale benchmark revealed that most current methods predict only a limited number of known regulations at a reasonable precision level. A second contribution of this thesis is SIRENE, a new method for the inference of gene regulatory networks from a compendium of expression data. The method decomposes gene regulatory network inference into a large number of local binary classification problems, each of which focuses on separating the target genes of one TF from unlabeled genes. SIRENE requires no particular assumption about the data and introduces a new inference paradigm based on a biologically meaningful idea: if a gene A is regulated by a TF B, then a gene A' similar to gene A is likely to be regulated by TF B as well. SIRENE is thus conceptually simple and computationally efficient. We have tested it on a benchmark experiment aimed at predicting regulations in E. coli, and shown that it retrieves on the order of six times more known regulations than other state-of-the-art inference methods.
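To make the local decomposition concrete, here is a minimal sketch of the per-TF classification scheme, assuming an expression matrix X of shape (genes, conditions) stored as a NumPy array, a dictionary known_targets mapping each TF to the indices of its known target genes, and scikit-learn's SVC as the classifier. The function name and data layout are illustrative, not the actual SIRENE implementation.

import numpy as np
from sklearn.svm import SVC

def sirene_scores(X, known_targets):
    # One local binary classifier per TF: known targets are positives,
    # all other genes are treated as negatives (unlabeled).
    # Assumes each TF has at least one known target and not all genes are targets.
    n_genes = X.shape[0]
    scores = {}
    for tf, targets in known_targets.items():
        y = -np.ones(n_genes)             # unlabeled genes used as provisional negatives
        y[list(targets)] = 1.0            # known targets of this TF are the positives
        clf = SVC(kernel="rbf", C=1.0)
        clf.fit(X, y)
        # Higher score = expression profile more similar to known targets of tf,
        # hence more likely to be regulated by tf.
        scores[tf] = clf.decision_function(X)
    return scores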

Identification of disease genes with PU learning

Finally, chapter 4 deals with the application of PU learning to the disease gene identification problem. Recall that elucidating the genetic basis of human diseases is a central goal of genetics and molecular biology. While traditional linkage analysis and modern high-throughput techniques often provide long lists of tens or hundreds of candidate disease genes, identifying the true disease genes among these candidates remains time-consuming and expensive. As stressed in section 1.1.2.2, efficient computational methods are therefore needed to prioritize genes within the list of candidates by exploiting the wealth of information available about the genes in various databases.
As a third contribution of this thesis, we propose ProDiGe, a novel algorithm for Prioritization of Disease Genes. ProDiGe implements a machine learning strategy based on learning from positive and unlabeled examples, which makes it possible to integrate various sources of information about the genes, to share information about known disease genes across diseases, and to perform genome-wide searches for new disease genes. Experiments on real data show that ProDiGe outperforms state-of-the-art methods for the prioritization of genes in human diseases.
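For illustration only, the following sketch shows the kind of kernel-based data fusion and PU-style ranking such a strategy relies on: per-source gene kernels are averaged into a single kernel, and candidate genes are ranked by an SVM trained with known disease genes as positives and the remaining genes as provisional negatives. The function names, the unweighted kernel average, and the single-disease setting are simplifying assumptions, not the actual ProDiGe algorithm.

import numpy as np
from sklearn.svm import SVC

def fuse_kernels(kernels):
    # Unweighted average of precomputed gene-by-gene kernel matrices,
    # one per data source (e.g. expression, sequence, annotations).
    return sum(kernels) / len(kernels)

def rank_candidates(K, positive_idx, candidate_idx):
    # K: fused (n_genes, n_genes) kernel matrix.
    # positive_idx: indices of known disease genes; candidate_idx: genes to rank.
    n = K.shape[0]
    y = -np.ones(n)
    y[list(positive_idx)] = 1.0
    clf = SVC(kernel="precomputed", C=1.0)
    clf.fit(K, y)
    # decision_function expects test-by-train kernel rows.
    scores = clf.decision_function(K[candidate_idx, :])
    order = np.argsort(-scores)
    return [candidate_idx[i] for i in order]   # best-ranked candidates first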

Bagging SVM for transductive PU learning

We now consider the situation where the goal is only to assign a score to the elements of U reflecting our confidence that these elements belong to the positive class. Liu et al. [2002] studied this same problem, which they call “partially supervised classification”. Their technique combines Naive Bayes classification with the Expectation-Maximization algorithm to iteratively produce classifiers, whose training scores are then used directly to rank U. Following this approach, a straightforward solution to the transductive PU learning problem is to train any classifier to discriminate between P and U and to use this classifier to assign a score to the very unlabeled data that were used to train it. With SVMs, this amounts to using the biased SVM training scores. We will subsequently refer to this approach as the transductive biased SVM.
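As a rough sketch, and assuming NumPy arrays P and U of positive and unlabeled examples together with scikit-learn's SVC, the transductive biased SVM baseline can be written as follows; the asymmetric class weights standing in for the bias are illustrative values.

import numpy as np
from sklearn.svm import SVC

def transductive_biased_svm(P, U, c_pos=10.0, c_unl=1.0):
    # P: (n_pos, d) positive examples; U: (n_unl, d) unlabeled examples.
    X = np.vstack([P, U])
    y = np.hstack([np.ones(len(P), dtype=int), -np.ones(len(U), dtype=int)])
    # Heavier penalty on misclassified positives than on the unlabeled "negatives".
    clf = SVC(kernel="rbf", C=1.0, class_weight={1: c_pos, -1: c_unl})
    clf.fit(X, y)
    # Training scores of the unlabeled points, used directly as the ranking of U.
    return clf.decision_function(U)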
However, one may argue that assigning a score to an unlabeled example that has been used as a negative training example is problematic. In particular, if the classifier fits the training data too tightly, a false negative x will hardly be given a high training score when it is used as a negative. In a related situation in the context of semi-supervised learning, Zhang et al. [2009] showed for example that unlabeled examples used as negative training examples tend to receive underestimated scores when an SVM is trained with the classical hinge loss. More generally, most theoretical consistency properties of machine learning algorithms only justify predictions on samples outside the training set, which raises questions about scoring all unlabeled samples while simultaneously using them as negative training samples. Alternatively, the inductive bagging PU learning approach lends itself particularly well to the transductive setting, through the procedure described in Algorithm 2. Each time a random subsample Ut of U is generated, a classifier is trained to discriminate P from Ut and used to assign a predictive score to every element of U \ Ut. In the end, the score of any element x ∈ U is obtained by aggregating the predictions of the classifiers trained on subsamples that did not contain x (the counter n(x) simply counts the number of such classifiers). As such, no point of U is used simultaneously to train a classifier and to be scored by it. In practice, it is useful to ensure that each element of U does not appear too often in the subsamples Ut, so that its prediction is averaged over a sufficient number of classifiers.
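A minimal sketch of this transductive bagging procedure, again assuming NumPy arrays P and U and scikit-learn's SVC, could look as follows; the number of rounds T, the subsample size K, and the RBF kernel are illustrative choices, not prescribed by Algorithm 2.

import numpy as np
from sklearn.svm import SVC

def bagging_svm_transductive(P, U, T=100, K=None, seed=0):
    # P: (n_pos, d) positives; U: (n_unl, d) unlabeled; T: number of bagging rounds;
    # K: size of each random subsample Ut of U (here defaulting to |P|).
    rng = np.random.default_rng(seed)
    K = K if K is not None else min(len(P), len(U))
    scores = np.zeros(len(U))
    counts = np.zeros(len(U))              # n(x): number of classifiers for which x was held out
    for _ in range(T):
        idx = rng.choice(len(U), size=K, replace=False)   # random subsample Ut
        X = np.vstack([P, U[idx]])
        y = np.hstack([np.ones(len(P), dtype=int), -np.ones(K, dtype=int)])
        clf = SVC(kernel="rbf", C=1.0).fit(X, y)
        held_out = np.setdiff1d(np.arange(len(U)), idx)   # elements of U \ Ut
        scores[held_out] += clf.decision_function(U[held_out])
        counts[held_out] += 1
    # Average the predictions of the classifiers that did not see each point in training.
    return scores / np.maximum(counts, 1)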

Table of contents:

Remerciements
Abstract
Résumé
1 Context 
1.1 Biological context
1.1.1 Gene regulation
1.1.1.1 General mechanisms
1.1.1.2 Why study gene regulation?
1.1.1.3 Experimental characterization
1.1.1.4 In silico inference
1.1.2 The disease gene hunting problem
1.1.2.1 Disease gene discovery
1.1.2.2 Experimental approaches for disease gene identification
1.1.3 Data resources
1.1.3.1 Transcriptomics data
1.1.3.2 Subcellular localization data
1.1.3.3 Sequence data
1.1.3.4 Annotation data
1.1.3.5 Data fusion
1.2 Machine learning context
1.2.1 Learning from data
1.2.1.1 Unsupervised versus supervised learning
1.2.1.2 The bias-variance trade-off
1.2.1.3 Some variants of supervised learning
1.2.2 The Support Vector Machine
1.2.2.1 A geometrical intuition
1.2.2.2 Soft margin SVMs
1.2.2.3 Non-linear SVMs
1.2.3 Kernel methods
1.2.3.1 Motivations
1.2.3.2 Definitions
1.2.3.3 The kernel trick
1.2.3.4 A kernelized SVM
1.2.3.5 Data fusion with kernels
1.3 Contributions of this thesis
1.3.1 A bagging SVM to learn from positive and unlabeled examples
1.3.2 Regulatory network inference
1.3.3 Identification of disease genes with PU learning
2 A bagging SVM for PU learning 
2.1 Résumé
2.2 Introduction
2.3 Related work
2.4 Bagging for inductive PU learning
2.5 Bagging SVM for transductive PU learning
2.6 Experiments
2.6.1 Simulated data
2.6.2 Newsgroup dataset
2.6.3 E. coli dataset: inference of transcriptional regulatory network
2.7 Discussion
3 Regulatory network inference 
3.1 Résumé
3.2 Introduction
3.3 System and Methods
3.3.1 SIRENE
3.3.2 SVM
3.3.3 Choice of negative examples
3.3.4 CLR
3.3.5 Experimental protocol
3.4 Data
3.5 Results
4 Prioritization of disease genes 
4.1 Résumé
4.2 Introduction
4.3 Related work
4.4 Methods
4.4.1 The gene prioritization problem
4.4.2 Gene prioritization for a single disease and a single data source
4.4.3 Gene prioritization for a single disease and multiple data sources
4.4.4 Gene prioritization for multiple diseases and multiple data sources
4.5 Data
4.5.1 Gene features
4.5.2 Disease features
4.5.3 Disease gene information
4.6 Results
4.6.1 Experimental setting
4.6.2 Gene prioritization without sharing of information across diseases
4.6.3 Gene prioritization with information sharing across diseases
4.6.4 Is sharing information across diseases beneficial?
4.6.5 Predicting causal genes for orphan diseases
4.7 Discussion
Conclusion
Bibliography
