Mapping cancer and disease gene duplications on Ensembl duplication nodes


Input genomes, orthologs and paralogs (6.1A)

To identify ohnologs retained from the 2R-WGD, we used six invertebrate genomes as outgroups: one lancelet (cephalochordate), Amphioxus (Branchiostoma floridae); two tunicates (urochordates), Ciona intestinalis and Ciona savignyi; an echinoderm, sea urchin (Strongylocentrotus purpuratus); and two basal bilaterians, fly (Drosophila melanogaster) and worm (Caenorhabditis elegans). Using these outgroups, we identified ohnologs in six completely sequenced vertebrate (tetrapod) genomes: human (Homo sapiens), chicken (Gallus gallus), dog (Canis lupus familiaris), pig (Sus scrofa), mouse (Mus musculus) and rat (Rattus norvegicus) (Figure 6.2).

Protein coding genes and their genomic coordinates

We limited ourselves to protein coding genes. Except for sea urchin and Amphioxus, the protein coding genes and their genome positions were obtained from Ensembl version 70 [Flicek et al., 2013] using BioMart. Sea urchin and Amphioxus genes and their genome coordinates were downloaded from Ensembl Metazoa [Kersey et al., 2012] and DOE Joint Genome Institute (JGI) [Putnam et al., 2008] respectively. We further excluded genes belonging to unassembled scaffolds or haplotype regions in the vertebrate genomes. Each outgroup and vertebrate genome was then represented by a list of gene identifiers sorted on the basis of their start positions on their respective chromosome.
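The gene-list representation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout `(gene_id, chromosome, start)`, the function name `build_gene_order`, and the example identifiers are hypothetical and are not taken from the Ensembl or JGI downloads.

```python
from collections import defaultdict

def build_gene_order(records, allowed_chromosomes):
    """Return {chromosome: [gene_id, ...]} with genes sorted by start position.

    records: iterable of (gene_id, chromosome, start) tuples (hypothetical layout).
    allowed_chromosomes: names of assembled chromosomes to keep; anything else
    (unassembled scaffolds, haplotype regions) is dropped.
    """
    by_chrom = defaultdict(list)
    for gene_id, chrom, start in records:
        if chrom not in allowed_chromosomes:
            continue  # skip scaffolds and haplotype regions
        by_chrom[chrom].append((start, gene_id))
    # Sort each chromosome by start coordinate and keep only the identifiers.
    return {chrom: [g for _, g in sorted(genes)] for chrom, genes in by_chrom.items()}

# Toy input with made-up identifiers and coordinates.
records = [
    ("G3", "1", 900),
    ("G1", "1", 100),
    ("G2", "1", 500),
    ("G4", "scaffold_42", 50),
    ("G5", "2", 10),
]
gene_order = build_gene_order(records, allowed_chromosomes={"1", "2"})
```

With this representation, a window of neighbouring genes is simply a slice of the per-chromosome list, which is what the window-based synteny scan relies on.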

Identification of synteny blocks and anchors (6.1B)

A synteny block is defined as a pair of regions, either between an outgroup and a vertebrate genome (Figure 6.3 A) or within the same vertebrate genome (Figure 6.3 B), that share multiple homologous gene pairs. Between the genomes of two species, such blocks represent conserved genomic regions descended from their last common ancestor. Within the genome of the same organism, synteny blocks represent duplicated sister regions, provided the duplication time of the genes residing on such blocks is the same. Vertebrate WGDs are among the oldest known genome duplications, and the conservation of gene order, or collinearity, is limited [Putnam et al., 2008]. However, conservation of macro- or content-based synteny can still be observed between genomic regions in which there is a statistical enrichment of orthologs, even after more than 500 million years of independent evolution since the divergence between vertebrates and invertebrates. We used a window-based approach to detect such regions between outgroup and vertebrate genomes, extending earlier similar approaches [Dehal and Boore, 2005; Makino and McLysaght, 2010]. Any two regions between an outgroup and a vertebrate genome were considered to be candidate syntenic regions if there were at least m orthologous gene pairs between them within a window of size W, where $2 \leq m \leq W$. We scanned the genomes of invertebrate and vertebrate organisms by placing a symmetric window around the ortholog genes in each genome, such that there are W/2 genes upstream and W/2 genes downstream (Figure 6.3 A). Hence, the ortholog partner under consideration was at the center of the window in each genome. If there were at least 2 ortholog pairs in this window W, including the central pair, it was considered to be a synteny candidate (a necessary but not sufficient condition). All such blocks were identified genome-wide and were labelled by the ortholog pair at the center of the block, referred to as the anchor (e.g. O7–V7 and O7–V7′; Figure 6.3 A). At the chromosome boundaries, we kept the window size fixed by extending it in the opposite direction, making the window asymmetric around the anchor gene; this avoids biasing the calculation of the synteny P-value described in the next section.
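The window placement and the candidate test described above can be sketched as follows. This is a minimal illustration, not the thesis implementation. It assumes that a window holds W/2 genes on each side of the anchor plus the anchor itself, that each genome is a single list of gene identifiers sorted by position, and that a pair is counted once per outgroup gene with at least one ortholog in the vertebrate window. The names `window`, `is_synteny_candidate` and `orthologs` are hypothetical.

```python
def window(genes, center, w):
    """Return the genes in a window of w // 2 neighbours on each side of `center`.

    Near a chromosome end the window is shifted in the opposite direction so
    that its size stays fixed (it becomes asymmetric around the anchor).
    """
    half = w // 2
    lo, hi = center - half, center + half  # inclusive bounds
    if lo < 0:                  # ran off the left end: extend to the right
        hi -= lo
        lo = 0
    if hi > len(genes) - 1:     # ran off the right end: extend to the left
        lo -= hi - (len(genes) - 1)
        hi = len(genes) - 1
    lo = max(lo, 0)             # chromosome shorter than the window
    return genes[lo:hi + 1]

def is_synteny_candidate(out_genes, vert_genes, anchor, orthologs, w, m=2):
    """True if at least m outgroup genes in the window (anchor included)
    have an ortholog inside the vertebrate window."""
    o_gene, v_gene = anchor
    o_win = window(out_genes, out_genes.index(o_gene), w)
    v_win = set(window(vert_genes, vert_genes.index(v_gene), w))
    hits = sum(1 for o in o_win if any(v in v_win for v in orthologs.get(o, ())))
    return hits >= m

# Toy genomes with made-up identifiers.
out_genes = [f"O{i}" for i in range(1, 10)]
vert_genes = [f"V{i}" for i in range(1, 10)]
orthologs = {"O1": ["V1"], "O5": ["V8"], "O7": ["V7"]}

central = is_synteny_candidate(out_genes, vert_genes, ("O7", "V7"), orthologs, w=4)
boundary = is_synteny_candidate(out_genes, vert_genes, ("O1", "V1"), orthologs, w=4)
```

In this toy example the anchor O7–V7 is supported by the additional pair O5–V8 and qualifies, whereas the anchor O1–V1 at the chromosome end has no second pair inside its shifted window.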

Calculation of P-value to rule out spurious synteny (6.1C)

Because we rely on content-based synteny and place no bound on W and m, it is important to establish that the observed synteny has not arisen by chance, especially for large W and small m. Given the ortholog (or paralog) relations and the locations of the homologous genes on the outgroup and vertebrate genomes, we calculate, for all identified anchors, the probability of finding at least k homologous genes by chance in a window ($P_{\geq k} \equiv$ P-value), where k is the number of observed outgroup genes (excluding the anchor) having orthologs in the vertebrate window under consideration. For any gene i in a given synteny block (e.g. i = O5 : O9; Figure 6.4), we first calculate the probability $P_i$ of finding at least one ortholog of gene $O_i$ by chance in a given window of size $W_p$ in the vertebrate genome as follows: $P_i = 1 - \sum (l_s - W_p + 1)$ …

Identify putative ohnolog pairs (6.1D)

Because of the two rounds of genome duplication in the vertebrate lineage, each synteny window in the outgroup genome should ideally correspond to up to four windows in the vertebrate genome; however, only a few ohnologs are retained in > 2 copies. To identify candidate ohnolog pairs, we look for anchors in the vertebrate genome that share the same outgroup gene (e.g. O7–V7 and O7–V7′, Figure 6.3 A). The vertebrate genes belonging to such anchors become the ohnolog candidates. At this step, however, the duplication time of the ohnolog candidates can be incorrect: because our synteny criteria are very relaxed, many of these candidates are not dated to the correct duplication time in Ensembl. We therefore exclude, already at this stage, putative ohnolog pairs that are not duplicated at the base of vertebrates or that do not exist as a pair in Ensembl. In later steps, we further filter these candidate pairs by combining their P-values (see Section 6.5).
For the vertebrate genome self-comparison, since we have restricted ourselves to paralogs duplicated at the base of vertebrates, anchors can be taken directly as candidate ohnologs if they pass our probability filters.
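The pairing step for the outgroup comparison can be sketched as below. This is a hedged illustration only: anchors are assumed to be `(outgroup_gene, vertebrate_gene)` tuples, and `vertebrate_base_paralogs` is a hypothetical set of paralog pairs that Ensembl dates to the base of vertebrates. Neither name comes from the thesis.

```python
from collections import defaultdict
from itertools import combinations

def candidate_ohnolog_pairs(anchors, vertebrate_base_paralogs):
    """Pair up vertebrate genes whose anchors share the same outgroup gene.

    anchors: iterable of (outgroup_gene, vertebrate_gene) tuples.
    vertebrate_base_paralogs: set of frozenset({gene_a, gene_b}) for paralog
    pairs dated to the base of vertebrates (hypothetical input).
    """
    by_outgroup = defaultdict(set)
    for o_gene, v_gene in anchors:
        by_outgroup[o_gene].add(v_gene)

    pairs = set()
    for v_genes in by_outgroup.values():
        for a, b in combinations(sorted(v_genes), 2):
            # Keep only pairs with the expected duplication timing; pairs
            # absent from the paralog set are dropped as well.
            if frozenset((a, b)) in vertebrate_base_paralogs:
                pairs.add((a, b))
    return pairs

# Toy anchors with made-up identifiers: O7 anchors three vertebrate genes.
anchors = [("O7", "V7"), ("O7", "V7p"), ("O7", "V7q"), ("O3", "V3")]
vertebrate_base_paralogs = {frozenset(("V7", "V7p"))}
pairs = candidate_ohnolog_pairs(anchors, vertebrate_base_paralogs)
```

Here only V7 and V7p survive as a candidate pair: the pairs involving V7q are not in the paralog set, and O3 anchors a single vertebrate gene.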


Table of contents:

Acknowledgements
Abbreviations & Definitions
I Introduction
1 Preamble
1.1 Résumé de la thèse
1.2 Thesis summary
1.3 Organization of the thesis
1.4 Publications resulted/forthcoming from this thesis
2 Evolution by Gene Duplication
2.1 Mechanisms of gene duplication
2.1.1 Unequal crossing over
2.1.2 Retroposition
2.1.3 Non-homologous mechanisms
2.1.4 Whole genome duplication
3 Evolutionary Constraints & Retention of Duplicated Genes
3.1 Neofunctionalization
3.2 Subfunctionalization
3.3 Buffering against deleterious mutations
3.4 Dosage balance hypothesis
4 Whole Genome Duplications & Evolution of Vertebrates
5 Objectives of This Thesis
II Materials & Methods
6 Identification of Ohnologs
6.1 Input genomes, orthologs and paralogs
6.1.1 Protein coding genes and their genomic coordinates
6.1.2 Orthologs and paralogs
6.2 Identification of the synteny blocks and anchors
6.3 Calculation of P-value to rule out spurious synteny
6.4 Identify putative ohnolog pairs
6.5 Combine P-value from anchors
6.6 Sample genomes with multiple window sizes
6.7 Combine P-value from all outgroups
6.8 Filter ohnolog pairs to remove false positives
6.9 Construction of ohnolog families
6.10 Randomization of the human genome
6.11 Small Scale Duplicates (SSD)
6.12 Ohnologs in the teleost fish genomes: the 3R-WGD
6.12.1 The 2R-WGD
6.12.2 The 3R-WGD
6.13 Development of the OHNOLOGS server
7 Collection of Cancer/Disease Genes & Functional Genomic Data
7.1 Cancer genes
7.1.1 Oncogenes and tumor suppressors
7.1.2 “Core” cancer genes
7.2 Dominant & recessive disease genes
7.3 Haploinsufficient and dominant negative genes
7.4 Genes with autoinhibitory protein folds
7.5 Genes coding for protein complexes
7.5.1 Human protein reference database
7.5.2 Comprehensive resource of mammalian protein complexes
7.5.3 Gene ontology
7.5.4 Census of soluble human protein complexes
7.5.5 Permanent complexes
7.6 Essential genes
7.6.1 Human orthologs of Mouse essential genes
7.6.2 Human essential genes from In-vitro knock-out experiments
7.7 Genes with copy number variations
7.8 Expression Level
7.9 Disease genes in other vertebrates
7.9.1 Mouse
7.9.2 Rat
7.10 Analysis of ohnologs conservation using Ka/Ks ratios
8 Causal Inference Analysis
8.1 Mediation Analysis
8.1.1 Total, direct & indirect effects
8.1.2 Mediation calculations
8.1.3 Interpretation of Mediation results
8.1.4 Application on genomic properties
III Results
9 Characterization of Vertebrate Ohnologs
9.1 Combining information from Multiple outgroups improves ohnolog detection
9.1.1 Comparison with randomized human genome
9.2 Comparison of ohnologs with published datasets
9.3 Ohnolog family size distribution
9.4 Ohnolog pairs for other vertebrates
9.5 The OHNOLOGS server
9.5.1 Search
9.5.2 Interpretation of an ohnolog family
9.5.3 Browse & Download
9.6 Ohnologs in the Teleost fish genomes
9.6.1 Ohnologs from the 2R-WGD
9.6.2 Ohnologs from the 3R-WGD
10 Enhanced Retention of “Dangerous” Genes by WGD
10.1 The Majority of “dangerous” genes retain more ohnologs
10.1.1 Ohnolog–disease association is consistent for high confidence ohnolog datasets
10.1.2 Enhanced retention of “dangerous” ohnologs in Mouse & Rat genomes
10.2 “Dangerous” genes show no biased retention by SSD or CNV
10.2.1 Small scale duplicates from Ensembl
10.2.2 Small scale duplicates from sequence comparisons
10.2.3 Ohnolog and SSD retention bias in different human primary tumors
10.3 Mapping cancer and disease gene duplications on Ensembl duplication nodes
10.4 Ohnologs are more conserved than non-ohnologs
10.5 Dominant, and not recessive disease genes have retained more ohnologs
10.5.1 Recessive disease genes
10.5.2 Essential genes
11 Dosage Balance, Expression level & Human Ohnologs
11.1 Mixed susceptibility of human ohnologs to dosage balance
11.1.1 High retention of protein complexes in ohnologs
11.1.2 Transient versus permanent complexes
11.1.3 Susceptibility of human protein complexes to disease mutations
11.2 Gene expression level and human ohnologs
11.3 Sequence conservation and ohnolog retention
12 Indirect Causes of Ohnolog Retention
12.1 The effect of dosage balance is mediated by mutation susceptibility
12.1.1 Mediation of ‘Dosage.Bal.’ ⇒ ‘Ohnolog’ by ‘Delet.Mut.’ genes
12.1.2 Mediation of ‘Delet.Mut.’ ⇒ ‘Ohnolog’ by ‘Dosage.Bal.’ genes
12.1.3 Mediation of ‘Dosage.Bal.’ ⇒ ‘Ohnolog’ by ‘Delet.Mut.’ genes after excluding SSD and CNV genes
12.1.4 Mediation of ‘Delet.Mut.’ ⇒ ‘Ohnolog’ by ‘Dosage.Bal.’ genes after excluding SSD and CNV genes
12.2 Small effect of essentiality on ohnolog retention
12.3 Negative causal effect of high expression on ohnolog retention
12.4 Sequence conservation & ohnolog retention
12.4.1 Mediation with low Ka/Ks values
12.4.2 Mediation with high Ka/Ks values
13 Population Genetic Model for the Retention of “Dangerous” Ohnologs
IV Discussion & Perspectives
14 Discussion & Perspectives
A Articles
List of Figures
List of Tables
Bibliography