THE FIRST EXPERIENCE OF IMPUTATION IN HUMAN GENETICS WAS BASED ON POPULATION LINKAGE DISEQUILIBRIUM
As is often the case, human genetics is a few steps ahead. As the first genome to be sequenced, the human genome was studied using a variety of SNP chips, and the need for imputation arose from the wish to aggregate data coming from different studies. The main use of genomic data in human genetics is association studies, where the aim is most often to find a causal mutation in a gene involved in a specific disease. The data include individuals from small families (compared with the large half-sib families encountered in dairy cattle) and may come from sub-populations usually not related to each other. For these reasons, imputation is based on the linkage disequilibrium observed at the population level.
Two software packages specifically dedicated to the imputation of human data are briefly presented here. Many more exist, but these two have become standards and are heavily used worldwide.
The first one is fastPHASE, developed by Scheet and Stephens (2006) and derived from PHASE (Stephens et al., 2001) to allow the analysis of larger data sets. It is based on a hidden Markov model. The idea is that over short regions, haplotypes tend to cluster into groups. The model specifies a given number K of unobserved states (clusters of haplotypes). The mosaic structure seen in Figure 4 is produced by fastPHASE; colours can be seen as founders' haplotypes segregating in the population. The software can be used for both haplotyping and imputation. For imputation, the best guess is sampled from the conditional distribution of the observed genotype given the hidden state. Haplotyping consists of assigning a paternal or maternal origin to a given chromosome segment.
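As an illustration of the best-guess step, the sketch below marginalizes over K hidden clusters to obtain the conditional distribution of a missing allele. This is a minimal toy example, assuming cluster posteriors from a forward-backward pass and cluster-specific allele frequencies; none of the names or numbers come from fastPHASE itself.

```python
import numpy as np

# Minimal sketch of imputing one missing allele in a haplotype-cluster HMM
# (fastPHASE-like model). All names and numbers are illustrative, not the
# actual fastPHASE implementation.

K = 4                                            # number of unobserved haplotype clusters
cluster_post = np.array([0.1, 0.6, 0.2, 0.1])    # P(cluster | observed data), e.g. from
                                                 # the forward-backward algorithm
theta = np.array([0.05, 0.90, 0.30, 0.50])       # P(allele = 1 | cluster k) at this marker

# Conditional distribution of the missing allele given the observed data:
# marginalize over the hidden clusters.
p_allele_1 = float(cluster_post @ theta)

# "Best guess" imputation: take the most probable allele.
imputed_allele = int(p_allele_1 > 0.5)
print(f"P(allele=1) = {p_allele_1:.3f}, imputed allele = {imputed_allele}")
```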
The second one is Beagle, developed by Browning and Browning (2007) and heavily used in the field. It is also based on a hidden Markov model and shares similarities with fastPHASE, the main difference being that the haplotype-cluster model is localized: while the number K of unobserved states is required as an input in fastPHASE and remains the same all along the chromosome, this number can differ at every marker position in Beagle. A Directed Acyclic Graph (DAG) summarizing the LD pattern is produced: it gives the transition probabilities from one hidden state at a marker position to the possible hidden states at the next position (see Figure 5). The DAG (which can be seen as a special kind of tree where branches can merge) is simplified by merging similar nodes, and the most likely path through it can be obtained with the Viterbi algorithm. In Beagle's model, recombination corresponds to the merging of edges.
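For reference, here is a generic Viterbi decoder for a marker-by-marker HMM. This is a textbook sketch, not Beagle's code: Beagle's localized clusters make the state space vary per marker, whereas this toy keeps a fixed number K of states and all inputs are assumed.

```python
import numpy as np

# Generic Viterbi sketch: most likely sequence of hidden haplotype states
# across markers. Illustrative only.

def viterbi(init, trans, emit, obs):
    """init: (K,) initial state probs; trans: (K, K) transition probs;
    emit: (K, n_symbols) emission probs; obs: observed allele index per
    marker. Returns the most likely hidden state path."""
    n, K = len(obs), len(init)
    logd = np.log(init) + np.log(emit[:, obs[0]])
    back = np.zeros((n, K), dtype=int)
    for t in range(1, n):
        scores = logd[:, None] + np.log(trans)    # (from state, to state)
        back[t] = scores.argmax(axis=0)           # best predecessor per state
        logd = scores.max(axis=0) + np.log(emit[:, obs[t]])
    path = [int(logd.argmax())]
    for t in range(n - 1, 0, -1):                 # backtrack
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy usage: 2 hidden states, biallelic markers
init = np.array([0.5, 0.5])
trans = np.array([[0.95, 0.05], [0.05, 0.95]])
emit = np.array([[0.9, 0.1], [0.1, 0.9]])         # P(allele | state)
print(viterbi(init, trans, emit, [0, 0, 1, 1, 1]))
```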
CORRELATION BETWEEN TRUE AND IMPUTED GENOTYPES
Most authors report concordance or error rates, based on alleles or genotypes. These measures are simple to calculate, and each is easily derived from the other. Some alternative measures with better properties exist: Hickey et al. (2012) suggested using the correlation between true and imputed genotypes in order to account for MAF.
As we stated in part 2.1, "Using concordance rate as imputation efficiency criterion may be misleading as it depends on MAF. The lower the MAF, the higher the concordance rate, for the same efficiency, and this should be accounted for in the interpretation. For example, with an average MAF of 0.2, 80% of the results would be correct after random sampling of the missing alleles. Getting a 95% concordance rate corresponds only to a 75% ( = (95-80)/(100-80) ) imputation efficiency. Correlations are an alternative criterion less dependent on MAF." One hidden message is that a marker with low MAF presents less variation across individuals and brings less information to the genomic model; a good concordance rate for such a marker is therefore misleading when one wants to predict the information brought by imputed genotypes to the evaluation model. In their study on maize data, Hickey et al. (2012) plotted both concordance rate and correlation as a function of MAF (Figure 6). One can thereby observe differences between the two measures: markers with low MAF present a high concordance rate but a low correlation. We chose to report both kinds of measures.
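The sketch below implements the efficiency correction quoted above and illustrates the concordance/correlation contrast for a low-MAF marker. It is a toy example under the assumption that the naive baseline is always guessing the major allele (expected concordance 1 - MAF, as in the 0.2/80% example); all numbers are illustrative.

```python
import numpy as np

def imputation_efficiency(concordance, maf):
    """Rescale concordance by the baseline obtained when always guessing
    the major allele (expected concordance = 1 - MAF)."""
    baseline = 1.0 - maf
    return (concordance - baseline) / (1.0 - baseline)

print(imputation_efficiency(0.95, 0.20))       # 0.75, as in the text

# Low-MAF marker: guessing the major allele everywhere gives a high
# concordance but no correlation with the true alleles.
rng = np.random.default_rng(1)
maf = 0.05
true_alleles = rng.binomial(1, maf, size=10_000)   # 1 = minor allele
imputed = np.zeros_like(true_alleles)              # always the major allele

print(np.mean(true_alleles == imputed))            # about 0.95 concordance
# The correlation is undefined for a constant vector, which is exactly the
# point: a constant "imputation" carries no information for the model.
```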
COMPARING GENOMIC EVALUATIONS BASED ON IMPUTED GENOTYPES
In animal breeding, the main use of imputed genotypes is to include them in genomic evaluation. The most direct way to measure the consequences of using imputed genotypes in genomic selection is therefore to run genomic evaluations on both imputed and true genotypes and compare the DGV or GEBV obtained. Phenotypes and complete genotypes of the training population are used, together with imputed genotypes for the validation population. Genomic breeding values based on imputed and true genotypes can then be compared, and both can be compared with phenotypes (DYD or deregressed proofs of the validation animals). One can thus estimate the loss in reliability when using imputed genotypes instead of true genotypes.
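A minimal sketch of this comparison is given below. The vectors gebv_true, gebv_imputed and dyd are hypothetical stand-ins for the validation animals (here simulated), and the reliability is taken as the squared correlation with the phenotype, as is standard in validation studies.

```python
import numpy as np

# Hypothetical validation data: DYD and two sets of genomic breeding values,
# one from true genotypes and one from imputed genotypes (simulated here).
rng = np.random.default_rng(0)
n = 500
dyd = rng.normal(size=n)                                   # DYD / deregressed proofs
gebv_true = 0.8 * dyd + rng.normal(scale=0.6, size=n)
gebv_imputed = gebv_true + rng.normal(scale=0.2, size=n)   # imputation noise

r_true = np.corrcoef(gebv_true, dyd)[0, 1]
r_imp = np.corrcoef(gebv_imputed, dyd)[0, 1]

# Loss in reliability (squared correlation) from using imputed genotypes:
print(f"loss in reliability: {r_true**2 - r_imp**2:.3f}")
```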
Imputation of missing SNP markers
Imputation of markers was carried out with the PHASEBOOK package (Druet and Georges, 2010) in combination with Beagle 2.1.3 (Browning and Browning, 2007). The method was applied as a stepwise procedure using both linkage and LD information, following the same procedure as in Zhang and Druet (2010). First, all markers that could be determined unambiguously using Mendelian segregation rules were phased with the LinkPHASE software; both training and test animals were included in this step. An iterative procedure was then applied, in which a directed acyclic graph (DAG) describing the haplotype structure of the genome was fitted to the partially phased data from the previous step; this was done for the reference animals only. After 10 iterations, the final DAG, the genotype file and the output from LinkPHASE (partially phased data) were used to reconstruct haplotypes and impute missing markers for both test and training animals using the Viterbi algorithm. With Beagle and PHASEBOOK, all markers are imputed, so the method does not leave any missing markers. More details on the imputation procedure can be found in Druet et al. (2010) and Zhang and Druet (2010).
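To make the first step concrete, the sketch below applies the Mendelian segregation rule that resolves unambiguous phases from parental genotypes. It is a minimal illustration of the rule, not LinkPHASE's actual code; the function name and genotype coding (counts of the alternate allele: 0, 1, 2, with None for missing) are assumptions.

```python
# Minimal sketch of unambiguous phasing by Mendelian segregation rules,
# not LinkPHASE's actual code. Genotypes: 0, 1, 2 alt-allele counts; None = missing.

def phase_from_parents(child, sire, dam):
    """Return (paternal, maternal) alleles when Mendelian rules make the
    phase unambiguous, else None."""
    if child == 0:
        return (0, 0)
    if child == 2:
        return (1, 1)
    # Heterozygous child: the phase is resolved only if one parent is
    # homozygous, since that parent's transmitted allele is then known.
    if sire in (0, 2):
        pat = sire // 2
        return (pat, 1 - pat)
    if dam in (0, 2):
        mat = dam // 2
        return (1 - mat, mat)
    return None           # both parents heterozygous or missing: ambiguous

print(phase_from_parents(1, 2, 1))   # (1, 0): the sire transmitted the alt allele
```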
Table of contents :
CHAPTER 1 – Imputation and low density SNP chip
1.1. Presentation of low density panels and imputation
1-1.1. The need for a cheap low density SNP panel
1-1.2. Criteria to select markers to include in the low density panel
1-1.3. Defining imputation
1-1.4. Statistical basis of imputation
1.2. The different imputation software
1-2.1. The first experience of imputation in human genetics was based on population linkage disequilibrium
1-2.2. Imputation software specifically dedicated to animal populations, based on linkage and Mendelian segregation rules
1-2.3. Combining the two sources of information: population-based linkage disequilibrium and linkage using family information
1.3. Measures of imputation accuracy
1-3.1. Error rate or Concordance rate
1-3.2. Counting per genotype or per allele
1-3.3. Correlation between true and imputed genotypes
1-3.4. Comparing phases or genotypes?
1-3.5. Measuring consequences of imputation errors on genomic evaluations
1-3.6. Comparing genomic evaluations based on imputed genotypes
1.4. Article I Impact of imputing markers from a low density chip on the reliability of genomic breeding values in Holstein populations
1-4.1. Background
1-4.2. Objectives
1-4.3. Main results
CHAPTER 2 – Defining the most adapted low density panel
2.1. – Article II Short Communication: Imputation performances of three low density marker panels in beef and dairy cattle
2-1.1. Background
2-1.2. Objectives
2-1.3. Main results
2-1.4. Background and objectives
2-1.5. Main results
2.2. – A brief description of the routine imputation procedure implemented in France
CHAPTER 3 – Preferential treatment and bias in genomic evaluations
3.1. The bias induced by preferential treatment
3-1.1. An old issue in genetic evaluations
3-1.2. Why this old issue is highlighted by genomic selection
3-1.3. Preliminary study
3.2. Article IV Inclusion of cows' performances in genomic evaluations and its impact on bias due to preferential treatment
3.3. Strategies applied by different countries regarding preferential treatment
3-3.1. North American consortium
3-3.2. Eurogenomics consortium
3.4. Evidence of bias in genomic predictions when performances of genotyped cows are explicitly included
3-4.1. Evidence of bias in American genomic evaluations
3-4.2. Two kinds of bias: selected subpopulation and preferential treatment
3-4.3. Evidence of bias in French genomic evaluations
3.5. Possible solutions to deal with biases in the cow population
3-5.1. Discard genotyped cows from the reference population
3-5.2. Another solution: target specific cows for genotyping
3-5.3. Yet another solution: adjust cows' performances
3-5.4. A key aspect: identify individuals subject to preferential treatment
CHAPTER 4 – Discussion: Genotyping females and genetic gain
4.1. Measures of genetic gain
4-1.1. The four pathways of genetic gain (Rendel and Robertson, 1950)
4-1.2. Applying Rendel and Robertson’s formula to compare breeding schemes
4-1.3. A stochastic simulation to assess the benefit of genomic selection over progeny-testing
4.2. Genotyping bull dams
4-2.1. A crucial pathway for genetic improvement
4-2.2. Issues related to the use of young animals as parents
4.3. Genotyping on farms: selecting cows to breed cows
4-3.1. Benefit at the national level and return on investment for the farmer
4-3.2. Benefits of genotyping for the dairy farmer
4-3.3. Interest of genotyping cows other than for selection
4-3.4. Discussion of the economic studies on the interest of genotyping heifers
4-3.5. One key aspect: the replacement rate
4-3.6. Practical decisions in herd management favored by genotyping heifers
4-3.7. From a vicious circle to a virtuous circle on functional traits
REFERENCES