Collinearity and its effects – Project topics materials

CHAPTER 3 METHODS

Introduction

The data from six F1, seven F2 and 26 F3 E. grandis trials and from 14 F1 and six F2 P. patula trials (discussed in Chapter 2) were analysed to address two questions. The first was which matrix inversion technique or adapted ridge regression technique used in Best Linear Unbiased Prediction (BLUP) selection calculations deals best with situations where some degree of collinearity in the data may cause instability and affect some components of the models. The second was whether results differ when computer programmes with different numerical precision are used.
The methods used in this study are set out in Figure 3.1. The first step was an exploratory phase to assess what data were available for each trial and to choose data sets of similar ages where possible; the data were then edited. Genetic parameters were then estimated for the trials. The phenotypic correlations between the three selection traits used for the study (DBH, height and stem form) were used to check whether there was a potential degree of collinearity in the data. Breeding values were predicted in the F1 and F2 E. grandis trials and the F1 P. patula trials using different matrix inversion techniques and an adapted ridge regression technique within BLUP. Realised breeding performance was estimated in the F2 and F3 E. grandis trials and the F2 P. patula trials.
Two versions of the same BLUP software package for unbalanced index selection in tree breeding, Matgen (Verryn and Geerthsen 2006), were developed by a software programmer and used to calculate the breeding values (BLUP index values) in this study. These versions were developed from Matgen 5.1, a programme created by Verryn (1994). Verryn (1994) tested Matgen 5.1 thoroughly using simulated data and validated it against solutions from SAS IML (1988) and RESI 4 of Cotterill and Dean (1990), alternative programmes that were available at the time. Apart from round-off error due to differences in the number of significant digits, these comparisons gave solutions virtually identical to those of Matgen 5.1 (Verryn 1994).
In the current study, the predicted breeding values were correlated with the realised breeding performance to estimate the accuracy of prediction (see section 3.7 for the supporting theoretical background). Realised genetic gains were calculated for each technique used and for each set of economic weights.
Here the relative performance of the progeny in the next generation, using the backward BLUP values, provided the best available measure of realised genetic gains. The BLUPs resulting from the different matrix inversion methods and ridge regression, as well as from the two programmes of differing numerical precision, were compared using these measures of accuracy and realised genetic gains. The partial pivoting technique (see section 3.5.1) was used as the control, in which no collinearity mitigation technique was applied. Partial pivoting also served as a further indication of the potential presence of collinearity in the data sets (a requirement for the data to be used in the study, as mentioned in Chapter 2) leading to instability in the calculation of the BLUP index values.

Data exploration

The data exploration step involved an examination of all potential electronic data files of the P. patula and E. grandis breeding programmes of the CSIR to determine the suitability of the data for this study, based on the criteria discussed in section 2.2. Where possible, data sets were chosen from assessments at similar ages within each generation of trials, although the main criterion was that the same traits had to be available in each generation of trials.
Selection of trials was also based on available pedigree data detailing the selections made in each generation to plant the next generation of breeding trials.
The assessment traits used in this study were diameter at breast height (DBH), stem form and height, because data were available for all of these traits at suitable assessment ages for each generation of trials.

Data editing

The data analysis was carried out using SAS/STAT software, Version 9.1 of the SAS System for Windows (copyright © 2002-2003 SAS Institute Inc.). The data were edited before any other analysis was executed in SAS. Tests for normality and checks for outliers and missing data were run. Trees with missing observations and outliers were deleted from the data sets. Outliers were defined as observations that lay more than 1.5 times the interquartile range (IQR) below the 25th percentile or more than 1.5 times the IQR above the 75th percentile, i.e. trees with much smaller or larger DBH, height and stem form values. The data sets were unbalanced due to mortality in the trials.
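The outlier rule described above can be sketched in a few lines of Python. This is only an illustration and not the SAS code used in the study: the function names and the DBH values are hypothetical, and the quartile definition (`method="inclusive"`) is an assumption that may differ from the percentile definition SAS applies.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # Quartile definition is an assumption; SAS may use a different one.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_missing_and_outliers(values, k=1.5):
    """Remove missing entries (None), then values outside the IQR fences."""
    observed = [v for v in values if v is not None]
    lower, upper = iqr_fences(observed, k)
    return [v for v in observed if lower <= v <= upper]

# Hypothetical DBH measurements (cm) with one missing tree and two extremes.
dbh = [14.2, 15.1, None, 13.8, 16.0, 15.5, 14.9, 31.0, 15.3, 2.1]
cleaned = drop_missing_and_outliers(dbh)
```

In this made-up sample the missing value and the two extreme trees (31.0 and 2.1) are dropped, which leaves the data set unbalanced in the same way that mortality does.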
The GLM procedure (PROC GLM) in SAS was used to test for the significance of the fixed effects in the data sets. PROC GLM fits general linear models by least squares and is suitable for unbalanced data; it was used here for a two-way mixed model ANOVA. Both continuous variables and variables with discrete categories can be analysed using this procedure. In the data sets the families were considered random effects and the replications fixed effects. The data sets were corrected for fixed replication effects using least squares means (LS-means). The correction reduces the selection bias towards trees from the better-performing replications. Following the correction step in SAS the variables appear normally distributed, as the normal method of Blom (1958) is used (SAS Institute Inc. 2004). All data sets were also standardized to a mean of zero and a standard deviation of one (method of Blom 1958, cited in SAS Institute Inc. 2004) in order to obtain more nearly normal data. One advantage of standardized data sets is that data from different ages can be compared, because the values are independent of the ranges of actual values and of the units of measurement. Another advantage is that standardization simplifies the interpretation of the relative rankings of individuals in the BLUP index. For example, a trait score of zero equals the trial average and a score of plus one is one standard deviation above the average.
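As a rough illustration of the Blom (1958) normal-score transformation mentioned above, the sketch below converts ranks to standard normal quantiles using the usual Blom formula, score = Φ⁻¹((rank − 3/8) / (n + 1/4)). It is not the SAS procedure used in the study: the function name and the height values are hypothetical, ties are assumed to be absent, and the correction for replication effects is left out.

```python
from statistics import NormalDist

def blom_scores(values):
    """Blom normal scores: inv_cdf((rank - 3/8) / (n + 1/4)).

    Assumes there are no tied values, to keep the sketch short.
    """
    n = len(values)
    # Indices of the observations, ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()  # mean 0, standard deviation 1
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scores[idx] = std_normal.inv_cdf((rank - 0.375) / (n + 0.25))
    return scores

# Hypothetical tree heights (m) within one trial.
heights = [18.4, 21.0, 19.7, 17.2, 20.3]
scores = blom_scores(heights)
```

The median tree receives a score of zero and the scores are symmetric about it, so a positive score indicates a tree above the trial average on the transformed scale.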
The corrected and standardized values were obtained from the following equation in ...

Estimation of genetic parameters

Narrow-sense heritability was estimated for all of the E. grandis and P. patula breeding trials. The narrow-sense heritability estimates served as verification of the breeding potential of the chosen traits and as input for the BLUP calculations.
The narrow-sense heritability ($h^2$) is defined as the ratio of the additive genetic variance ($\sigma_A^2$) to the phenotypic variance ($\sigma_P^2$) (Falconer 1989):
$$h^2 = \frac{\sigma_A^2}{\sigma_P^2}$$
The additive genetic variance is not measured directly and is estimated in different ways depending on whether the population consists of full-sibs or half-sibs. All of the trials used in this study consisted of open-pollinated half-sibs. The additive genetic variance was therefore expressed in terms of the family variance ($\sigma_f^2$) and the coefficient of relationship ($R$):
$$\sigma_A^2 = \frac{\sigma_f^2}{R} \qquad (3.3)$$
A coefficient of relationship (R in equation 3.3) of 0.25 was used for the P. patula trials. This value has historically been used in calculating heritabilities in the P. patula trials that form part of the CSIR breeding programme and was retained in this study. It was believed that there was little or no inbreeding or selfing in these trials, although this assumption has since been questioned (Kanzler 2002; Stanger 2003; Vermaak 2007). In the open-pollinated E. grandis trials a degree of selfing (or related crossing) is expected. In a study comparing heritabilities of open- and control-pollinated E. grandis in the same trials, it was suggested that there may be as much as twenty percent natural selfing in open-pollinated populations (Verryn 1993). Verryn (1993) therefore recommended that the coefficient of relationship be increased from 0.25 to 0.30 for half-sibs in selection and heritability procedures of open-pollinated populations of E. grandis. Similar results for selfing in E. grandis were found by Griffin et al. (1987), Griffin & Cotterill (1988) and Hodgson (1976a, 1976b). Based on these recommendations, a coefficient of 0.30 was used for the E. grandis trials in this study.
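The effect of the choice of coefficient of relationship can be seen in a small worked example. The sketch assumes the standard half-sib relationship, in which the additive genetic variance is the family variance divided by R and the heritability is that additive variance divided by the phenotypic variance. The function name and the variance values are hypothetical and are not taken from the trials.

```python
def narrow_sense_heritability(family_var, phenotypic_var, relationship):
    """h^2 = (family variance / R) / phenotypic variance."""
    if relationship <= 0 or phenotypic_var <= 0:
        raise ValueError("relationship and phenotypic variance must be positive")
    additive_var = family_var / relationship
    return additive_var / phenotypic_var

# Hypothetical variance components for one standardized trait.
family_var = 0.06
phenotypic_var = 1.0

h2_pine = narrow_sense_heritability(family_var, phenotypic_var, 0.25)  # P. patula setting
h2_euc = narrow_sense_heritability(family_var, phenotypic_var, 0.30)   # E. grandis setting
```

For the same variance components, R = 0.25 gives a heritability of roughly 0.24 and R = 0.30 gives roughly 0.20, so allowing for some selfing lowers the estimate.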
The Mixed Model Least-Squares and Maximum Likelihood programme (LSMLMW & MIXMDL PC-2 Version) developed by Harvey (1990a) was used to estimate the genetic variance components needed for BLUP index calculations and to calculate the narrow-sense heritabilities for the assessed traits in each trial.
Two model options of the LSMLMW programme of Harvey (1990a, 1990b) are most frequently used in tree breeding trials, namely model two and model six. Model two may be used for trials that have single-tree plots and for which there are no family-replication interaction effects (Harvey 1990a, 1990b). In this study model two was used for the E. grandis trials and the P. patula F2 trials, which were single-tree plot trials, and model six for the F1 P. patula trials, which had multiple-tree plots.

Prediction of individual breeding values

Individual breeding values (forward prediction) were predicted using BLUP in the F1 population data of the P. patula and E. grandis trials as well as the F2 population data of the E. grandis trials. A BLUP software package for unbalanced index selection in tree breeding, Matgen (Verryn & Geerthsen 2006), was chosen for this study. Although other software programmes exist for the calculation of BLUP values in forestry data, such as ASReml (Gilmour et al. 2009) and TREEPLAN (Kerr et al. 2001), Matgen was used because the programme could easily be adapted to allow for various options of matrix inversion and to test for the effect of different numerical precision. Because the generalised least squares means correction for fixed effects is used, the values input into Matgen are effectively the BLUE (Best Linear Unbiased Estimates) values, and the solution for the predicted breeding values is therefore the BLUP solution (White & Hodge 1989; Verryn 1997).
The following equation (shown in matrix format) from White and Hodge (1989) is used in calculating the breeding values in the Matgen programme
The predicted breeding value thus combines information from all traits of interest into a single index value and all traits under consideration have economic weights and genetic information attached to them. The choice of suitable economic weights is important as it affects the efficiency of the index (Falconer 1989; Zobel & Talbert 1984).
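To make the idea of a single index value concrete, the sketch below works through a two-trait example for one tree. It assumes the commonly quoted best linear prediction form ĝ = C'V⁻¹(y − α), where V is the covariance matrix of the observations, C the covariance between observations and breeding values, and α the expected values of the observations, followed by a weighted sum with the economic weights. This form, all names and all numbers are assumptions for illustration; the exact equation implemented in Matgen may arrange these terms differently.

```python
def invert_2x2(m):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def mat_vec(m, v):
    """Matrix-vector product."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def blup_index(y, alpha, V, C, weights):
    """Assumed form: g_hat = C' V^-1 (y - alpha); index = weights . g_hat."""
    deviations = [yi - ai for yi, ai in zip(y, alpha)]
    g_hat = mat_vec(transpose(C), mat_vec(invert_2x2(V), deviations))
    index = sum(w * g for w, g in zip(weights, g_hat))
    return g_hat, index

# Hypothetical standardized observations for two traits of one tree.
y = [1.0, 0.5]
alpha = [0.0, 0.0]             # standardized data have mean zero
V = [[1.0, 0.5], [0.5, 1.0]]   # phenotypic (co)variances
C = [[0.3, 0.1], [0.1, 0.2]]   # genetic (co)variances
weights = [1.0, 1.0]           # equal economic weights

g_hat, index = blup_index(y, alpha, V, C, weights)
```

With these numbers the predicted breeding values are about 0.3 and 0.1, and the index is about 0.4. Changing the economic weights changes the index, and hence the ranking of trees, without changing the individual predictions.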
In order to test the effect of differences in numerical precision, two versions of Matgen were used. One version of Matgen (Matgen5n) was written in DOS-based Clipper (Computer Associates 1993) and has 16-bit computational numerical precision. This Clipper version was a modified version of the Matgen 5.1 (Verryn 1994) programme. The other version of Matgen (Matgen 7.2) was written in Borland Delphi and has 32-bit computational numerical precision. The analytical and mathematical procedures in both the Clipper and the Delphi programmes were identical, the only key difference being the operational level of numerical precision.
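Why numerical precision matters when traits are highly correlated can be shown with a toy calculation. The sketch compares the determinant of a nearly collinear 2x2 matrix computed in IEEE single precision and in double precision. This is an assumption-laden illustration: it does not reproduce the arithmetic of the Clipper or Delphi versions of Matgen, and the function names and the correlation value are hypothetical.

```python
import struct

def f32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def det_double(a, b):
    """Determinant of [[a, b], [b, a]] in double precision."""
    return a * a - b * b

def det_single(a, b):
    """Same determinant, rounding every intermediate result to single precision."""
    a, b = f32(a), f32(b)
    return f32(f32(a * a) - f32(b * b))

# Two standardized traits with a hypothetical correlation of 0.9999.
d64 = det_double(1.0, 0.9999)
d32 = det_single(1.0, 0.9999)
relative_difference = abs(d32 - d64) / abs(d64)
```

Because the determinant is the small difference of two nearly equal numbers, rounding errors that are negligible in the inputs become visible in the result. The inverse of the matrix divides by this determinant, so the error carries over into every element of the inverse.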
In this study different techniques for the mitigation of collinearity, in the form of different matrix handling techniques for the inversion of the V matrix, were included in Matgen for the calculation of the BLUP values (predicted ĝ values). The matrix inversion techniques were Gaussian elimination (Gauss-Jordan method; Press et al. 1992) with partial pivoting (referred to as the Matgen 56 subroutine, Verryn 1994), which served as the control; Gaussian elimination (Gauss-Jordan method) with full pivoting (Press et al. 1992); and singular value decomposition (SVD) (Press et al. 1992). An adaptation of the ridge regression method (Hoerl & Kennard 1970a) was also included in the Delphi Matgen programme. The SVD method was likewise used only in the Delphi Matgen programme.
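The general idea behind ridge-type methods can be illustrated on a small matrix. The adaptation used in Matgen is not described here, so the sketch only shows the textbook ridge idea of adding a small constant k to the diagonal of a nearly singular matrix before inversion, and how that changes its condition number. The function names, the matrix and the value of k are hypothetical.

```python
import math

def add_ridge(matrix, k):
    """Return a copy of the matrix with k added to each diagonal element."""
    return [[v + k if i == j else v for j, v in enumerate(row)]
            for i, row in enumerate(matrix)]

def condition_number_sym_2x2(m):
    """Ratio of largest to smallest eigenvalue of a symmetric positive definite 2x2 matrix."""
    a, b, d = m[0][0], m[0][1], m[1][1]
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    return (mean + radius) / (mean - radius)

# Two standardized traits with a hypothetical correlation of 0.999.
V = [[1.0, 0.999], [0.999, 1.0]]
V_ridge = add_ridge(V, 0.05)

cond_before = condition_number_sym_2x2(V)
cond_after = condition_number_sym_2x2(V_ridge)
```

In this example the condition number falls from about 2000 to about 40. The price of the improved stability is a small bias, since the matrix being inverted is no longer exactly V.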

Gaussian elimination (Gauss-Jordan method)

Gauss-Jordan elimination is an efficient method for inverting a matrix and is as stable as any of the other direct methods (Press et al. 1992). The sequence of operations performed in Gauss-Jordan elimination is very closely related to that performed in other routines such as singular value decomposition (Press et al. 1992).
Gauss-Jordan elimination uses one or more operations such as interchanging of any two rows, interchanging of any two columns and replacing of a row by a linear combination of itself and any other row, to reduce a matrix to the identity matrix (matrix with diagonal elements all equal to one and all other elements equal to zero) (Press et al. 1992). In Gauss-Jordan elimination the elements above the diagonal are made zero at the same time that zeros are created below the diagonal and the diagonal of ones is made at this time too (Gerald & Wheatley 2004).
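The row operations described above can be sketched as a short routine that inverts a matrix by reducing it to the identity while applying the same operations to an identity matrix placed alongside it. This is a minimal illustration with partial (row) pivoting only; it is not the Matgen 56 subroutine, and the function name, tolerance and example matrix are hypothetical.

```python
def gauss_jordan_inverse(matrix, tol=1e-12):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I].
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Partial pivoting: pick the row with the largest absolute entry in this column.
        pivot_row = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot_row][col]) < tol:
            raise ValueError("matrix is singular or nearly singular")
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        # Scale the pivot row so the diagonal element becomes one.
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        # Create zeros above and below the diagonal in the same pass.
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [rv - factor * cv for rv, cv in zip(aug[r], aug[col])]
    # The right-hand half now holds the inverse.
    return [row[n:] for row in aug]

# Hypothetical phenotypic correlation matrix for three traits.
V = [[1.0, 0.5, 0.2],
     [0.5, 1.0, 0.3],
     [0.2, 0.3, 1.0]]
V_inv = gauss_jordan_inverse(V)
```

Full pivoting would additionally search the remaining columns for the largest element and interchange columns, which requires recording the column swaps and undoing them at the end.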

Declaration
ACKNOWLEDGEMENTS
Table of contents
List of tables
List of figures
Summary
Nomenclature and abbreviations
Definitions
Chapter 1 INTRODUCTION
1.1 A brief history of selection index
1.2 Collinearity and its effects
1.3 Detecting collinearity or diagnostics for collinearity
1.4 Methods of handling or coping with collinearity and resulting instability
1.5 Objective of the study
Chapter 2 MATERIALS
2.1 Introduction
2.2 Criteria for selection of data
2.3 Description of data sets
Chapter 3 METHODS
3.1 Introduction
3.2 Data exploration
3.3 Data editing
3.4 Estimation of genetic parameters
3.5 Prediction of individual breeding values
3.6 Realised breeding performance
3.7 Accuracy of predicted and realised breeding performance
3.8 Realised genetic gains
3.9 Comparison of different calculation methods
Chapter 4 RESULTS: EUCALYPTUS GRANDIS TRIALS
4.1 Introduction
4.2 Data Editing
4.3 Estimation of genetic parameters
4.4 Predicted breeding values
4.5 Realised breeding performance
4.7 Rank correlation comparisons
4.8 Realised genetic gains
Chapter 5 RESULTS: PINUS PATULA TRIALS
5.1 Introduction
5.2 Data Editing
5.3 Estimation of genetic parameters
5.4 Predicted breeding values
5.5 Realised breeding performance
5.6 Accuracy of predicted and realised breeding performance
5.7 Rank correlation comparisons
5.8 Realised genetic gains
Chapter 6 DISCUSSION OF RESULTS
6.1 Predicted breeding values
6.2 Rank correlations
6.3 Comparison of the accuracy (inter-generational correlations) of BLUPs (rfb)
6.4 Impact on realised genetic gains
Chapter 7 CONCLUSION
7.1 Main findings
7.2 Recommended future research
REFERENCES
APPENDIX