Linear model tools for high-throughput gene expression data
Final Report Abstract
In this project, we developed linear model tools for the analysis of high-throughput data. One of the main tasks in the routine application of linear models is to check the underlying assumptions, including normality and homogeneity of variance. These assumptions may be checked using quantile-quantile (QQ) plots and plots of residuals against predicted values (PRPV-plots). It is sometimes difficult, however, to decide whether a departure is present or not. For example, QQ-plots tend to fan out towards the ends, and a somewhat increased scatter at the ends may be well in line with the normality assumption. To aid interpretation of diagnostic plots, we developed a simulation-based procedure for delineating tolerance bands. This was first done for the linear model and then extended to linear mixed models. The methods were illustrated using various datasets, including some high-throughput datasets from our collaborators. A further assumption of linear models is additivity of effects. This is particularly important in blocked experiments, where the analysis model assumes additivity of effects for treatments and blocks. We developed a clustering-based significance test for detecting nonadditivity. In simulations this was shown to compare favourably with competing tests. None of the investigated tests, however, was uniformly most powerful. When assumptions are violated for the linear model, a common remedy is to transform the data. We considered various families of data transformations for proportions and unbounded count data as they arise in high-throughput experiments. Where possible, estimation of the transformation parameter was embedded in a likelihood framework. In the case of transformations involving the shifted logarithmic function, however, we adopted an objective function based on the skewness and kurtosis of the residuals. This was shown to work well using several real examples.
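The idea behind the simulation-based bands for QQ-plots can be illustrated with a minimal sketch. It is not the procedure developed in the project: it assumes an ordinary fixed-effects linear model, uses internally studentized residuals, and computes a simple pointwise envelope rather than the tolerance bands described above. The function names `standardized_residuals` and `qq_envelope` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def standardized_residuals(X, y):
    """Internally studentized OLS residuals (X must have full column rank)."""
    Q, _ = np.linalg.qr(X)                 # orthonormal basis of the column space of X
    h = np.sum(Q**2, axis=1)               # leverages = diagonal of the hat matrix
    resid = y - Q @ (Q.T @ y)              # OLS residuals
    dof = X.shape[0] - X.shape[1]
    s2 = resid @ resid / dof               # residual variance estimate
    return resid / np.sqrt(s2 * (1.0 - h))

def qq_envelope(X, n_sim=2000, level=0.95, seed=0):
    """Pointwise envelope for the ordered residuals under normality.

    The distribution of studentized residuals does not depend on the
    regression coefficients or the error variance, so it is enough to
    simulate standard normal responses for the given design matrix X.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sims = np.empty((n_sim, n))
    for b in range(n_sim):
        y_sim = rng.standard_normal(n)
        sims[b] = np.sort(standardized_residuals(X, y_sim))
    alpha = 1.0 - level
    lower = np.quantile(sims, alpha / 2.0, axis=0)
    upper = np.quantile(sims, 1.0 - alpha / 2.0, axis=0)
    return lower, upper

# Illustrative data (made up for this sketch): simple regression with normal errors.
rng = np.random.default_rng(42)
n = 30
X = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])
y = 2.0 + 3.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

observed = np.sort(standardized_residuals(X, y))
lower, upper = qq_envelope(X, n_sim=1000)
n_outside = int(np.sum((observed < lower) | (observed > upper)))
print(f"{n_outside} of {n} ordered residuals fall outside the pointwise 95% envelope")
```

Because the envelope is pointwise, the chance that at least one ordered residual falls outside it is larger than 5% even when the model is correct; a simultaneous band would have to be calibrated over the whole set of ordered residuals.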
Generalized linear mixed models (GLMM) provide a powerful extension to linear mixed models by allowing distributions other than the normal and by introducing a link function that links the conditional expectation of the response to a linear predictor. GLMMs are particularly relevant for high-throughput data because such data often involve counts. The choice of link function poses a problem similar to the choice of data transformation. We reviewed several families of link functions, which share the desirable property of near orthogonality between the transformation parameter and the parameters of the linear predictor. This property means that we can first optimize the value of the link parameter and then perform inference on the effects in the linear predictor, ignoring the fact that the link parameter had to be estimated. The near orthogonality ensures that the loss of information caused by this approximation is minimal. We illustrated these link functions using an interesting proteomics dataset provided by our coauthors. The dataset was first analysed assuming a binomial distribution of the response. Careful inspection of the residuals revealed, however, that the variance function deviated from that of the binomial distribution. Thus, a pseudo-likelihood approach was adopted that allowed modification of the variance function. Simultaneous estimation of the variance and the link function was done using an objective function sensitive to violations of the assumed variance function. This project has also led to five publications led by our collaborators (Hochholdinger group, Bonn). In this collaborative work we have helped to develop near-optimal experimental designs accounting for all phases of the experiment (growth chamber, laboratory phase, etc.) and to analyse the resulting high-throughput data using our developed linear model tools.
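The two-step strategy of first choosing the link parameter and then estimating the effects in the linear predictor can be sketched as follows. This is only an illustration under strong simplifying assumptions: it fits a fixed-effects binomial GLM without random effects, it profiles the ordinary binomial likelihood rather than the pseudo-likelihood objective used in the project, and it uses the Aranda-Ordaz asymmetric link family purely as a stand-in. The report does not name the families that were reviewed, and no near-orthogonality property is claimed for this one. All function names are hypothetical.

```python
import numpy as np

def inv_link(eta, lam):
    """Inverse Aranda-Ordaz asymmetric link; lam = 1 gives the logit link."""
    return 1.0 - (1.0 + lam * np.exp(eta)) ** (-1.0 / lam)

def dmu_deta(eta, lam):
    """Derivative of the inverse link with respect to the linear predictor."""
    e = np.exp(eta)
    return e * (1.0 + lam * e) ** (-1.0 / lam - 1.0)

def fit_binomial(X, y, m, lam, n_iter=100, tol=1e-10):
    """Fisher scoring for a binomial GLM with the link parameter lam held fixed."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -20.0, 20.0)          # guard against overflow
        mu = np.clip(inv_link(eta, lam), 1e-10, 1.0 - 1e-10)
        d = dmu_deta(eta, lam)
        w = m * d**2 / (mu * (1.0 - mu))              # Fisher scoring weights
        z = eta + (y / m - mu) / d                    # working response
        XtW = X.T * w
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    mu = np.clip(inv_link(X @ beta, lam), 1e-10, 1.0 - 1e-10)
    # Log-likelihood up to the binomial coefficient, which is constant in beta and lam.
    loglik = float(np.sum(y * np.log(mu) + (m - y) * np.log(1.0 - mu)))
    return beta, loglik

def profile_link(X, y, m, grid):
    """Step 1: pick lam on a grid by profile likelihood. Step 2: return beta at that lam."""
    fits = [fit_binomial(X, y, m, lam) for lam in grid]
    best = int(np.argmax([ll for _, ll in fits]))
    return float(grid[best]), fits[best][0], fits[best][1]

# Illustrative data (made up for this sketch): binomial counts generated with a logit link.
rng = np.random.default_rng(1)
x = np.linspace(-2.0, 2.0, 40)
X = np.column_stack([np.ones_like(x), x])
m = 50
p_true = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x)))
y = rng.binomial(m, p_true)

grid = np.array([0.25, 0.5, 1.0, 1.5, 2.0])
lam_hat, beta_hat, ll_hat = profile_link(X, y, m, grid)
print("selected link parameter:", lam_hat)
print("coefficients at the selected link:", beta_hat)
```

In this sketch, inference on the coefficients would proceed as if the selected link parameter were known. That shortcut is only defensible when the link parameter and the coefficients are close to orthogonal, which is the property the reviewed families are reported to have.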
Publications
- Schützenmeister, A., Jensen, U., Piepho, H.P. (2012): Checking assumptions of normality and homoscedasticity in the general linear model. Communications in Statistics – Simulation and Computation 41, 141-154
- Schützenmeister, A., Piepho, H.P. (2012): Residual analysis of linear mixed models using a simulation approach. Computational Statistics and Data Analysis 56, 1405-1416
- Malik, W.A., Möhring, J., Piepho, H.P. (2015): A clustering-based test for nonadditivity in an unreplicated two-way layout. Communications in Statistics – Simulation and Computation