Project Details

Integration of prior biological knowledge into survival models for different types of omics data

Applicant Dr. Kai Kammers
Subject Area Epidemiology and Medical Biometry/Statistics
Term from 2013 to 2015
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 240819500
 
Genes, lifestyle, and environment are three well-recognized factors influencing human health. The underlying biological pathways that explain the variability from the genome to phenotypes of health and disease are still not well understood. An important application of high-dimensional gene expression measurements is risk prediction and the interpretation of the genetic variables in the resulting survival models. A major problem in this context is that the number of genes is typically much larger than the number of observations (individuals). Feature selection procedures can generate predictive models that combine high prediction accuracy with low model complexity. Integrating prior biological knowledge helps to increase the performance and interpretability of these models. The Gene Ontology (GO) database, which provides prior biological gene group information, is particularly suitable for this task. Expression profiles within gene groups are, however, often heterogeneous. There is a promising method for tackling this problem and thus obtaining subgroups with coherent patterns: applying preclustering to the genes within predefined groups, according to the correlation of their gene expression levels, yields improved prediction performance compared to models built with single genes or whole gene groups.
Besides gene expression data, there are other data sources containing genomic, epigenetic, and proteomic information. This research project therefore focuses on extending the preclustering approach to other high-dimensional data. In a first step, results for the preclustering approach will be compared to three group-integrating methods from the literature (Group Lasso, Sparse Lasso, and CoxBoost). There will also be an extensive comparison of different databases providing prior biological gene group information (GO, KEGG, MSigDB, and PANTHER).
As a next step, the experience with survival models for gene expression data will be transferred to genome-wide association studies, which provide high-dimensional single nucleotide polymorphism (SNP) data. Models will be developed that can deal with sets of SNPs by applying group-wise approaches to SNPs that belong, e.g., to the same gene or pathway. Subsequently, the focus will be on building and evaluating statistical models to gain deeper insight into health and disease, using an integrated approach to mine a unique and rich database of genomic, epigenetic, proteomic, and phenotypic information. This topic will also include peptide mapping results provided by the new SWATH technology. For evaluation and publication, all developed statistical methods will be tested on simulated data sets and applied to existing real-world data sets.
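The preclustering idea described above, splitting a predefined gene group into subgroups of genes with correlated expression, can be illustrated with a small sketch. This is not the project's implementation: the abstract does not name the clustering algorithm, so the sketch assumes a simple threshold-based single-linkage grouping on Pearson correlation, and it assumes that each subgroup is summarized by its mean expression per sample before entering a survival model. The function names (`pearson`, `precluster`, `cluster_profile`), the threshold `min_corr`, and the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: treat as uncorrelated
    return sxy / sqrt(sxx * syy)


def precluster(expression, group, min_corr=0.7):
    """Split one predefined gene group into correlated subgroups.

    Assumption: two genes are linked if their correlation is at least
    min_corr, and subgroups are the connected components of these links
    (single linkage cut at a fixed threshold).
    """
    genes = [g for g in group if g in expression]
    parent = {g: g for g in genes}

    def find(g):
        # Union-find lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            if pearson(expression[a], expression[b]) >= min_corr:
                parent[find(a)] = find(b)

    clusters = {}
    for g in genes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(clusters.values())


def cluster_profile(expression, cluster):
    """Assumed summary: mean expression of the subgroup in each sample."""
    n_samples = len(expression[cluster[0]])
    return [
        sum(expression[g][i] for g in cluster) / len(cluster)
        for i in range(n_samples)
    ]


# Toy data: four genes from one hypothetical GO group, five samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [1.1, 2.1, 2.9, 4.2, 5.1],
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],
    "geneD": [4.8, 4.1, 3.2, 1.9, 1.2],
}
group = ["geneA", "geneB", "geneC", "geneD"]

clusters = precluster(expression, group)
profiles = [cluster_profile(expression, c) for c in clusters]
```

In this toy group, the heterogeneous set of four genes separates into two coherent subgroups, and each subgroup profile could then serve as one covariate in a survival model in place of the single genes or the whole group.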
DFG Programme Research Fellowships
International Connection USA
 
 
