Project Details
Ultra-fast haplotype phasing and genotype imputation service using a hybrid FPGA-GPU system
Subject Area
Data Management, Data-Intensive Systems, Computer Science Methods in Business Informatics
Bioinformatics and Theoretical Biology
Epidemiology and Medical Biometry/Statistics
Computer Architecture, Embedded and Massively Parallel Systems
Term
from 2017 to 2022
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 351403079
Large-scale international community projects, such as the United States 1-million-volunteer health study, have been launched to sequence the genomes of more than one million individuals. A major use of these key DNA reference data is phasing and imputation, i.e. estimating personal haplotypes and predicting missing genotypes in the individual genome-wide data of diverse clinical biobanks worldwide. However, utilizing the most recent haplotype reference population of more than 32,000 European individuals keeps even powerful dedicated cluster systems, such as the Sanger Imputation Service (SIS), running for days. Faster and more energy-efficient computational architectures are required to reduce the computational requirements by several orders of magnitude and thereby realize the practical benefits of using diverse sets of worldwide reference populations and larger reference panels.
We propose to provide a phasing and imputation web service built on a hybrid FPGA-GPU system that we will develop for ultra-fast haplotype phasing and genotype imputation from diverse sets of reference populations. Targeting a hybrid composition of Field-Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs) opens a promising new field of research in algorithm design. In addition, we will algorithmically improve the HapHedge data structure of the SIS to develop another fast lookup algorithm well suited to parallel execution and the FPGA architecture. For haplotype phasing and imputation based on the recent Eagle v2 and PBWT tools, our pessimistic (i.e. lower-bound) expectations already show speedups of at least 164-fold and 142-fold on a single FPGA processor compared to a 16-core CPU machine and the Sanger Imputation Service, respectively, thereby reducing the runtime for a medium-size genome-wide data set from days to minutes.
Thus, for phasing and imputation, a single standard computing system with only four FPGA processors, as we plan to provide for our service, is as powerful as a large HPC cluster of more than 650 16-core CPU machines while saving more than 99% of energy, and this does not even take our planned GPU integration into account.
Further, we propose to implement and provide the research community with a freely accessible graphical decision-making web interface. We have already designed a "conversational interface" prototype to enable quick and easy phasing and imputation from diverse sets of reference panels, in conjunction with an "on-the-fly" estimation of phasing and imputation accuracy during runtime. The hybrid FPGA-GPU-based phasing and imputation service will open up entirely new possibilities for applied genetic research, e.g. toggling between imputation panels from diverse sets of worldwide reference populations, to ultimately maximize phasing and imputation accuracy on the individual level - an important prerequisite for personalized medicine.
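The cluster equivalence quoted above appears to follow directly from the single-FPGA lower-bound speedup. The short Python sketch below reproduces that arithmetic under the assumption that throughput scales linearly with the number of FPGAs; the function names and the two-day baseline runtime are illustrative assumptions, not figures from the project.

```python
# Lower-bound speedups quoted in the abstract (one FPGA vs. each baseline).
SPEEDUP_VS_16_CORE_CPU = 164
SPEEDUP_VS_SIS = 142
FPGAS_IN_SERVICE = 4  # planned size of the service system


def equivalent_baseline_machines(n_fpgas, speedup_per_fpga):
    """Baseline machines matched by n_fpgas, assuming linear scaling (assumption)."""
    return n_fpgas * speedup_per_fpga


def accelerated_runtime_minutes(baseline_days, speedup):
    """Runtime in minutes after applying a speedup to a baseline given in days."""
    return baseline_days * 24 * 60 / speedup


machines = equivalent_baseline_machines(FPGAS_IN_SERVICE, SPEEDUP_VS_16_CORE_CPU)
print(machines)  # 656, consistent with "more than 650"

# Hypothetical baseline: a job that takes 2 days on one 16-core CPU machine.
print(round(accelerated_runtime_minutes(2, SPEEDUP_VS_16_CORE_CPU), 1))  # 17.6
```

Under these assumptions, four FPGAs match 4 × 164 = 656 16-core CPU machines, and a hypothetical two-day job would finish in under 20 minutes on a single FPGA, which is in line with the "days to minutes" reduction stated in the abstract.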
DFG Programme
Research Grants