HySim: Hybrid-Parallel Similarity Search for the Analysis of Large Genomic and Proteomic Data

Applicants Professor Dr. Andreas Hildebrandt; Professor Dr. Bertil Schmidt

Subject Area Bioinformatics and Theoretical Biology
Data Management, Data-Intensive Systems, Computer Science Methods in Business Informatics

Term Funded from 2016 to 2021

Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 329350978

Year of creation 2021

Summary of Project Results

Recent years have seen a tremendous increase in the volume of data generated in the life sciences, and the analysis of these datasets poses difficult computational challenges. The objective of the HySim project was to combine methods from big data analytics and high-performance computing to develop new algorithmic approaches to similarity search for the analysis of large-scale genomic and proteomic data that are computationally efficient and potentially more accurate than the current state of the art. The corresponding datasets are produced by two types of high-throughput technologies: next-generation sequencing (NGS) and mass spectrometry (MS). The main algorithmic approach is based on locality-sensitive hashing (LSH) with efficient parallelization on big data clusters and multi-GPU systems.

Our initial research questions were:

1. How can we develop and implement novel algorithms for the classification of metagenomic reads, metagenomic abundance estimation, and read error correction based on an LSH approach? How do we evaluate and compare their performance to the state of the art in terms of accuracy, runtime, and memory consumption?
2. How can we develop and implement novel algorithms for feature detection and database search in MS data based on an LSH approach? How do we evaluate and compare their performance to the state of the art in terms of accuracy, runtime, and memory consumption?
3. How do we parallelize these algorithms on big data clusters and HPC systems to scale towards large-scale datasets? How do these approaches compare?

Our research resulted in a number of key findings that were published in leading journals and conferences:

Metagenomics: We have shown that an LSH-based approach can outperform the state of the art in metagenomics. Our new tools MetaCache and AFS-MetaCache achieve significantly better performance than popular tools such as Kraken2 on simulated as well as real-world data, for both metagenomic classification and abundance estimation. Our scalable GPU-accelerated version (MetaCache-GPU) and cluster-based versions (MetaCache-Spark and MPI-MetaCache) achieve order-of-magnitude speedups compared to existing tools. Furthermore, we have shown that this approach can be successfully adapted to other tasks such as fast mapping of RNA-Seq reads to transcriptomes and rapid activation matrix computation for single-cell RNA-seq reads.

Error Correction: By developing CARE, we have shown that an LSH-based approach can be used to design the first highly accurate yet scalable error correction method based on multiple sequence alignments (MSAs). CARE can reduce the number of false-positive corrections by at least an order of magnitude compared to existing approaches while delivering a similar number of true positives. It also scales efficiently towards billions of reads sequenced from complex genomes (such as human) on both CPUs and GPUs. The multi-GPU versions designed in HySim are based on our new Gossip and WarpCore libraries, which can also be applied to a variety of other problems in bioinformatics and beyond.

Mass Spectrometry: We have developed file formats and infrastructure to support mass spectrometry analysis on big data clusters. Based on these underlying structures, we have developed a feature detection approach for raw data processing in mass spectrometry that is suited to multi-dimensional datasets. We have designed, implemented, and tested a parametric approach that uses knowledge about expected isotopic distributions. This approach was shown to yield results similar to those of established techniques. Building on this work, we have created a fully non-parametric feature detection technique based on locality-sensitive hashing, which is the first of its kind. Finally, we have designed and validated LSH-based techniques for database search in mass spectrometry and have concluded that the technique is applicable, but only has significant advantages in situations where annotations are intrinsically unreliable.
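
To make the LSH idea behind the read classification concrete, the following is a minimal Python sketch of min-hash based classification. It is an illustration under stated assumptions, not the actual MetaCache implementation: the function names (kmer_hash, sketch, build_index, classify), the parameters k, s and window, the BLAKE2b hash, and the simple majority vote are all hypothetical choices made for this example. Each reference is cut into overlapping windows, each window is reduced to its s smallest k-mer hashes, and a read is assigned to the reference whose windows share the most sketch values with the read.

```python
import hashlib
from collections import Counter, defaultdict


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (hash choice is illustrative)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=8, s=4):
    """Min-hash sketch: the s smallest distinct k-mer hashes of seq."""
    hashes = {kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:s]


def build_index(references, k=8, s=4, window=24):
    """Map every sketch value of every reference window to reference names."""
    index = defaultdict(set)
    # Consecutive windows overlap by k-1 characters so that no k-mer is lost.
    stride = window - k + 1
    for name, seq in references.items():
        for start in range(0, max(len(seq) - k + 1, 1), stride):
            for h in sketch(seq[start:start + window], k, s):
                index[h].add(name)
    return index


def classify(read, index, k=8, s=4):
    """Return the reference with the most sketch hits, or None if no hits."""
    votes = Counter()
    for h in sketch(read, k, s):
        for name in index.get(h, ()):
            votes[name] += 1
    return votes.most_common(1)[0][0] if votes else None


# Toy data (made up for this example): two references without shared 8-mers.
references = {
    "A": "ACGTACGGTCAAGCTTAGCCATGCAATCGGATCCGTTAACGGCTAAGTCC",
    "B": "TTTTGGGGTTTTGGGGTTGGTTGGTGTGTGTTGGGTTTGTGGTTTGGGTTT",
}
index = build_index(references)
print(classify(references["A"][:24], index))
```

Because only a few hash values per window are stored and compared, lookup cost depends on the sketch size rather than on the length of the sequences, which is the property that makes this family of methods attractive for large reference collections.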

Project-Related Publications (Selection)

“MetaCache: context-aware classification of metagenomic reads using minhashing”. In: Bioinformatics 33.23 (2017), pp. 3740–3748
A. Müller, C. Hundt, A. Hildebrandt, T. Hankeln, and B. Schmidt
(Available online at https://doi.org/10.1093/bioinformatics/btx520)
“Gossip: Efficient Communication Primitives for Multi-GPU Systems”. In: Proceedings of the 48th International Conference on Parallel Processing. 2019, pp. 1–10
R. Kobus, D. Jünger, C. Hundt, and B. Schmidt
(Available online at https://doi.org/10.1145/3337821.3337889)
“Suffix Array Construction on Multi-GPU Systems”. In: Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing. 2019, pp. 183–194
F. Büren, D. Jünger, R. Kobus, C. Hundt, and B. Schmidt
(Available online at https://doi.org/10.1145/3307681.3325961)
“A big data approach to metagenomics for all-food-sequencing”. In: BMC Bioinformatics 21.1 (2020), pp. 1–15
R. Kobus, J. M. Abuín, A. Müller, S. L. Hellmann, J. C. Pichel, T. F. Pena, A. Hildebrandt, T. Hankeln, and B. Schmidt
(Available online at https://doi.org/10.1186/s12859-020-3429-6)
“Big Data in metagenomics: Apache Spark vs MPI”. In: PLoS One 15.10 (2020), e0239741
J. M. Abuín, N. Lopes, L. Ferreira, T. F. Pena, and B. Schmidt
(Available online at https://doi.org/10.1371/journal.pone.0239741)
“RainDrop: Rapid activation matrix computation for droplet-based single-cell RNA-seq reads”. In: BMC Bioinformatics 21.1 (2020), pp. 1–14
S. Niebler, A. Müller, T. Hankeln, and B. Schmidt
(Available online at https://doi.org/10.1186/s12859-020-03593-4)
“WarpCore: A Library for fast Hash Tables on GPUs”. In: 27th IEEE International Conference on High Performance Computing, Data, and Analytics, HiPC 2020, Pune, India, December 16-19, 2020. IEEE, 2020, pp. 11–20
D. Jünger, R. Kobus, A. Müller, C. Hundt, K. Xu, W. Liu, and B. Schmidt
(Available online at https://doi.org/10.1109/HiPC50609.2020.00015)
“CARE: context-aware sequencing read error correction”. In: Bioinformatics 37.7 (2021), pp. 889–895
F. Kallenborn, A. Hildebrandt, and B. Schmidt
(Available online at https://doi.org/10.1093/bioinformatics/btaa738)
“Locality-sensitive hashing enables signal classification in high-throughput mass spectrometry raw data at scale”. 2021
K. Bob, D. Teschner, T. Kemmer, D. Gomez-Zepeda, S. Tenzer, B. Schmidt, and A. Hildebrandt
(Available online at https://doi.org/10.1101/2021.07.01.450702)
“MetaCache-GPU: Ultra-Fast Metagenomic Classification”. 2021
R. Kobus, A. Müller, D. Jünger, C. Hundt, and B. Schmidt
(Available online at https://doi.org/10.48550/arXiv.2106.08150)
“RNACache: Fast Mapping of RNA-Seq Reads to Transcriptomes Using Min-Hashing”. In: Computational Science - ICCS 2021 - 21st International Conference, Krakow, Poland, June 16-18, 2021, Proceedings, Part I. Ed. by M. Paszynski, et al. Vol. 12742. Lecture Notes in Computer Science. Springer, 2021, pp. 367–381
J. Cascitti, S. Niebler, A. Müller, and B. Schmidt
(Available online at https://doi.org/10.1007/978-3-030-77961-0_31)
