Project Details
Project Harvester: Improving molecular fingerprint prediction through self-training
Applicant
Professor Dr. Sebastian Böcker
Subject Area
Bioinformatics and Theoretical Biology
Analytical Chemistry
Term
since 2023
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 518231245
Rapid annotation of small molecules is of interest in numerous areas of biology and the life sciences. Mass spectrometry (MS) is a key technology for annotating small molecules from small amounts of sample, and structural elucidation of small molecules is usually carried out using tandem mass spectrometry (MS/MS). Computational analysis of MS/MS data is one of the major technological hurdles in metabolomics and small molecule research today. In 2015, my group developed CSI:FingerID for searching MS/MS data in molecular structure databases. Later, we developed CANOPUS for the comprehensive assignment of compound classes without the need for structural elucidation. In 2021, we published the COSMIC workflow, which allows us to differentiate between correct and incorrect annotations. All of these methods depend on MS/MS data to train the underlying machine learning models. Unfortunately, available reference MS/MS libraries are growing slowly, much more slowly than structure databases or publicly available biological data.

The fundamental objective of this project is to harness the publicly available biological data to improve our machine learning models. The prediction of molecular fingerprints from MS/MS data of small molecules lies at the heart of many computational methods such as CSI:FingerID, CANOPUS and MSNovelist. The goal of this project is to substantially improve fingerprint prediction performance through self-training, making use of the billions of unlabeled spectra from small molecules available in public repositories. We will process hundreds of thousands of LC-MS/MS runs publicly available in repositories such as GNPS, find high-confidence annotations, feed those annotated MS/MS spectra back into the training data for fingerprint prediction, and repeat until convergence.

The impact of our project will be twofold. First, we can improve the performance of all methods that rely on fingerprint prediction, including CSI:FingerID, CANOPUS and MSNovelist. Second, our project will provide us with a large public library of MS/MS spectra with putative molecular structure annotations. This will not only allow others to train better machine learning models (for example, for Competitive Fragmentation Modeling, CFM) but also be of value for computational method development in general.
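The self-training loop described above (annotate unlabeled spectra, keep only high-confidence annotations, add them to the training data, repeat until convergence) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the project's actual pipeline: spectra are reduced to toy feature tuples, fingerprints to short bit tuples, the predictor is a simple nearest-neighbor lookup rather than CSI:FingerID, and the names `nearest_neighbor_predict`, `self_train`, `threshold` and `max_rounds` are hypothetical.

```python
from typing import List, Tuple

Spectrum = Tuple[float, ...]    # toy stand-in for an MS/MS spectrum (assumption)
Fingerprint = Tuple[int, ...]   # toy stand-in for a binary molecular fingerprint
Labeled = Tuple[Spectrum, Fingerprint]


def nearest_neighbor_predict(train: List[Labeled],
                             spectrum: Spectrum) -> Tuple[Fingerprint, float]:
    """Toy predictor (assumption): copy the fingerprint of the closest
    training spectrum; confidence shrinks with Euclidean distance."""
    best_fp, best_dist = train[0][1], float("inf")
    for known_spectrum, fp in train:
        dist = sum((a - b) ** 2 for a, b in zip(known_spectrum, spectrum)) ** 0.5
        if dist < best_dist:
            best_fp, best_dist = fp, dist
    return best_fp, 1.0 / (1.0 + best_dist)


def self_train(labeled: List[Labeled],
               unlabeled: List[Spectrum],
               threshold: float = 0.8,
               max_rounds: int = 10) -> Tuple[List[Labeled], List[Spectrum]]:
    """Repeatedly move high-confidence predictions from the unlabeled pool
    into the training set; stop when a round accepts nothing new."""
    train = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        accepted, remaining = [], []
        for spectrum in pool:
            fp, confidence = nearest_neighbor_predict(train, spectrum)
            if confidence >= threshold:
                accepted.append((spectrum, fp))   # putative annotation
            else:
                remaining.append(spectrum)
        if not accepted:
            break                                 # converged: nothing new to add
        train.extend(accepted)                    # "retrain" on the enlarged set
        pool = remaining
    return train, pool


if __name__ == "__main__":
    reference = [((0.0,), (1, 0)), ((10.0,), (0, 1))]
    public_spectra = [(0.2,), (0.4,), (9.8,), (5.0,)]
    train, still_unlabeled = self_train(reference, public_spectra)
    print(len(train), still_unlabeled)
```

In this toy setting, a spectrum that is too far from the reference data in one round can still be annotated in a later round once a closer neighbor has been accepted, which is the effect self-training relies on. A real system would need a far more careful confidence estimate, since accepted errors are fed back into training.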
DFG Programme
Research Grants