Project Details
Development of Kernel-based and ensemble machine learning methods for binary/multi-class processing of environmental LC-HRMS/MS data and multiset modeling of fused data
Applicant
Professor Maryam Vosough, Ph.D.
Subject Area
Analytical Chemistry
Term
since 2023
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 520243139
The main objectives of this project are as follows:
(a) Developing an efficient data mining protocol for evaluating complex environmental data sets acquired in data-independent acquisition (DIA) mode, specifically all-ion fragmentation (AIF), on HPLC-Orbitrap-MS/MS. Because accurate-mass MS/MS spectra are highly valuable for structure elucidation and confirmation, the project will evaluate whether chemometric multiset/tensor decomposition algorithms, incorporated into the non-targeted analysis (NTA) workflow, can increase MS2 spectral coverage for complex AIF data sets. These decompositions can map the data into low-dimensional spaces and separate the variables in all modes of measurement. They will be applied to the LC-MS1 and LC-MS2 data of water samples of differing complexity, first separately and then in fused mode (Work Package 1).
(b) Employing support vector machines (SVM) and random forests (RF), both in their original form and in combination with recursive feature elimination (RFE), for binary and multi-class classification of LC-HRMS data from spatial/temporal surface water samples. The objective is to prioritize and rank pollutants and to find the best subset of features for a predictive model with high accuracy. These classifiers are intended to address previously reported issues, such as classifying longitudinal pollution patterns in surface water streams [49] and the impact of limited replication of water samples and of sample size on the reproducibility and stability of prioritized pollutants in spatiotemporal environmental studies. The output of these methods (initial feature ranking, subset of selected pollutants, and classification accuracy) will be compared with the prioritized list of pollutants obtained from variable importance in projection (VIP) and selectivity ratio (SR) for PLS-DA. This comparison will allow a more comprehensive evaluation of spatially and temporally varying surface water samples, so that pollutants linked to a specific time frame can be prioritized and identified.
Ultimately, a quantitative method based on authentic standards will be developed for the highly ranked pollutants. It will provide reliable information on the relative concentrations of potentially hazardous pollutants (as "mixture exposure") in the set of surface water samples, together with an environmental interpretation, which can then be used in further research such as environmental monitoring and risk assessment studies (Work Package 2).
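The feature-ranking step described in objective (b) might be sketched as follows. This is a minimal illustration and not the project's actual workflow: it assumes Python with scikit-learn, uses a synthetic feature table in place of real LC-HRMS data, and restricts the SVM to a linear kernel because RFE needs per-feature weights. The values N_FEATURES and N_KEEP are hypothetical.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

N_FEATURES = 40  # hypothetical number of LC-HRMS features (peaks)
N_KEEP = 5       # hypothetical size of the prioritized feature subset

# Synthetic stand-in for a samples x features table with three sample classes.
X, y = make_classification(
    n_samples=60, n_features=N_FEATURES, n_informative=5, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

estimators = {
    # Linear kernel is an assumption: RFE needs coef_ or feature_importances_.
    "SVM (linear kernel)": SVC(kernel="linear", C=1.0),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, estimator in estimators.items():
    # Scaling and RFE sit inside the pipeline, so each cross-validation fold
    # selects its features from its own training data only.
    pipeline = make_pipeline(
        StandardScaler(),
        RFE(clone(estimator), n_features_to_select=N_KEEP, step=2),
        clone(estimator),
    )
    accuracy = cross_val_score(pipeline, X, y, cv=3, scoring="accuracy").mean()

    # Refit RFE on all samples to obtain one ranking (rank 1 = retained).
    selector = RFE(clone(estimator), n_features_to_select=N_KEEP, step=2)
    selector.fit(StandardScaler().fit_transform(X), y)

    results[name] = {
        "accuracy": accuracy,
        "ranking": selector.ranking_,
        "selected": selector.support_,
    }

for name, res in results.items():
    print(name, round(res["accuracy"], 2), res["selected"].nonzero()[0])
```

In a setup like this, the retained feature indices from each classifier could be set against a VIP- or SR-based list from PLS-DA to see which features are prioritized consistently across methods.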
DFG Programme
Research Grants