Project Details
Convex space learning for synthetic data generation on clinical tabular datasets
Applicant
Professor Dr. Olaf Wolkenhauer
Subject Area
Medical Informatics and Medical Bioinformatics
Term
since 2022
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 515800538
Synthetic data generation is gaining prominence in biomedical research as a way to address practical problems such as personalization, the underrepresentation of groups in clinical trials, and data privacy constraints that hinder data sharing among institutions. Synthetic data generation for medical images using deep generative networks is a rapidly growing research field. Image datasets offer a perceptual advantage: one can judge how realistic a synthetic image is simply by looking at it. In biomedical science, however, patient data are very commonly stored as tabular datasets, for which this advantage of visual perception is limited. Since 2017, researchers have focused on developing deep generative models for tabular datasets. Over the last three years, we have developed expertise in tabular synthetic data generation to address the problem of imbalanced classification. We developed multiple algorithms for oversampling-driven imbalanced classification and tested their applicability to biological problems such as rare-cell annotation from single-cell transcriptomics data. From these studies emerged the idea of convex space learning, whose theoretical foundations we also explored. With our newest convex space learning model, ConvGeN, we were able to improve classification on imbalanced tabular datasets through synthetic sample generation, compared to state-of-the-art deep generative algorithms designed for tabular datasets. Synthetic samples generated using ConvGeN can approximate feature-wise statistical distributions better than those from existing deep generative algorithms for tabular datasets, since ConvGeN fixes the feature-wise means of the tabular data while learning appropriate feature-wise higher-order moments in a non-linear, iterative fashion. We argue that convex space learning has extensive potential beyond the domain of imbalanced classification that we have explored so far.
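As a rough illustration of the idea behind convex space learning, the following minimal sketch builds synthetic rows as convex combinations of a small neighborhood of real rows. It is not the ConvGeN implementation: ConvGeN learns its convex coefficients, whereas this sketch simply draws them at random, and the function names (convex_weights, convex_sample) and the toy values are hypothetical. It does show two properties of convex combinations: every synthetic value stays within the range spanned by the real rows, and with symmetric random weights the synthetic rows average out to the feature-wise means of the neighborhood.

```python
import random


def convex_weights(k, rng):
    """Draw k non-negative weights that sum to 1.

    Normalized exponential draws give a flat Dirichlet distribution.
    ASSUMPTION: random weights stand in for learned coefficients.
    """
    raw = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


def convex_sample(neighborhood, rng):
    """Return one synthetic row as a convex combination of the given rows."""
    weights = convex_weights(len(neighborhood), rng)
    n_features = len(neighborhood[0])
    return [
        sum(w * row[j] for w, row in zip(weights, neighborhood))
        for j in range(n_features)
    ]


rng = random.Random(0)

# Hypothetical toy neighborhood: 4 rows (patients), 3 numeric features.
neighborhood = [
    [60.0, 1.2, 130.0],
    [65.0, 1.0, 140.0],
    [58.0, 1.5, 125.0],
    [70.0, 1.1, 150.0],
]

synthetic = [convex_sample(neighborhood, rng) for _ in range(5)]
for row in synthetic:
    print([round(value, 2) for value in row])
```

Because the weights are non-negative and sum to one, no synthetic value can fall outside the feature-wise minimum and maximum of the real rows it was built from.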
We propose to extend our model ConvGeN so that it can generate synthetic tabular data outside the context of data imbalance. Furthermore, we propose to investigate the potential use of synthetic data generated by convex space learning in several clinical applications of machine learning, such as patient stratification, classification, and regression. The goal is to establish whether a given machine learning workflow involving synthetic data generation can achieve performance sufficiently similar to that obtained with real data, e.g. in patient stratification. Finally, in association with our clinical partners, we propose to apply the developed algorithm for synthetic sample generation to real-life clinical problems such as privacy preservation.
DFG Programme
Research Grants