Project Details
The future of medical datasets: large-scale analysis and training of medical multimodal models through an iterative ML approach
Applicant
Dr. Robert Kaczmarczyk
Subject Area
Medical Informatics and Medical Bioinformatics
Dermatology
Epidemiology and Medical Biometry/Statistics
Term
since 2023
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 526052741
In recent years, rapid progress in machine learning has been demonstrated by the publication of ever larger datasets and of ever larger, better models trained on them. For example, general-purpose chat programs can now represent a broad knowledge base (ChatGPT), and accurate images (Stable Diffusion, DALL-E, Imagen) or even videos (Phenaki) can be generated from text. However, the datasets used to train these models have mostly not been well studied, neither in general nor with respect to medical data. Similarly, there is currently no good overview of the large medical datasets available on the Internet that could be useful for training medical variants of the above models. Our research project aims to address exactly this gap by first characterizing the landscape of medical datasets on the Internet. Using this data, we will then train simple binary classifiers that enable further, better filtering of existing large datasets, such as LAION-5B, which we published recently (the largest publicly and freely available text-image dataset). The filtered datasets will then serve as a starting point for adapting open contrastive learning models such as open_clip, a free implementation of OpenAI's CLIP, with the ultimate aim of enabling better mapping of medical data. The resulting dataset will be used to train or fine-tune the above general models (e.g., Stable Diffusion) to generate better images, using dermatology as an example, both for training purposes and for a better understanding of dermatological diseases in general. Throughout the research project, all examined and generated datasets, as well as the trained models and their outputs, will be examined for bias with respect to gender, origin, skin color, etc., in order to alert the research field to potential imbalances in (medical) datasets and to promote the creation of more balanced datasets.
In doing so, we will develop and make publicly available general metrics that will continue to be useful for assessing the balance of datasets and model outputs in the future.
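As a purely illustrative sketch of what a dataset balance metric could look like, the following Python snippet computes the normalized Shannon entropy of a group attribute (for example, annotated skin type). This is an assumption chosen for illustration, not the metric the project defines; the function name `normalized_entropy` and the example labels are hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(labels):
    """Balance score in [0, 1] for a list of group labels.

    Illustrative assumption, not the project's metric: Shannon entropy of
    the observed label distribution divided by log(k), where k is the
    number of distinct labels observed. 1.0 means all observed groups are
    equally frequent; values near 0 mean one group dominates.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("labels must not be empty")
    k = len(counts)
    if k == 1:
        # A single group carries no diversity at all.
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(k)


# Hypothetical example: group annotations for the images in a dataset.
balanced = ["I-II", "III-IV", "V-VI"] * 100
skewed = ["I-II"] * 280 + ["III-IV"] * 15 + ["V-VI"] * 5

print(normalized_entropy(balanced))
print(normalized_entropy(skewed))
```

Note that this sketch only considers groups that actually occur in the data, so a group that is missing entirely is not penalized; a practical metric would likely compare against a fixed list of expected groups.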
DFG Programme
WBP Fellowship
International Connection
Canada, USA