Project Details
Interpretable Feature Extraction from Large-Scale Medical Image Data Through Weak Text Supervision: Bridging the Gap Between Abstract Features and Human Language
Applicant
Dr. Philipp Wesp
Subject Area
Medical Informatics and Medical Bioinformatics
Methods in Artificial Intelligence and Machine Learning
Radiology
Term
since 2024
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 553239084
Machine learning (ML) can be considered the state-of-the-art approach for automated image processing and analysis tasks, including applications in radiology. Despite its potential, clinical adoption of ML remains challenging. From a user perspective, one major barrier is the limited interpretability of the abstract image features extracted by typical ML models. While this property does not directly impact the performance of currently deployed ML models, it can substantially limit the comprehensibility of model outputs and predictions, as these are calculated directly from the extracted features. The interpretability barrier can also limit human-machine interaction, which, in a clinical setting, is essential for establishing transparency and trust.

An important component for overcoming this barrier is the integration of human language and image information. Recent research has provided proof of concept in the form of vision-language models such as CLIP (Contrastive Language-Image Pre-training), which are trained in an unsupervised manner, i.e. without a ground-truth reference, on extremely large datasets and stand out for their ability to process and interpret both text and images effectively. Another crucial component is incentivizing ML models to learn disentangled, structured feature representations. The existing literature shows that such representations cannot be learned without an inductive bias, but that they can be learned through weak supervision, i.e. by training the ML model with noisy, incomplete, or imprecise labels. Here, text or semi-structured information, such as medical reports or imaging metadata in DICOM headers, can serve as weak supervision signals.

We believe that the vision-language model CLIP can learn the much-needed structured, interpretable feature representations. We propose a method to train CLIP with language-based weak supervision, using image-characterizing feature vectors generated from medical reports and DICOM headers. We will then measure the interpretability of the learned representations and explore their capabilities in clinical application scenarios.

In summary, this project aims to bridge the gap between the abstract feature representations typically learned by ML models and the language used by medical professionals. Enhancing the interpretability of ML outputs and predictions is crucial if ML models are to effectively assist radiologists and be integrated into clinical practice.
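To make the proposed training setup more concrete, the following is a minimal sketch, assuming a PyTorch implementation: a CLIP-style dual encoder whose text branch is replaced by a projection of the weak-supervision feature vectors derived from medical reports and DICOM headers, trained with CLIP's symmetric contrastive loss. All names (WeaklySupervisedCLIP, weak_proj, the dimensions in the usage comment) are illustrative assumptions and not part of the project description.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WeaklySupervisedCLIP(nn.Module):
    """CLIP-style dual encoder in which the text side consumes image-characterizing
    feature vectors generated from medical reports and DICOM headers (assumed setup)."""

    def __init__(self, image_encoder: nn.Module, image_feature_dim: int,
                 weak_feature_dim: int, embed_dim: int = 512):
        super().__init__()
        self.image_encoder = image_encoder                        # backbone returning one vector per image
        self.image_proj = nn.Linear(image_feature_dim, embed_dim) # project image features into shared space
        self.weak_proj = nn.Linear(weak_feature_dim, embed_dim)   # project report/DICOM-derived vectors
        self.log_temperature = nn.Parameter(torch.tensor(2.659))  # learnable log-temperature, ~log(1/0.07) as in CLIP

    def forward(self, images: torch.Tensor, weak_features: torch.Tensor) -> torch.Tensor:
        img_emb = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        wkf_emb = F.normalize(self.weak_proj(weak_features), dim=-1)
        # pairwise cosine similarities, scaled by the temperature
        logits = self.log_temperature.exp() * img_emb @ wkf_emb.t()
        targets = torch.arange(images.size(0), device=images.device)
        # symmetric contrastive (InfoNCE) loss: matched image/feature-vector pairs lie on the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Illustrative usage (hypothetical shapes): a backbone producing 256-dim image features,
# paired with 32-dim weak-supervision vectors derived from reports and DICOM headers.
# model = WeaklySupervisedCLIP(backbone, image_feature_dim=256, weak_feature_dim=32)
# loss = model(image_batch, weak_feature_batch)

In such a setup, each dimension of the weak-supervision vector corresponds to a report- or header-derived attribute, so the contrastive alignment is intended to tie the learned image features to human-interpretable concepts rather than to purely abstract directions.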
DFG Programme
WBP Fellowship
International Connection
USA