Project Details

TMF - Standards and tools for data monitoring in observational studies – Assessing data quality of texts

Subject Area: Medical Informatics and Medical Bioinformatics
Term: since 2024
Project identifier: Deutsche Forschungsgemeinschaft (DFG) - Project number 538742134
 
A major challenge for data quality assessments in observational health studies is the scope and complexity of the targeted data. While a range of data quality frameworks and tools exists, their main focus is on numerical data. Currently, there is neither a comprehensive conceptual treatment of data quality issues in text fields nor a standard method for measuring them, despite the high susceptibility of text to such issues. Therefore, the key objective of this project is to create a basis for a fully integrated data quality assessment pipeline for text types commonly used in observational health research (structured: e.g., diagnostic codes like ICD-10, ATC codes; semi-structured: e.g., JSON data exports; unstructured: e.g., medical reports, open responses to surveys). The starting point is a quality framework for observational studies with its corresponding analysis package in R, dataquieR. This will be enriched using cutting-edge natural language processing (NLP) approaches. Four goals will be pursued in this project: first, to adapt and expand a data quality framework for observational health studies to meet the specific requirements of data quality assessments in text fields; second, to improve metadata handling to control the automated assessment of text fields; third, to implement and evaluate text-related data quality checks in the dataquieR package; fourth, to develop application-focused learning materials to better engage users. A review of reference frameworks and methodological works addressing structured, semi-structured and unstructured text will guide the further evaluation of our concept. Based on this, we will generate information models in alignment with existing standards to represent knowledge, expectations and requirements for text fields in a machine-readable manner to control automated data quality assessments. The extensions to the data analysis toolbox will primarily be implemented in R and Python; Python is essential for incorporating powerful existing NLP libraries for text analysis. Learning materials will be made available on a central website. To develop concepts and tools with our cooperation partners, we will rely on existing text corpora, including those from observational studies (Study of Health in Pomerania, Dementia Agitation Ontology and Dementia Forum Texts) as well as data from routine care (e.g., in cooperation with the Medical Informatics Initiative). The data privacy implications of assessing text data will be explored in collaboration with the TMF. Networking activities with key German projects and targeted workshops will support the dissemination of our findings and software. Overall, this project will improve the basis for efficient and transparent handling of data quality issues in text fields in the health sciences.
DFG Programme: Research Grants
International Connection: United Kingdom, USA
Co-Investigator: Dr. Johannes Drepper
Cooperation Partners: Clair Blacketer; Dr. Emma Tonkin
 
 
