Project Details
Semantic Text Analytics for Quality Controlled Extraction of Clinical Phenotype Information in Healthcare Integrated Biobanking (STACI2B2)
Subject Area
Epidemiology and Medical Biometry/Statistics
Term
from 2016 to 2023
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 315098900
The growing availability of high-quality biomaterials is a prerequisite for sustainable and reproducible results in translational biomedicine. This holds not only for exploratory research but also, increasingly, for the validation of research outcomes. However, some skepticism has already been expressed as to whether the plethora of results from preclinical studies holds its promise when transferred into clinical practice.
Consider, as an example, ongoing research on biomarkers. There is an apparent discrepancy between the multitude of studies on novel biomarkers and the number of clinically validated applications. One major problem is the lack of concern for quality differences in ancillary samples and the insufficient validation of potential markers against comparison collectives with well-defined diseases and comorbidities that differ from the target disease.
In the past, the infrastructure for high-quality collection and warehousing of such biosamples has been established at many clinical sites by building and maintaining professional biobanks. What is still lacking are routine workflows to sample valid phenotype data, to determine valid comparison collectives, and to properly select samples for high-quality biobanking. In our project, we propose to extract such information from clinical documents using methods from automatic natural language processing. We plan to build a text analytics pipeline using semi-supervised machine learning techniques to harvest medically relevant named entities (such as diseases, drugs, and diagnoses) and relations among these entities (such as the effectiveness or dosage of medications relative to a disease and a patient, or lab and test data for diagnosis) from unstructured clinical documents (such as discharge summaries and radiology or pathology reports).
Automatic text analysis will thus form the basis for computing medically relevant context data from the documents contained in the clinical information system of the university hospital in Jena. It will instantaneously feed structured evaluation procedures for the real-time selection of samples from well-defined collectives of patients when they enter routine laboratories. At the same time, residual material not needed for further diagnosis can be used to build up a repository of comparison samples.
Such an information extraction system for German-language clinical documents, integrated into routine clinical workflows, is currently not available in any German hospital. Moreover, we expect that such a combined effort will have far-reaching implications for future progress in translational medicine that go beyond the exemplary application of determining validated phenotype data, selecting well-defined collectives of patients, and producing high-quality biomaterials.
DFG Programme
Research Grants
Co-Investigators
Andreas G. Henkel; Sebastian Claudius Semler