Project Details
Comprehensive repository of regulatory genomic features and their role in human disease
Applicants
Professor Dr. Ulf Leser; Professor Dr. Dominik Seelow
Subject Area
Human Genetics
Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Bioinformatics and Theoretical Biology
Term
since 2019
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 400728090
The study of regulatory DNA elements has a long tradition in biomedical research. A flood of data exists, ranging from isolated measurements of single gene activities to functional studies and internationally coordinated genome-wide investigations. A comprehensive and high-quality overview of the current state of knowledge regarding gene regulation in humans is an important prerequisite for planning future experiments. However, the results of targeted, high-quality experiments are published only in scientific articles, while regulatory data collected by high-throughput experiments are scattered across a large number of databases. We therefore aim to develop and make available to the international community a comprehensive catalog of regulatory genomic features and of variation in these regions that relates to human diseases.
Our project is divided into a data integration (DI) part and an information extraction (IE) part. In the last funding period, we developed the first text corpus annotated with regulatory information. This corpus was used to train text-mining algorithms that can detect regulatory sequence elements in new texts, which resulted in the first text-mining-derived collection of these elements and their putative associations with genes and diseases. In addition, we have developed an entity normalization method based on deep neural networks and large language models.
In the second application phase, for DI, we will focus on updating and expanding the set of integrated databases and on automating the integration process. In IE, we will shift our focus from entity recognition and normalization to relation extraction. Training the models will require extending the annotation of the corpus to represent relationships between regulatory features and genes, variants, and diseases. We will train state-of-the-art methods for relation extraction on this extended corpus and apply the trained models to disease-specific text collections.
To curate the results, we will develop an innovative method for rapid annotation that focuses on user satisfaction and ease of use, an aspect that is still neglected in currently available software tools. For quick and easy access to all regulatory feature data integrated, extracted, and curated in the project, we will develop a user-friendly web interface with intuitive visualizations that will be integrated into the RegulationSpotter website.
DFG Programme
Research Units