Project Details
Optimizing graph databases focussing on data processing and integration of machine learning for large clinical and biological datasets
Subject Area
Data Management, Data-Intensive Systems, Computer Science Methods in Business Informatics
Bioinformatics and Theoretical Biology
Term
since 2021
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 463414123
Graph databases are an efficient technique for storing and accessing highly interlinked data in a graph structure, such as connections between measurements and environmental parameters or clinical patient data. Their flexible node structure makes it easy to add the results of different examinations, ranging from simple blood pressure measurements to the latest CT and MRI scans or high-resolution omics analyses (e.g., from tumor biopsies or gut microbiome samples). However, the potential of graph databases for data processing and analysis is not yet fully exploited in biological and clinical use cases. In particular, the huge amount of interconnected data to be loaded, processed, and analyzed leads to processing times that are too long for integration into clinical workflows. To address this, novel graph-operator optimizations as well as a suitable integration of analysis approaches are necessary.
This proposal addresses these problems in two directions: (i) suitable optimizations for graph database operations, including the use of modern hardware, and (ii) the integration of machine learning algorithms for easier and faster analysis of biological data. For the first direction, we investigate the state of the art in graph database systems, including their storage and processing models. Subsequently, we propose optimizations for efficient graph maintenance and analytical operators. For the second direction, we aim to bring machine learning algorithms closer to their data providers, the graph databases. As a first step, we feed machine learning algorithms directly with the graph as input by designing suitable graph operators. As a second step, we integrate machine learning directly into the graph database by adding special nodes that represent the model of the machine learning algorithm.
The results of our project are improved operators that exploit modern hardware as well as integration concepts for machine learning algorithms. Because our approaches are devised to be general, they will advance the operation and analysis of huge graphs in a plethora of use cases beyond our target use case of biological and clinical data analysis.
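The two integration steps described above can be illustrated with a minimal sketch. It is not the project's implementation: the in-memory `Graph` class, the `HAS_MEASUREMENT` edge type, the `systolic` property, and the linear model with made-up weights are all assumptions chosen only to show the idea of (1) a graph operator that produces machine learning input and (2) a model stored as a special node.

```python
# Illustrative sketch only; every name here is hypothetical.
from collections import defaultdict


class Graph:
    """Tiny in-memory property graph."""

    def __init__(self):
        self.nodes = {}                 # node id -> {"label": str, **properties}
        self.edges = defaultdict(list)  # source id -> [(edge type, target id)]

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, **props}

    def add_edge(self, src, edge_type, dst):
        self.edges[src].append((edge_type, dst))

    def neighbors(self, src, edge_type):
        return [dst for t, dst in self.edges[src] if t == edge_type]


def feature_operator(graph, patient_id):
    """Step 1: a graph operator that turns a patient's neighborhood
    into a feature vector [mean systolic, max systolic]."""
    values = [graph.nodes[m]["systolic"]
              for m in graph.neighbors(patient_id, "HAS_MEASUREMENT")]
    if not values:
        return [0.0, 0.0]
    return [sum(values) / len(values), max(values)]


def predict(graph, model_id, patient_id):
    """Step 2: apply a linear model that is itself stored as a node."""
    model = graph.nodes[model_id]
    features = feature_operator(graph, patient_id)
    score = sum(w * x for w, x in zip(model["weights"], features))
    return score + model["bias"] > 0


g = Graph()
g.add_node("p1", "Patient")
g.add_node("m1", "Measurement", systolic=150)
g.add_node("m2", "Measurement", systolic=170)
g.add_edge("p1", "HAS_MEASUREMENT", "m1")
g.add_edge("p1", "HAS_MEASUREMENT", "m2")
# The model lives in the graph as a special node (weights are made up).
g.add_node("model1", "Model", weights=[1.0, 0.0], bias=-140.0)

print(feature_operator(g, "p1"))
print(predict(g, "model1", "p1"))
```

Keeping the model as a node means it can be queried, versioned, and linked to the data it was trained on with the same operators as any other part of the graph; a real system would need a far richer model representation than the single weight vector assumed here.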
DFG Programme
Research Grants
Co-Investigator
Dr.-Ing. David Broneske