Project Details
MASS-TEXT: Massive Text Indices
Applicant
Professor Dr. Johannes Christian Fischer, since 3/2021
Subject Area
Theoretical Computer Science
Data Management, Data-Intensive Systems, Computer Science Methods in Business Informatics
Data Management, Data-Intensive Systems, Computer Science Methods in Business Informatics
Term
from 2014 to 2024
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 254939031
Texts are one of the showcases for Big Data. The world wide web, digital libraries, biological sequences like DNA or proteins: they all constitute large textual data that need to be stored, structured, compressed, and searched efficiently. The amount of such data has grown much faster than the storage and computation capacities of common desktop computers, by several orders of magnitude. For example, while ten years ago the human genome (summing up to about 3 Gigabytes) was considered as being a large data set, nowadays we want to analyze the genomes of 100,000 humans or other kind of beings simultaneously. This trend for larger and larger sequences continues due to the decreasing costs for sequencing.Data structures for texts satisfying the above mentioned needs (storage, structuring, search, and sometimes even compression) are called text indexes. Well-known services such as internet search engines (Google, Yahoo, etc.) use a rather simple index data structure called "inverted index". However, these data structures have weaknesses when processing arbitrary queries or unstructured texts such as in bioinformatics and they also have query costs that grow linearly with the amount of data. Our project focuses on indices related to suffix arrays that allow us to find arbitrary patterns in time independent of the amount of data. The overall goal of the project is to scale these data structures to very large inputs using modern hardware.While we made substantial progress in the first phase of the project, substantial challenges remain that we want to pursue in the second phase. In particular, massively parallel distributed memory algorithms for all phases of index construction and load balanced query algorithms. Furthermore, space consumption during construction and querying should be reduced substantially. Finally, we want to consider advanced queries involving ranking and approximate matches.
DFG Programme
Priority Programmes
Subproject of
SPP 1736:
Algorithms for Big Data
Ehemalige Antragsteller
Dr. Simon Gog, until 12/2018; Professor Dr. Peter Sanders, from 12/2018 until 3/2021