Project Details
Projekt Print View

MASS-TEXT: Massive Text Indices

Subject Area Theoretical Computer Science
Data Management, Data-Intensive Systems, Computer Science Methods in Business Informatics
Term from 2014 to 2022
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 254939031
 
Texts are one of the showcases for Big Data. The world wide web, digital libraries, biological sequences like DNA or proteins: they all constitute large textual data that need to be stored, structured, compressed, and searched efficiently. The amount of such data has grown much faster than the storage and computation capacities of common desktop computers, by several orders of magnitude. For example, while ten years ago the human genome (summing up to about 3 Gigabytes) was considered as being a large data set, nowadays we want to analyze the genomes of 100,000 humans or other kind of beings simultaneously. This trend for larger and larger sequences continues due to the decreasing costs for sequencing.Data structures for texts satisfying the above mentioned needs (storage, structuring, search, and sometimes even compression) are called text indexes. Well-known services such as internet search engines (Google, Yahoo, etc.) use a rather simple index data structure called "inverted index". However, these data structures have weaknesses when processing arbitrary queries or unstructured texts such as in bioinformatics and they also have query costs that grow linearly with the amount of data. Our project focuses on indices related to suffix arrays that allow us to find arbitrary patterns in time independent of the amount of data. The overall goal of the project is to scale these data structures to very large inputs using modern hardware.While we made substantial progress in the first phase of the project, substantial challenges remain that we want to pursue in the second phase. In particular, massively parallel distributed memory algorithms for all phases of index construction and load balanced query algorithms. Furthermore, space consumption during construction and querying should be reduced substantially. Finally, we want to consider advanced queries involving ranking and approximate matches.
DFG Programme Priority Programmes
Ehemalige Antragsteller Dr. Simon Gog, until 12/2018; Professor Dr. Peter Sanders, from 12/2018 until 3/2021
 
 

Additional Information

Textvergrößerung und Kontrastanpassung