Project Details
Indexing cluster for multimodal documents
Subject Area
Computer Science
Term
Funded in 2024
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 528131420
Research with applied artificial intelligence creates and uses ever larger and more diverse data collections (corpora). In a typical research process, these data sets grow further as methods from areas such as language technology or visual computing enrich them with additional information and thereby make them accessible to users. The primary goal is to give users the richest and most accessible information possible from the data. Accessibility is provided by indexing technologies, which make data searchable and allow it to be visualized quantitatively. The technological challenge arises from the volume of data and the type of use: the simpler information access is meant to be, the more complexity shifts to the indexing process, which requires computationally intensive processing of very large amounts of data. Furthermore, the resulting indexes must be retained for a long time and remain performant in order to support source-oriented research projects permanently and in a user-oriented way. All tools used to access the enriched, searchable data must therefore handle large, repeatedly updated data volumes efficiently over a long period of time.
The requested device will be used for natural language preprocessing and indexing of very large multimodal document collections. In addition to texts, it will process images, videos, audio files, number series, and relationships between data. The complex linguistic preprocessing steps include, for example, the extraction of proper names of persons, companies, and places ("Named Entity Recognition", NER), grammatical preprocessing such as dependency parsing, and the transformation into semantic representations, which allow content-based search that takes word meaning into account. Preprocessing for audio and video consists of segmentation and indexing by means of automatic transcription and object classification.
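As a small-scale illustration of what making data searchable involves, the following sketch builds a toy inverted index over three invented sentences and answers a conjunctive keyword query. It is not the project's implementation: the function names `build_index` and `search` and the sample documents are assumptions made for this example, and a production system would add the linguistic preprocessing and distributed storage described above.

```python
from collections import defaultdict


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercased token to the set of document ids containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            # Strip surrounding punctuation so "germany." matches "germany".
            index[token.strip('.,;:!?"()')].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, invented for illustration only.
docs = {
    "d1": "Hamburg is a city in Germany.",
    "d2": "The University of Hamburg operates an index cluster.",
    "d3": "An index makes documents searchable.",
}
index = build_index(docs)
hits = search(index, "hamburg index")
print(hits)
```

The work is done once, at indexing time, so that each query reduces to a few set intersections. This is the shift of complexity from access to indexing that the description refers to.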
The index cluster is optimized for Apache Spark as a data-parallel processing framework for preprocessing, for Elasticsearch as a distributed, high-performance NoSQL index, and, prospectively, for Apache Hudi for data lakes. The index cluster consists of 5 server units (nodes), connected in pairs via an InfiniBand network. Each node is equipped with a GPGPU (48 GB of memory) for running neural processing models, and with two CPUs with at least 16 cores each and at least 1 TB of RAM for running data-parallel processing and distributed indexing. For data processing and index provisioning, each node includes approximately 100 TB of SSD storage capacity. The applicant has many years of experience with these applications and already operates a corresponding predecessor system that needs to be replaced.
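The per-node figures above can be summed into cluster-wide totals. The sketch below does that arithmetic. The names `NodeSpec` and `cluster_totals` are assumptions introduced for this example, and the results are raw lower bounds or approximations taken directly from the stated figures, not usable capacity after replication or file-system overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Per-node figures as stated in the project description."""
    gpu_memory_gb: int = 48      # one GPGPU per node
    cpus: int = 2
    min_cores_per_cpu: int = 16  # "at least 16 cores each"
    min_ram_tb: int = 1          # "at least 1 TB"
    approx_ssd_tb: int = 100     # "approximately 100 TB"


def cluster_totals(node: NodeSpec, node_count: int = 5) -> dict[str, int]:
    """Multiply the per-node figures by the number of nodes."""
    return {
        "gpu_memory_gb": node.gpu_memory_gb * node_count,
        "min_cpu_cores": node.cpus * node.min_cores_per_cpu * node_count,
        "min_ram_tb": node.min_ram_tb * node_count,
        "approx_ssd_tb": node.approx_ssd_tb * node_count,
    }


totals = cluster_totals(NodeSpec())
for name, value in totals.items():
    print(f"{name}: {value}")
```

Under these assumptions the cluster offers 240 GB of GPU memory, at least 160 CPU cores, at least 5 TB of RAM, and roughly 500 TB of SSD storage in total.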
DFG Programme
Major Research Instrumentation
Major Instrumentation
Index cluster for multimodal documents
Instrumentation Group
7030 Dedicated, decentralized computing systems, process computers
Applicant Institution
Universität Hamburg