Project Details
(Semi-)Automated thematic text classification as a basis for corpus-linguistic value-added services
Applicants
Professor Dr. Gerhard Heyer; Dr. Marc Kupietz; Professor Dr. Alexander Mehler; Privatdozent Dr. Roman Schneider
Subject Area
Applied Linguistics, Computational Linguistics
Term
since 2024
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 531750631
The project closes a central gap in empirical language research in the digital humanities, linguistics, and social sciences: the lack of topic-based content indexing for very large text corpora. Through close integration of computer science and corpus linguistics, fine-grained classifications will be carried out for highly heterogeneous text types and strongly varying document sizes. The object of research is the German Reference Corpus DeReKo, which is hosted at the Leibniz Institute for the German Language. Currently comprising about 53 billion words, it is by far the largest and most widely used research resource of German-language texts worldwide. The planned content classification is highly relevant for many usage scenarios, ranging from sample stratification of the corpus for case studies, through the creation of comparable multilingual corpora, to the modeling of linguistic variability. A prerequisite for such applications is stratification along dimensions such as time, modality, text genre, and topic. The first three can typically be determined directly from the source data. This is not the case for the highly relevant thematic dimension. Due to the thematic diversity of text content, no suitable ontologies exist for thematic indexing. In addition, there is a general lack of training and test data explicitly tagged with thematic metadata, which significantly limits the use of supervised machine learning methods.
Using DeReKo as an example, we aim to implement and evaluate, for the first time, a thematic classification for Big Corpus Data that is efficient, robust, open source, dynamic (i.e., without a static and thus rapidly outdated category inventory), and fully reusable. The main goals are:
(i) Semantic diversity: Five text classification systems will be mapped and integrated so that users can access systems at different levels of abstraction, depending on their application scenario.
(ii) Ensuring openness in terms of content: In addition to established classification systems such as DDC/UDC, open classification schemes will be integrated. These include Wikipedia's category systems as well as Wikidata's cross-lingual class system. Hierarchical classifiers will be made trainable for dynamic application scenarios.
(iii) Natural Language Pre-Processing: To mitigate the trade-off between processing quality and efficiency, we will investigate the impact of alternative pre-processing routines and frameworks on quality and runtime.
(iv) Reference corpus indexing: DeReKo will be indexed both at the level of individual texts and at the level of text sections, using the above-mentioned classification systems.
(v) Semantic search: An interface for differentiated semantic searches at the text and text-segment level will be implemented. All classification systems provided can be used individually or in combination for this purpose.
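Goals (iv) and (v) can be illustrated with a minimal Python sketch. It is not the project's implementation: the `Unit` schema, the `search` function, the system names `ddc` and `wikipedia`, and all IDs and labels are hypothetical and chosen only for illustration. It assumes each text and each text section carries labels from several classification systems and that a query combines criteria from different systems with a logical AND.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """A classified unit: a whole text or one section of it (hypothetical schema)."""
    unit_id: str
    level: str  # "text" or "segment"
    # Maps a classification system name to the labels assigned by that system.
    labels: dict[str, set[str]] = field(default_factory=dict)


def search(units, level, **criteria):
    """Return the IDs of units at `level` that carry every requested label.

    Each keyword argument names a classification system and the label
    required from it; criteria from different systems are AND-combined.
    """
    hits = []
    for unit in units:
        if unit.level != level:
            continue
        if all(label in unit.labels.get(system, set())
               for system, label in criteria.items()):
            hits.append(unit.unit_id)
    return hits


# Toy index with invented IDs and labels, for illustration only.
index = [
    Unit("t1", "text", {"ddc": {"330"}, "wikipedia": {"Economics"}}),
    Unit("t1.s1", "segment", {"ddc": {"330"}, "wikipedia": {"Finance"}}),
    Unit("t1.s2", "segment", {"ddc": {"400"}, "wikipedia": {"Linguistics"}}),
]

# Segment-level query against a single system.
print(search(index, "segment", ddc="400"))
# Segment-level query combining two systems.
print(search(index, "segment", ddc="330", wikipedia="Finance"))
```

In this toy index the text as a whole is labelled only with its dominant topic, so a text-level query for `ddc="400"` finds nothing, while the same query at segment level finds the section on language. This is the kind of distinction that indexing at both levels is meant to make available.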
DFG Programme
Research data and software (Scientific Library Services and Information Systems)