Project Details
Interactive distributed corpus exploration and annotation infrastructure for large corpora and knowledge-bases
Subject Area
Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Term
from 2016 to 2022
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 315979217
The goal of this project is a research infrastructure for corpus annotation that scales to large text document collections by flexibly building subcorpora. The infrastructure addresses the needs of computational linguists and corpus linguists for a generic tool to perform selective semantic annotation tasks within and across documents. Such an infrastructure is important because it enables the targeted exploitation of the huge amounts of digitally available text for linguistic analysis. The infrastructure should support expert users in exploring large document collections, in setting up an annotation scheme, and in extracting task-specific subcorpora from a large background corpus. The annotation of the corpora should be flexibly distributable to remotely working annotation teams with different qualification levels and backgrounds. Their work should be supported by prioritisation and annotation suggestions based on machine learning, so that a large corpus with high-quality annotations can be created efficiently for training and evaluating the respective algorithms. Thus, the infrastructure should enable the annotation of the same corpus from multiple perspectives by multiple researchers and annotation teams working in parallel. Users should be able to import custom corpora as needed. Further functionality is needed to maintain and expand the knowledge bases used during the semantic annotation tasks, as well as to connect to external standard knowledge bases.
DFG Programme
Research data and software (Scientific Library Services and Information Systems)