Project Details

Computational Language Documentation by 2025

Subject Area Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
General and Comparative Linguistics, Experimental Linguistics, Typology, Non-European Languages
Term since 2019
Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 431440013
 
The main objective of the CLD2025 project is to facilitate the urgent task of documenting endangered languages by leveraging the potential of computational methods. Thanks to improvements in machine learning tools (such as artificial neural networks and Bayesian models), breakthroughs are now possible that effectively support linguistic annotation tasks by automating audio transcription, text glossing and word discovery. Thorough documentation of the world’s dwindling linguistic diversity is much more feasible with these tools than under a manual workflow. For instance, manual transcription of 50 h of speech takes hundreds of hours of work, creating a bottleneck in the language documentation workflow. Another key task, referred to in linguistics as interlinear glossing (in a nutshell: word-by-word translation/annotation), is even more time-consuming and is difficult to perform manually with the required level of consistency. Machine learning models can help particularly with such time-consuming tasks. Yet Natural Language Processing (NLP) remains little used in language documentation, because the technology is still new and evolving rapidly, user-friendly interfaces are still under development, and there are few case studies demonstrating practical usefulness in a low-resource setting. The objective of CLD2025 is therefore to enable the implementation of these techniques in the medium term (by 2025), through the joint construction of models and tools by field linguists and computational linguists and through the development of interfaces and systems that field linguists can use in practice.
We are building on the achievements of the BULB project in terms of corpora and modes of acquisition, as well as the development of models for transcription and segmentation. We are not developing corpora here, but rather focusing on how to exploit existing corpora.
We address automatic processing problems (phoneme and tone transcription, unit discovery, automatic glossing) and validate the methods on a highly diverse set of endangered languages: Mboshi (Bantu C25), Kakabe (Mande), Yongning Na (Sino-Tibetan), and three Nakh-Daghestanian languages (Khinalug, Kryz and Budugh). We will also carry the results of the improved automatic processing over into linguistic work itself: the automatic speech and language processing mechanisms will be used to explore phonetic-phonological issues at the segmental, supra-segmental and tonal levels of the languages addressed in the project.
Finally, from the beginning of the project, the focus will be on the usability of the tools and models developed. This point highlights the fundamentally interdisciplinary aspect of the work carried out here by computational scientists and field linguists. To this end, a recognized field linguist will work full-time on the project and will contribute her experience and expertise to the definition, development and evaluation of the systems developed in the project.
DFG Programme Research Grants
International Connection France
Cooperation Partner Dr. Gilles Adda
Former Applicant Dr.-Ing. Sebastian Stüker, until 4/2022
 
 
