Project Details
Breaking the Unwritten Language Barrier
Subject Area
Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Term
from 2014 to 2019
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 259117245
The BULB project aims to support the documentation of unwritten languages with the help of automatic speech and language processing, in particular automatic speech recognition (ASR) and machine translation (MT). We will address the documentation of three mostly unwritten African languages of the Bantu family (Basaa, Myene and Embosi). The main steps of the project are:

1. To collect the corpora at a reasonable cost, using a three-step methodology that follows the work of S. Bird and M. Liberman:
   - Collecting a large corpus of speech (100 hours) in a community, including elicited material, stories, dialogs and broadcasts.
   - Re-speaking. As the original recordings will be highly spontaneous, with possibly overlapping speech in noisy environments, carefully articulated re-speaking by a reference speaker will give rise to more accurate automatic phonetic transcriptions and to improved material for phonetic/phonological studies.
   - Oral translation. Translation is the natural way to document a new language; oral translations will accelerate the documentation process. Our Bantu data will be translated into French, a major language that is also a second language in the regions of the communities studied.

2. The collected oral data (Bantu originals and French translations) contain the information necessary to document the studied languages. ASR is expected to automatically produce accurate transcriptions in the source and target languages, and MT to provide meaningful alignments between the two, in order to speed up the major tasks of documentation, description and analysis. The major automatic processing steps are:
   - Phonetic transcription of the studied languages. This step first requires a set of language-independent phone models, which must be tuned to the language under study via unsupervised adaptation techniques.
   - Word transcription of the oral French translations. Language and acoustic models need to be adapted to obtain high transcription accuracy.
   - Alignments between the phonetic transcriptions (originals, re-speaking) of the studied language. Such alignments are highly valuable for large-scale acoustic-phonetic studies, phonological and prosodic data mining, and studies of dialectal variation.
   - Cross-language alignments that aim at linking phone sequences in the studied language with French words. Such alignments may prove very useful for morphological studies and for elaborating vocabularies and pronunciations.

The success of the project relies on strong German-French cooperation between linguists and computer scientists. This cooperation will be fostered and strengthened by a series of courses that benefit the scientific community beyond the present consortium. During these courses, linguists will present to computer scientists the major steps involved in documenting an unknown language, and computer scientists will introduce their methods for processing a "new" language, thus generating phonetic transcriptions and pseudo-word alignments to be returned to the linguists.
DFG Programme
Research Grants
International Connection
France