Project Details
Models of morphosyntax for statistical machine translation
Applicant
Professor Dr. Hinrich Schütze
Subject Area
General and Comparative Linguistics, Experimental Linguistics, Typology, Non-European Languages
Term
from 2009 to 2018
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 123083856
Statistical approaches to machine translation (MT) have proven effective in the last few years. However, this does not hold when translating into a morphologically rich language, particularly when there is also significant syntactic divergence between the two languages. The quality of statistical machine translation (SMT) is poor in this case because of independence assumptions made between the models of morphology, syntax and translation that do not reflect linguistic reality.
In the first phase of the project we made significant strides in translating into German, a morphologically rich language. We focused on issues of linguistic representation and linguistic resources within statistical machine translation. We carried out original research on German word formation (addressing both compounds and portmanteaus), German inflectional morphology, and syntactic issues in both English-to-German and German-to-English translation. We published seven papers at top-ranked international conferences as well as two workshop contributions, and also supervised Bachelor's-level and Master's-level student work relevant to the project.
In the proposed phase 2, we will move on from a focus on linguistic representation to working on advanced machine learning approaches for solving the linguistic problems inherent in the difficult machine translation language pair English/German. In the previous phase of the work, we focused on general linguistic problems in translation. One of the most important lessons we learned in analyzing the output of our linguistically enhanced systems is that the domain mismatch between training data and test data is critically important. The training data is mostly taken from the European Parliament proceedings, while the test data comes from the news domain or from many other domains (including the medical domain, which we will study in this phase of the project).
In the proposed phase 2 of the Morphosyntax project, we will follow four main lines of work. We will extend our successful work on German word formation and inflectional morphology, reducing our dependence on hand-crafted morphological resources by determining how to acquire such resources in a semi-supervised manner, paying particular attention to the important issue of domain adaptation. Within hierarchical decoding (where syntactic formalisms are used as the representation for translation), we will study the integration of advanced machine learning methods for choosing syntactic reorderings. We will also study the issue of adding semantic information (in addition to syntactic information) into hierarchical decoding. Finally, we will study different ways to integrate powerful classification approaches (required for the other work packages) directly into the decoder rather than using external pre-processing or post-processing.
DFG Programme
Research Grants