Project Details
Cross-language Learning-to-Rank for Patent Retrieval, Phase 2: Weakly Supervised Learning of Cross-lingual Systems
Applicant
Professor Dr. Stefan Riezler
Subject Area
Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
General and Comparative Linguistics, Experimental Linguistics, Typology, Non-European Languages
Term
from 2012 to 2019
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 211613886
Cross-lingual technologies such as machine translation or cross-lingual information retrieval crucially rely for their learning on human-sourced supervision in the form of in-domain sentence-parallel document collections or relevance rankings for each pair of languages. Any sentence or document outside the curated data is treated as a non-translation or as irrelevant (strong supervision). Collecting and proof-reading such data is an onerous and expensive task that has been accomplished only for very narrow domains, even in well-resourced languages. At the same time, data parallelism governs overall performance in a fundamental way: first, since it enters early in the learning process, its errors and idiosyncrasies propagate down the learning pipeline; second, modeling with strong supervision conflicts with the flexibility of natural language and handicaps domain and task adaptation of cross-lingual applications.

As we have shown in the current DFG project, strong supervision is not strictly necessary for cross-lingual retrieval. One of the most important outcomes of the first phase of the project is the development of methods that yield substantial improvements for cross-lingual information retrieval by learning cross-lingual rankings directly from data that are weakly supervised by relevance indicators such as citations in patents or hyperlinks in Wikipedia pages, but are not strictly parallel. In the proposed second phase of the project we intend to turn this idea on its head by applying the techniques that have proven successful in learning-to-rank for cross-lingual retrieval to discriminative training of machine translation on massive non-parallel data, and, in the process, to further improve our methods for cross-lingual retrieval. The key ingredients of the proposed techniques are the combination of learning from weakly supervised data with methods that best exploit the weak supervision signals, namely fine-grained sparse features and learning from both positive and negative examples.

We motivate our research by an application to translation and cross-lingual retrieval in the medical domain, where massive amounts of quasi-parallel training data are available on the Internet, in research publications, and in patent data. Furthermore, public data from a recent benchmark evaluation on medical translation are available for evaluation.
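To make the central idea concrete, the following is a minimal illustrative sketch (not the project's actual system) of pairwise learning-to-rank from weak relevance signals with fine-grained sparse features: a patent citation or hyperlink is taken as a weak label marking one document as relevant, an uncited document as a negative example, and sparse cross-lingual word-pair features are weighted so that relevant documents are ranked above irrelevant ones. All function names and the toy data are assumptions for illustration only.

```python
from collections import defaultdict

def word_pair_features(query_tokens, doc_tokens):
    """Sparse features: one count per (query word, document word) pair."""
    feats = defaultdict(float)
    for q in query_tokens:
        for d in doc_tokens:
            feats[(q, d)] += 1.0
    return feats

def score(weights, feats):
    """Linear ranking score over the sparse feature vector."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())

def train_pairwise(data, epochs=10, lr=0.1):
    """Perceptron-style pairwise updates: push the weakly labeled relevant
    document above the assumed-irrelevant one for the same query."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for query, relevant_doc, irrelevant_doc in data:
            pos = word_pair_features(query, relevant_doc)
            neg = word_pair_features(query, irrelevant_doc)
            # Update only when the ranking constraint is violated.
            if score(weights, pos) <= score(weights, neg):
                for f, v in pos.items():
                    weights[f] += lr * v
                for f, v in neg.items():
                    weights[f] -= lr * v
    return weights

if __name__ == "__main__":
    # Toy German queries with English documents; the citation link is the
    # only (weak) supervision signal.
    data = [
        (["herz", "infarkt"], ["heart", "attack", "treatment"], ["solar", "panel"]),
        (["impfstoff"], ["vaccine", "dose"], ["gearbox", "clutch"]),
    ]
    w = train_pairwise(data)
    q = ["herz", "infarkt"]
    docs = [["solar", "panel"], ["heart", "attack", "treatment"]]
    ranked = sorted(docs, key=lambda d: -score(w, word_pair_features(q, d)))
    print(ranked[0])  # expected: the medically related document
```

In this sketch the word-pair features play the role of the fine-grained sparse features mentioned above: each weight can be read as a learned association between a source-language and a target-language term, induced solely from the weak relevance signal rather than from sentence-parallel data.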
DFG Programme
Research Grants