Semantic Information Retrieval from Texts, with Electronic Career Guidance as a Case Study (SIR)
Summary of Project Results
The objective of this research project was to design and develop a cross-language information retrieval (CLIR) framework that can accurately translate short queries containing ambiguous terms. The framework should analyze the latent meanings of documents and queries, which are sometimes concealed beneath the surface forms of words, and use the discovered meanings when matching texts across languages. We pursued this goal in three ways. First, we constructed a large-scale translation dictionary from collaboratively constructed resources such as Wikipedia and from manually constructed lexical semantic resources such as WordNet for English and GermaNet for German. Second, we devised an original approach that automatically creates contexts for short queries, along with translation methods that use this additional knowledge to translate more accurately. Third, we developed a semantics-based CLIR framework that includes latent semantic analysis models for the source and the target language and a correlation analysis method that maps the two languages into the same semantic space. Our experiments on standard CLIR datasets show that the proposed approaches work significantly better than previously reported methods. In addition, our methods are potentially easy to extend to other language pairs and to multilingual application environments.
We also (semi-)automatically added sense definitions to GermaNet, since comprehensive sense definitions enhance the usability of the German wordnet for a wide variety of NLP applications. Different alignment algorithms were developed. Using a semi-automatic procedure, we aligned more than 20,000 lexical entries, which were manually checked and added to the GermaNet database. The bilingual ILI records of GermaNet were expanded by about 9,000 to a total of about 29,000 records, providing a comprehensive basis for CLIR applications. Furthermore, these ILI records offer the potential to link concepts to wordnets of languages other than English.
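The third approach in the summary (one latent semantic model per language, plus a correlation analysis that maps both into a shared space) can be illustrated with a small sketch. The summary does not specify the concrete models, so everything here is an assumption made for illustration: SVD-based latent semantic analysis, a regularized canonical correlation analysis, a six-document toy parallel corpus, and the helper names `term_doc_matrix`, `fit_lsa`, `fit_cca`, and `rank_targets`.

```python
import numpy as np

def term_doc_matrix(docs):
    """Build a (documents x vocabulary) count matrix and the vocabulary index."""
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d.split()}))}
    m = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        for w in doc.split():
            m[row, vocab[w]] += 1.0
    return m, vocab

def fit_lsa(m, k):
    """Truncated SVD; returns a (vocabulary x k) matrix that folds counts into k latent dims."""
    _, _, vt = np.linalg.svd(m, full_matrices=False)
    return vt[:k].T

def fit_cca(x, y, reg=1e-3):
    """Regularized CCA on paired rows of x and y; returns both projections and the correlations."""
    mx, my = x.mean(axis=0), y.mean(axis=0)
    xc, yc = x - mx, y - my
    n = x.shape[0]
    cxx = xc.T @ xc / n + reg * np.eye(x.shape[1])
    cyy = yc.T @ yc / n + reg * np.eye(y.shape[1])
    cxy = xc.T @ yc / n

    def inv_sqrt(c):
        vals, vecs = np.linalg.eigh(c)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    wx, wy = inv_sqrt(cxx), inv_sqrt(cyy)
    u, s, vt = np.linalg.svd(wx @ cxy @ wy)
    return (mx, wx @ u), (my, wy @ vt.T), s

def rank_targets(query, src_vocab, src_lsa, src_proj, tgt_shared):
    """Fold a source-language query into the shared space and rank target documents by cosine."""
    q = np.zeros(len(src_vocab))
    for w in query.split():
        if w in src_vocab:          # out-of-vocabulary words are ignored
            q[src_vocab[w]] += 1.0
    mean, proj = src_proj
    q_shared = (q @ src_lsa - mean) @ proj
    sims = tgt_shared @ q_shared / (
        np.linalg.norm(tgt_shared, axis=1) * np.linalg.norm(q_shared) + 1e-12)
    return np.argsort(-sims).tolist()

# Toy parallel corpus (hypothetical): document i in English pairs with document i in German.
en_docs = ["bank money loan credit", "bank river water shore",
           "job career training school", "money credit interest loan",
           "river water fish boat", "career job apprenticeship training"]
de_docs = ["bank geld kredit darlehen", "ufer fluss wasser bank",
           "beruf karriere ausbildung schule", "geld kredit zins darlehen",
           "fluss wasser fisch boot", "karriere beruf lehre ausbildung"]

en_m, en_vocab = term_doc_matrix(en_docs)
de_m, de_vocab = term_doc_matrix(de_docs)
en_lsa, de_lsa = fit_lsa(en_m, k=2), fit_lsa(de_m, k=2)

# Learn the mapping into a shared space from the paired latent document vectors.
en_proj, de_proj, correlations = fit_cca(en_m @ en_lsa, de_m @ de_lsa)
de_shared = (de_m @ de_lsa - de_proj[0]) @ de_proj[1]

ranking = rank_targets("job training", en_vocab, en_lsa, en_proj, de_shared)
print("German documents ranked for the English query:", ranking)
```

The query is never translated word by word in this sketch; it is matched against German documents only through the shared latent space, which is the idea the summary describes. A realistic system would need a much larger parallel or comparable corpus, term weighting, and more latent dimensions.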
Project-Related Publications (Selection)
- 2011. Semi-Automatic Extension of GermaNet with Sense Definitions from Wiktionary. In: Proceedings of 5th Language & Technology Conference, pp. 126-130
Verena Henrich, Erhard Hinrichs, and Tatiana Vodolazova
- 2011. UKP at CrossLink: Anchor Text Translation for Cross-lingual Link Discovery. In: Proceedings of the 9th NTCIR Workshop Meeting on Evaluation of Information Access Technologies: Information Retrieval, Question Answering, and Cross-Lingual Information Access, pp. 487-494
Jungi Kim and Iryna Gurevych
- 2012. Learning Semantics with Deep Belief Network for Cross-Language Information Retrieval. In: Proceedings of the 24th International Conference on Computational Linguistics (December 2012), pp. 579-588
Jungi Kim, Jinseok Nam, and Iryna Gurevych
- 2013. Using Part-Whole Relations for Automatic Deduction of Compound-Internal Relations in GermaNet. Language Resources and Evaluation, special issue on “Wordnets and Relations”, 47 (3), 839-858
Erhard Hinrichs, Verena Henrich, and Reinhild Barkey
- 2013. The People’s Web Meets NLP: Collaboratively Constructed Language Resources. Springer
Iryna Gurevych and Jungi Kim (eds.)
- 2014. Large-scale Multi-label Text Classification - Revisiting Neural Networks. In: Proceedings of the 7th European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, pp. 437-452
Jinseok Nam, Jungi Kim, Eneldo Loza Mencía, Iryna Gurevych, and Johannes Fürnkranz
(See online at https://doi.org/10.1007/978-3-662-44851-9_28)