Project Details
EVIDENCE: computer-assisted interactive extraction of good dictionary examples from large corpora
Subject Area
General and Comparative Linguistics, Experimental Linguistics, Typology, Non-European Languages
Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Term
since 2019
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 433249742
The project will bring together computer scientists and lexicographers to solve a lexicographical problem: the identification and extraction of good examples from a large set of corpus examples. Machine learning will be applied to help lexicographers select good corpus examples for inclusion in dictionary articles. It should ease the lexicographers' task by ranking the examples according to their measured quality, thereby directing attention to the best ones. Since the quality and appropriateness of corpus examples are not well-defined features, unanimous judgment cannot be achieved even among professional lexicographers. With interactive learning, we plan to train an adaptive machine learning model on preferences, which we assume are more consistent across lexicographers: they are more likely to agree that example 1 is better than example 2 than to agree on explicit scores for both examples. Furthermore, it is planned to acquire and integrate the judgment of dictionary users (i.e. informed lay persons) on sets of ranked good examples. The outcome of the project will be a system for the extraction, classification, and ranking of corpus examples. This system will initially be tested in the context of the DWDS, where it will support the lexicographers in their daily work. It is expected that, for each headword, the final system will present a set of good examples that are sufficiently diverse to illustrate various facets of the real use of this word. Furthermore, it will generate additional value for non-expert dictionary users, as it will also supply good examples for headwords that have not yet received full lexical treatment. The new system will allow any user to provide feedback on the quality of examples, which the system will use to learn. In teaching, for example, students will no longer only consume a lexicographic resource but actively participate in its development. The project will also organize workshops to acquire early adopters and to gather feedback from the community. Thus, the proposed method and its application will be useful for other dictionary projects, as they are language-independent and easy to integrate into current state-of-the-art systems for lexicography.
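The idea of learning from pairwise preferences rather than explicit scores can be illustrated with a minimal sketch. It assumes a Bradley-Terry style model fitted by gradient ascent, which is one common way to turn "example A is better than example B" judgments into a ranking; the project description does not state which model is actually used, and the function names, parameters, and sample data below are hypothetical.

```python
import math


def fit_preference_scores(n_items, comparisons, epochs=200, lr=0.1, l2=0.01):
    """Fit one latent quality score per example from pairwise preferences.

    comparisons is a list of (winner, loser) index pairs, each meaning
    "an annotator preferred example `winner` over example `loser`".
    Assumption: a Bradley-Terry model, not necessarily the project's method.
    """
    scores = [0.0] * n_items
    for _ in range(epochs):
        grads = [0.0] * n_items
        for winner, loser in comparisons:
            # Modelled probability that the winner is preferred.
            p = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
            # Gradient of the log-likelihood for this comparison.
            grads[winner] += 1.0 - p
            grads[loser] -= 1.0 - p
        for i in range(n_items):
            # Gradient ascent step with a small L2 penalty to keep scores bounded.
            scores[i] += lr * (grads[i] - l2 * scores[i])
    return scores


def rank_examples(examples, comparisons):
    """Return the examples sorted from most to least preferred."""
    scores = fit_preference_scores(len(examples), comparisons)
    order = sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)
    return [examples[i] for i in order]


if __name__ == "__main__":
    # Hypothetical corpus examples and annotator judgments.
    examples = ["sentence A", "sentence B", "sentence C"]
    judgments = [(2, 0), (2, 1), (0, 1)]  # C > A, C > B, A > B
    for sentence in rank_examples(examples, judgments):
        print(sentence)
```

No annotator has to assign an absolute score here: the ranking emerges from relative judgments alone, and further feedback from lexicographers or dictionary users can be appended to the comparison list before refitting.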
DFG Programme
Research data and software (Scientific Library Services and Information Systems)