Grammar Formalisms beyond Context-Free Grammars and their use for Machine Learning Tasks
Final Report Abstract
The project BeyondCFG addressed the question of how to deal with discontinuous constituents in parsing and machine translation. A particular focus was on approaches based on mildly context-sensitive grammar formalisms. We developed new models and algorithms for probabilistic constituency parsing and for statistical machine translation, using formalisms such as Linear Context-Free Rewriting Systems (LCFRS) and variants of Tree Adjoining Grammar (TAG), extensions of context-free grammars (CFG) that combine aspects of synchronous grammars with the capacity to describe discontinuities. The project developed new mildly context-sensitive (MCS) grammar formalisms, investigated their formal properties, and developed both symbolic and statistical parsers. The latter yield transparent, grammar-based characterizations of syntactic structure while achieving state-of-the-art parsing accuracy. The project also developed the first approach to grammar-less, transition-based parsing of discontinuous constituents. Linked to discontinuous constituency parsing, BeyondCFG also developed several methods for treebanking, combining approaches such as active learning with an intuitive annotation interface. Finally, BeyondCFG developed a grammar-based statistical machine translation system that allows for discontinuous constituents and complex types of alignment.
One topic that was not planned at the outset was morpho-syntactic processing of Arabic. Due to the lack of Arabic constituency treebanks of sufficiently high quality at the time, our focus moved from constituency parsing to morphology. Arabic is interesting in this respect since it displays discontinuous units in morphology. An additional complication was that many Arabic texts involve code switching between a dialect and Modern Standard Arabic. In the context of morpho-syntactic processing of Arabic, the project contributed important results to segmentation, language identification and POS tagging for Arabic NLP.
The project has produced several implementations, including several parsers, tools for processing discontinuous constituency trees, and tools for Arabic NLP, that are publicly available and still in use.
Publications
- Maier, W. 2015. Discontinuous Incremental Shift-reduce Parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 1202–1212. Beijing, China: Association for Computational Linguistics. (See online at https://doi.org/10.3115/v1/P15-1116)
- Kaeshammer, M. 2015. Hierarchical Machine Translation With Discontinuous Phrases. In Proceedings of the Tenth Workshop on Statistical Machine Translation, 228–238. Lisbon, Portugal: Association for Computational Linguistics. (See online at https://doi.org/10.18653/v1/W15-3028)
- Kallmeyer, L. 2015. On the Mild Context-Sensitivity of k-Tree Wrapping Grammar. In Proceedings of the 20th and 21st International Conferences on Formal Grammar - Volume 9804, 77–93. Berlin, Heidelberg: Springer-Verlag. (See online at https://doi.org/10.1007/978-3-662-53042-9_5)
- van Cranenburgh, A., R. Scha & R. Bod. 2016. Data-oriented parsing with discontinuous constituents and function tags. Journal of Language Modelling 4(1). 57–111. (See online at https://doi.org/10.15398/jlm.v4i1.100)
- Maier, W. & T. Lichte. 2016. Discontinuous parsing with continuous trees. In Proceedings of the Workshop on Discontinuous Structures in Natural Language Processing, 47–57. San Diego, California: Association for Computational Linguistics. (See online at https://doi.org/10.18653/v1/W16-0906)
- Kallmeyer, L. & W. Maier. 2016. LR Parsing for LCFRS. Algorithms 9(3). (See online at https://doi.org/10.3390/a9030058)
- Samih, Y., S. Maharjan, M. Attia, L. Kallmeyer & T. Solorio. 2016. Multilingual Code-switching Identification via LSTM Recurrent Neural Networks. In Proceedings of the Second Workshop on Computational Approaches to Code Switching, 50–59. Austin, Texas: Association for Computational Linguistics. (See online at https://doi.org/10.18653/v1/W16-5806)
- Samih, Y., M. Eldesouki, M. Attia, K. Darwish, A. Abdelali, H. Mubarak & L. Kallmeyer. 2017. Learning from Relatives: Unified Dialectal Arabic Segmentation. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), 432–441. Vancouver, Canada: Association for Computational Linguistics. (See online at https://doi.org/10.18653/v1/K17-1043)
- van Cranenburgh, A. 2018. Active DOP: A constituency treebank annotation tool with online learning. In Proceedings of COLING system demonstrations, 38–42.
- Waszczuk, J., R. Ehren, R. Stodden & L. Kallmeyer. 2019. A Neural Graph-based Approach to Verbal MWE Identification. In Proceedings of the Joint Workshop on Multiword Expressions and WordNet (MWE-WN 2019), 114–124. Florence, Italy: Association for Computational Linguistics. (See online at https://doi.org/10.18653/v1/W19-5113)
- Bladier, T., J. Waszczuk, L. Kallmeyer & J. Janke. 2019. From partial neural graph-based LTAG parsing towards full parsing. Computational Linguistics in the Netherlands Journal 9. 3–26.
- Bladier, T., J. Waszczuk & L. Kallmeyer. 2020. Statistical Parsing of Tree Wrapping Grammars. In Proceedings of the 28th International Conference on Computational Linguistics, 6759–6766. Barcelona, Spain (Online): International Committee on Computational Linguistics. (See online at https://doi.org/10.18653/v1/2020.coling-main.595)