Project Details
Inducing syntactic structure
Applicant
Professor Dr. Laura Kallmeyer
Subject Area
Applied Linguistics, Computational Linguistics
Term
since 2024
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 545523981
The starting point of this project is the observation that (i) across syntactic theories, treebank formats, and languages, a large variety of syntactic structures has been proposed; and (ii) self-supervised contextual language models (LMs) have been shown to capture syntactic information to a certain extent, although it is not clear how these models generalize. In this project, we remain neutral with respect to the underlying theory and aim to induce syntactic constituency structure from LMs in an unsupervised way. We will experiment with different types of neural network architectures that make different assumptions about the overall hierarchical structures we extract. Our central research questions are:
Q1 How can we automatically learn syntactic structure from processing raw text?
Q2 How do the emerging structures relate to established constituency structures from linguistic theory?
Q3 How useful are the emerging structures for NLP applications?
To address Q1, we will induce syntactic structure in an unsupervised way from raw text. We will focus on the grouping of tokens into phrases and on the categories of these phrases, i.e., our principal focus is on constituency structure. However, we will also look into identifying the syntactic heads of constituents, which will allow us to also induce a dependency structure. We will perform syntax induction on a range of different languages. Concerning Q2, we will compare our results to a range of existing syntactic theories and annotation schemes. In this way, we hope, on the one hand, to find empirical evidence for certain assumptions made in syntactic theory and, on the other hand, to identify a constituency format that emerges from text data and might therefore be a good candidate for use in syntactic parsing and annotation. Q3 aims at assessing the latter.
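To make the idea behind Q1 concrete, one family of unsupervised induction methods derives a "syntactic distance" score for each gap between adjacent tokens (in practice from an LM, e.g., from how strongly representations change under perturbation) and recursively splits the sentence at the largest distance to obtain a binary constituency tree. The following is a minimal sketch of that splitting step only; the distance values and the sentence are made up for illustration, and this is not necessarily the approach the project will adopt.

```python
# Sketch of distance-based constituency induction. The distances would, in a
# real system, come from a language model; here they are hypothetical values.

def build_tree(tokens, distances):
    """Recursively split tokens at the position of the largest gap distance.

    distances[i] scores the gap between tokens[i] and tokens[i + 1], so
    len(distances) == len(tokens) - 1. Returns a nested tuple (binary tree).
    """
    if len(tokens) == 1:
        return tokens[0]
    split = max(range(len(distances)), key=lambda i: distances[i])
    left = build_tree(tokens[:split + 1], distances[:split])
    right = build_tree(tokens[split + 1:], distances[split + 1:])
    return (left, right)

tokens = ["the", "cat", "sat", "on", "the", "mat"]
# Hypothetical scores: the large value after "cat" separates subject and predicate.
distances = [0.2, 0.9, 0.5, 0.3, 0.1]
tree = build_tree(tokens, distances)
# → (("the", "cat"), ("sat", ("on", ("the", "mat"))))
```

The recursion makes the structural assumption explicit: every tree produced this way is strictly binary, which is exactly the kind of built-in bias the project plans to vary across architectures.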
Ideally, a syntactic annotation format should contain enough syntactic detail to provide valuable information for downstream tasks while being sufficiently general and learnable to allow for high-quality annotation and parsing. To evaluate the usefulness of the emerging syntactic structures in NLP contexts, we will integrate the results of the different induction approaches into supervised parsing architectures and into several downstream tasks.
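A standard way to quantify how induced structures relate to an existing annotation scheme (Q2) is unlabeled bracketing F1: each tree is reduced to its set of constituent spans, and precision and recall are computed over the overlap with the gold spans. The sketch below illustrates this for trees encoded as nested tuples; the function names are illustrative, and unlike tools such as EVALB it keeps the trivial whole-sentence span.

```python
# Sketch of unlabeled bracketing F1 between an induced and a gold tree,
# both represented as nested tuples of token strings.

def spans(tree, start=0):
    """Collect the (start, end) spans of all constituents in a nested-tuple tree.

    Single tokens (strings) contribute no span; the whole-sentence span is kept.
    Returns (end position, set of spans).
    """
    if isinstance(tree, str):
        return start + 1, set()
    end, result = start, set()
    for child in tree:
        end, child_spans = spans(child, end)
        result |= child_spans
    result.add((start, end))
    return end, result

def bracketing_f1(induced, gold):
    """Harmonic mean of span precision and recall against the gold tree."""
    _, pred = spans(induced)
    _, ref = spans(gold)
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, the left-branching tree `(("a", "b"), "c")` and the right-branching tree `("a", ("b", "c"))` share only the whole-sentence span, giving an F1 of 0.5.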
DFG Programme
Research Grants
International Connection
Canada
Cooperation Partner
Professor Dr. Hassan Sajjad