Project Details
SENLP - Software Engineering knowledge of NLP models
Applicant
Professor Dr. Steffen Herbold
Subject Area
Software Engineering and Programming Languages
Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Term
since 2023
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 524228075
The transformer architecture has changed the field of Natural Language Processing (NLP) and paved the way for models such as BERT and GPT. These models have in common that they use transfer learning in the form of pre-training to learn a general representation of language, which can then be fine-tuned or prompted to perform various downstream tasks. While these models achieve remarkable results on a variety of NLP tasks, it is often unclear why they perform well on specific tasks and how well they work in different domains, such as Software Engineering (SE). In our prior work, we examined the impact of domain-specific pre-training on NLP tasks within the SE domain and found that, for polysemous words like "bug" (insect vs. defect) or "root" (plant vs. user), domain-specific pre-training helped the models capture the SE meaning and that this also led to better performance on domain-specific downstream tasks.
Within this project, we want to deepen our understanding of the capability of NLP models to capture concepts from the SE domain, with a focus on SE definitions and commonsense knowledge. We will use the analogy of NLP models as students to understand how they would perform in SE exams. For example, we will test whether the NLP models contain accurate SE definitions and terminology: can they spot the correct definition of a term in a multiple-choice test, can they generate accurate definitions given a prompt, can they recognize whether two definitions are synonymous, and can they differentiate between similar concepts with important differences and, given a prompt, even explain those differences? A known limitation of large language models in the general domain is that they always answer, even when given inputs based on wrong assumptions. We will try to understand whether similar behavior occurs in the SE domain, e.g., by looking at how models react to prompts asking which tools can be used to execute automated manual tests or what the best object-oriented design patterns for Haskell are. Through our work, we aim not only to identify whether we get nonsense responses, but also whether we can find methods to infer that generated responses are nonsense, as is possible in the general domain.
Additionally, we study the above aspects for different types of models: smaller models with an encoder-only transformer architecture (e.g., BERT), larger encoder-only models (e.g., RoBERTa), models with variations of the transformer architecture that allow for longer contexts (e.g., Big Bird), GPT-style decoder-only models (e.g., GPT-NeoX), and encoder-decoder models (e.g., T5). We will consider both SE-specific pre-training and models trained on general-domain data. Since some general-domain models were already pre-trained on corpora that include SE data, this also allows us to understand whether SE knowledge is sufficiently captured when it makes up only a small part of a very large data set.
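To make the polysemy probing concrete, the following is a minimal sketch of a cloze-style probe, not the project's actual evaluation setup; the model choice and sentences are illustrative assumptions. It compares what a masked language model predicts for an ambiguous word such as "bug" in a general context versus an SE context.

```python
# Illustrative cloze-style probe for a polysemous word; the model and the
# example sentences are assumptions, not the project's actual setup.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

contexts = {
    "general": "A tiny [MASK] was crawling across the leaf.",
    "software": "The developer fixed the [MASK] reported in the issue tracker.",
}

for label, text in contexts.items():
    predictions = fill(text, top_k=5)
    tokens = [p["token_str"] for p in predictions]
    print(f"{label}: {tokens}")
```

A general-domain model and an SE-pre-trained model can then be compared on how strongly the SE context shifts the predictions toward the defect-related sense.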
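Similarly, the multiple-choice definition test could in principle be operationalized by scoring each candidate definition under a language model. The sketch below is a hedged illustration under assumed inputs (the term, the candidate definitions, and the small GPT-style model are placeholders): it ranks candidates by their average token log-likelihood and picks the highest-scoring one.

```python
# Minimal sketch of a multiple-choice definition probe; term, candidates, and
# model are illustrative assumptions, not the project's evaluation harness.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the project studies models such as GPT-NeoX
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

term = "regression testing"
candidates = [
    "re-running existing tests after a change to detect newly introduced defects",
    "fitting a statistical model that predicts a continuous outcome variable",
]

def avg_log_likelihood(text: str) -> float:
    """Average per-token log-likelihood of `text` under the language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return -out.loss.item()  # loss is the mean negative log-likelihood per token

scores = [avg_log_likelihood(f"{term} means {c}.") for c in candidates]
best = candidates[scores.index(max(scores))]
print(f"Model prefers: {best}")
```

Comparable probes for encoder-only and encoder-decoder models would use their respective scoring mechanisms, e.g., pseudo-log-likelihoods for BERT-style models.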
DFG Programme
Research Grants