Generate-IT. LLM-Generated Texts in Italian: A Linguistic Study

Applicant Professor Dr. Anna-Maria De Cesare Greenwald

Subject Area Individual Linguistics, Historical Linguistics

Term since 2024

Project identifier Deutsche Forschungsgemeinschaft (DFG) - Project number 550431710

Project Description

The advent of large language models (LLMs), such as GPT-3.5, has revolutionized our ability to generate human-like text in a wide range of languages, including Italian. Several studies, mainly focused on English, claim or show that LLM-generated outputs are high-quality texts, in many respects comparable to and indistinguishable from human-written texts. At the same time, it has been observed that LLM-generated texts can suffer from an “algorithmic bias” and can even include patterns and structures that are English-like. English ‘fingerprints’ in Italian generated texts do not come as a surprise: in LLMs such as the GPT suite, English texts are overrepresented in the training data. In light of this situation, many research questions arise in the field of Linguistics, in particular regarding the characteristics of LLM-generated texts written in Italian (and other languages). A first set of questions concerns the influence of English on these texts: What forms can the above-mentioned ‘fingerprints’ take, how frequent are they, and how persistent are they in the outputs of LLMs trained on datasets in which English texts have different weights? A second important question is whether we are witnessing the emergence of a new language variety in the architecture of contemporary Italian, and whether this variety appears to be impoverished and simplified compared to human-authored texts due to the algorithmic bias endemic to LLMs. The aim of the DFG research project “Generate-IT. LLM-Generated Texts in Italian: A Linguistic Study” is to address these open and timely questions by describing and explaining the characteristics of LLM-generated texts in Italian. The project will also address theoretical issues, in particular the need to consider a new dimension of language variation, related to the medium used to produce language (artificial neural networks).
An important aspect to consider is the nature of the linguistic features that are relevant in this dimension of language variation. These questions will be addressed by carrying out an empirical study, based on self-assembled representative corpora of LLM-generated texts and comparable corpora of human-written texts. Overall, the DFG project aims to develop a new, dynamic, and innovative line of research in the field of (Italian) Linguistics. It will complement research on generated texts conducted in neighboring fields (specifically Computational Linguistics and Natural Language Generation), and it will pave the way for future interdisciplinary studies between linguists and LLM developers.

DFG Programme Research Grants
