Project Details
A Regular Grammar-Aware Deep Seq2seq Genome Foundation Model and Genome Annotation
Applicant
Professor Dr. Mario Stanke
Subject Area
Image and Language Processing, Computer Graphics and Visualisation, Human Computer Interaction, Ubiquitous and Wearable Computing
Bioinformatics and Theoretical Biology
Term
since 2024
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 546839540
Gene prediction is a fundamental challenge in genomics that stands to be radically advanced through dedicated foundation models that harness recent progress in self-supervised learning and long-range sequence modeling. Genes that encode proteins in eukaryotic genomes follow a particular grammar of 3-periodic coding regions interrupted by potentially very long non-coding regions. Currently, tools that enforce the grammar of such gene structures with a hidden Markov model (HMM) achieve the best performance on this task. However, these methods do not learn the HMM's parameters jointly with deep learning representations of the input. End-to-end deep learning approaches based on sequence-to-sequence (seq2seq) models such as transformers have been gaining significant ground and promise to advance the state of the art if certain challenges can be overcome. First, standard transformers are prohibitively expensive for the long input contexts that genomes require. Second, transformers are ill-suited to modeling grammars. We propose to develop a regular grammar-aware deep seq2seq model (REGRADS) that combines the representational capacity of transformer-like models with the inductive biases of conditional random fields and hidden Markov models tailored to genomics. As the transformer-like component, we will study selected seq2seq layers that, in contrast to regular attention, scale subquadratically with the input length; this includes a novel form of attention, proposed here, that scales linearly with input length. REGRADS will be pre-trained unsupervised and semi-supervised on a large corpus of genomes from species spanning the tree of life. It will serve as a versatile foundation model for gene prediction and other biologically relevant tasks, providing a resource for the wider genomics community.
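The project's novel linear-scaling attention is not specified in this summary. As a hedged illustration of how attention cost can be reduced from quadratic to linear, the sketch below shows kernel-based linear attention in the style of Katharopoulos et al. (2020); the ReLU+1 feature map is an arbitrary choice for simplicity and is not the proposal's method.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: the n-by-n score matrix makes this O(n^2)
    # in the sequence length n -- prohibitive for genome-scale inputs.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V):
    # Kernelized attention: replace exp(q . k) with phi(q) . phi(k)
    # for a positive feature map phi, then reassociate the products
    # so the sum over positions is computed once -- O(n) overall.
    phi = lambda x: np.maximum(x, 0.0) + 1.0  # ReLU + 1 (illustrative)
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                  # (d, d_v): aggregated over positions
    Z = Qf @ Kf.sum(axis=0)        # per-query normalizer
    return (Qf @ KV) / Z[:, None]
```

Because phi is positive, each output row is still a convex combination of the value rows, as in softmax attention, but the n-by-n weight matrix is never materialized.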
Specific deliverables are: 1) the REGRADS model architecture, effectively integrating long-range seq2seq layers with grammar-based structures; 2) a genome foundation model obtained through multi-genome pretraining of REGRADS; 3) demonstrated improvements on gene prediction benchmarks using the foundation model.
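The grammar of gene structures that the project builds on can be illustrated with a toy HMM whose Viterbi decoding enforces 3-periodic coding regions. All states, transitions, and emission probabilities below are invented for illustration; real gene-finder HMMs (e.g. AUGUSTUS) use far richer state spaces with introns, splice sites, and UTRs.

```python
import numpy as np

STATES = ["I", "C0", "C1", "C2"]   # intergenic + three coding phases
A = np.full((4, 4), -np.inf)       # log-transition matrix (toy values)
A[0, 0] = np.log(0.9); A[0, 1] = np.log(0.1)  # I -> I, I -> C0
A[1, 2] = 0.0                                  # C0 -> C1 (forced by grammar)
A[2, 3] = 0.0                                  # C1 -> C2 (forced by grammar)
A[3, 1] = np.log(0.7); A[3, 0] = np.log(0.3)   # C2 -> C0 or C2 -> I

ALPHABET = "ACGT"
E = np.log(np.array([              # log-emission probabilities (toy values)
    [0.25, 0.25, 0.25, 0.25],      # I:  uniform
    [0.35, 0.15, 0.35, 0.15],      # C0: slight A/G bias
    [0.15, 0.35, 0.15, 0.35],      # C1: slight C/T bias
    [0.25, 0.25, 0.25, 0.25],      # C2: uniform
]))

def viterbi(seq):
    """Most probable state path; starts in the intergenic state."""
    n, m = len(seq), len(STATES)
    obs = [ALPHABET.index(c) for c in seq]
    dp = np.full((n, m), -np.inf)
    ptr = np.zeros((n, m), dtype=int)
    dp[0, 0] = E[0, obs[0]]
    for t in range(1, n):
        for j in range(m):
            scores = dp[t - 1] + A[:, j]
            ptr[t, j] = scores.argmax()
            dp[t, j] = scores[ptr[t, j]] + E[j, obs[t]]
    path = [int(dp[-1].argmax())]
    for t in range(n - 1, 1 - 1, -1):
        path.append(int(ptr[t, path[-1]]))
    return [STATES[s] for s in reversed(path)]
```

The -inf entries in the transition matrix are what "enforce the grammar": any path violating the C0 -> C1 -> C2 cycle has log-probability -inf and can never be returned. In the proposed model, such structured decoding would be coupled with learned deep representations rather than fixed emission tables.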
DFG Programme
Research Grants
Co-Investigator
Professor Dr. Joscha Diehl