Project Details
Context-based discovery of functional motifs in low complexity regions of protein sequences
Applicant
Professor Miguel Andrade-Navarro, Ph.D.
Subject Area
Bioinformatics and Theoretical Biology
Term
from 2017 to 2021
Project identifier
Deutsche Forschungsgemeinschaft (DFG) - Project number 387883086
Low complexity regions compose more than a third of all protein sequences. These protein regions have been considered mere linkers between globular structured domains, since they lack conservation, evolve quickly and are very variable when comparing homologous proteins in closely related species. However, research from us and others is increasingly providing evidence that low complexity sequences in proteins have functions, for example holding sites for post-translational modifications with regulatory effects, or being involved in the modulation of interactions of proteins with other proteins or with DNA or RNA. These functional sites are usually identified as short Linear Motifs (LMs) of two to ten amino acids.
While protein sequence analysis works well for the identification of functional domains in proteins, detection of functional LMs in low complexity sequences is more difficult. Recognition of patterns of amino acids can be used, but the function of LMs is often determined by their context, not just by their sequence. This leads to high rates of false positives when recognizing functional LMs within low complexity regions. To solve this problem, we propose to combine pattern recognition of LMs with analysis of their context, (i) within sequences (e.g., co-occurrence with functional domains), (ii) in the cell/organism (subcellular location, interacting protein partners, tissue), and (iii) at the taxonomic level (species distribution).
Most importantly, we will also test for the avoidance of LMs in given protein contexts, an approach that was demonstrated successfully in genomic sequences: avoided motifs will reveal functional motifs that would be deleterious in the wrong molecular context. The results of our project should lead to the identification of thousands of novel putative functional LMs on proteins (even of length one), given the large amount of unexplored low complexity regions in the proteins deposited in the databases.
These LMs will be provided in a dedicated database, and the results will be integrated into the ELM resource (the major repository of motif annotations in protein sequences, based at EMBL-Heidelberg).
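The idea of combining pattern recognition with context, including the search for avoided motifs, can be illustrated with a minimal Python sketch. It is not the project's method: the function name `motif_context_ratio`, the toy pattern `[ST]P`, the sequences and the context labels are all hypothetical and chosen only for illustration. The sketch compares how often a pattern is observed in proteins sharing a context with how often it would be expected if motif and context were independent; a ratio well below 1 would hint at avoidance, a ratio well above 1 at enrichment. A real analysis would additionally need a significance test and multiple-testing correction.

```python
import re

def motif_context_ratio(proteins, pattern, context):
    """Observed/expected ratio of proteins that match `pattern` and carry `context`.

    proteins: list of (sequence, set_of_context_labels) tuples (hypothetical input format).
    Returns None when the expected count is zero.
    """
    regex = re.compile(pattern)
    n_total = len(proteins)
    has_motif = [bool(regex.search(seq)) for seq, _ in proteins]
    in_context = [context in labels for _, labels in proteins]

    n_motif = sum(has_motif)
    n_context = sum(in_context)
    n_both = sum(m and c for m, c in zip(has_motif, in_context))

    # Expected co-occurrence if motif presence and context were independent.
    expected = n_motif * n_context / n_total if n_total else 0.0
    if expected == 0:
        return None
    return n_both / expected

# Toy data, invented for illustration only (not real proteins or annotations).
toy_proteins = [
    ("MASPQQQQ", {"nucleus"}),
    ("MTPAAAAG", {"nucleus"}),
    ("MGGGSPGG", {"cytoplasm"}),
    ("MQQQQAAA", {"cytoplasm"}),
    ("MKKKEEEE", {"cytoplasm"}),
    ("MAAAGGGL", {"nucleus"}),
]

ratio_nucleus = motif_context_ratio(toy_proteins, r"[ST]P", "nucleus")
ratio_cytoplasm = motif_context_ratio(toy_proteins, r"[ST]P", "cytoplasm")
ratio_missing = motif_context_ratio(toy_proteins, r"[ST]P", "membrane")
print(ratio_nucleus, ratio_cytoplasm, ratio_missing)
```

The same counting scheme could in principle be applied to the other context levels named above (co-occurring domains, interaction partners, tissue, species) by changing what the labels represent.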
DFG Programme
Research Grants