Computational methods for mapping of regulatory elements from large-scale RNA-sequencing compendia


Prof. Dr. Julien Gagneur
Computational Biology
TU München

Project Overview

The control of gene expression, i.e. how much of a gene product is available in the cell, is essential to cell biology. However, despite decades of research on gene regulation, no computational model is able to predict gene expression levels in a given cell type from genomic sequence. Consequently, most of the variants associated with common diseases, which are non-coding, cannot be interpreted. Likewise, no genetic diagnosis can be provided for the majority of patients with rare diseases who show no obvious disease-causing coding variant. Moreover, the genome of every human being contains about one hundred de novo mutations, i.e. genetic variants that are not present in the parental genomes (Veltman & Brunner, 2012). Such de novo mutations can be extremely harmful and play a prominent role in rare and common diseases (Veltman & Brunner, 2012). Classical genetic association approaches fail for very rare genetic variants, for which little data can be collected across the population. Hence, mechanistic models that can predict the regulatory effects of de novo mutations from sequence alone are necessary.

In line with this need, our group is developing models and approaches to predict the quantitative effects of any genetic variation on gene expression, with a focus on post-transcriptional regulation. We have recently established a systematic approach combining i) genome-wide in vivo quantification of RNA metabolism rates using sequencing of metabolically labeled RNA, ii) identification of sequence elements predictive of these rates, and iii) functional validation of these elements using expression profiles of genetically distinct individuals (Eser et al, 2016). Using fission yeast as a model system, this approach recovered known DNA and RNA regulatory elements, quantified the contributions of individual bases to RNA synthesis, splicing, and degradation, and uncovered novel motifs that regulate RNA lifetime. The model was trained on a single genome, yet it correctly predicted the effects of genetic variants on gene expression in other genomes that were not part of the training data. More recently, we built a quantitative model for Saccharomyces cerevisiae, which explains about 60% of mRNA half-life variation between genes based on mRNA sequence features alone and predicts mRNA half-life on unseen data at a median relative error of 30%, i.e. close to measurement uncertainty (Cheng et al, 2017). The model integrates known functional cis-regulatory elements, identifies novel ones, and quantifies their contributions at single-nucleotide resolution.
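For readers unfamiliar with the two figures quoted above (fraction of variation explained and median relative error), the following is a minimal sketch of how such metrics are commonly computed. It is an illustration only: the function names are hypothetical, and the text does not state on which scale (linear or logarithmic half-life) the published values were obtained, so the linear scale used here is an assumption.

```python
import numpy as np

def explained_variance(y_true, y_pred):
    """Fraction of the variance of y_true accounted for by y_pred (R squared)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

def median_relative_error(y_true, y_pred):
    """Median over genes of |prediction - measurement| / |measurement|."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.median(np.abs(y_pred - y_true) / np.abs(y_true)))

# Toy values (made up for illustration, not data from the cited studies):
# measured and predicted half-lives of three genes, in minutes.
measured = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 40.0]
print(explained_variance(measured, predicted))
print(median_relative_error(measured, predicted))
```

The median is used rather than the mean so that a few badly predicted genes do not dominate the summary.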

Our lab is now moving to the analysis of human gene expression. Together with the lab of Patrick Cramer (MPI for Biophysical Chemistry, Göttingen), we have developed Transient Transcriptome Sequencing (TT-seq), a protocol that allows probing of RNA metabolism in human cells (Schwalb et al, 2016). TT-seq allows studying RNA metabolism down to the level of individual phosphodiester bonds (Wachutka & Gagneur, 2016). Another fruitful collaboration is with the group of Holger Prokisch at the Institute of Human Genetics (TUM-Med and Helmholtz-Zentrum, Munich), which works on rare metabolic diseases. Together with this group, we have developed methods to diagnose patients using RNA sequencing (Kremer et al, 2016). In parallel to the datasets we are collecting in the context of our collaborations, the generation of omics data worldwide is exploding, doubling every 7 months (Stephens et al, 2015). Large omics datasets are shared for research purposes (ENCODE, Roadmap Epigenomics, GTEx; ENCODE Project Consortium, 2012; Bernstein et al, 2010; GTEx Consortium, 2013), and further efforts aim at extending data-sharing policies and providing cloud services to allow massive analyses on them. These data include quantitative measurements of gene expression: transcriptome profiling, chromatin states, and nearly comprehensive quantification of protein levels. They will eventually cover all natural human genetic variation and hence offer the possibility to systematically decipher the human genetic regulatory code. Consequently, new scalable methodologies are required to learn predictive and mechanistic models of gene expression from very large omics datasets.

Recent studies have shown that convolutional neural networks are effective tools for modeling sequence elements (Alipanahi et al, 2015; Zhou & Troyanskaya, 2015). These methods not only allow very rich model architectures, but also leverage powerful computational frameworks with great scaling capabilities. Building on this work, we have developed a new algorithm, CONCISE (Convolutional neural Network for CIS-regulatory Elements), which we applied to model cis-regulatory elements controlling mRNA stability. The distinctive features of CONCISE are i) motif units, each consisting of a sequence filter and a smooth function modeling the relative positional effect, ii) efficient initialization of motifs using linear mixed-effect models on k-mers, and iii) a multi-task learning setting to model variation in the activity of trans-acting factors. Moreover, CONCISE handles sequences of variable length and allows for covariates. Notably, CONCISE models can be physically interpreted. Specifically, CONCISE is able to encode the cost function of the biophysical model featureREDUCE (Riley et al, 2015) by using exponential activation functions and a Huber loss function. Applied to genome-wide mRNA half-life data in S. pombe (Eser et al, 2016) and S. cerevisiae (Sun et al, 2013), CONCISE improved half-life prediction in cross-validation compared to previous methods (Cheng et al, 2017; Eser et al, 2016) (Fig. 2) and enhanced predictions of the effects of single-nucleotide variants on gene expression. CONCISE has been implemented using TensorFlow and released as an open-source Python package on GitHub and PyPI.
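To make the notion of a motif unit more concrete, the following is a minimal NumPy sketch of the idea described above: a sequence filter scanned along a one-hot encoded sequence, an exponential activation, and a smooth weight depending on relative position. It is not the CONCISE implementation and does not use its API. All function names are hypothetical, and the Gaussian-bump basis is an assumed stand-in for the smooth positional function, whose exact form is not specified in the text.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA sequence as a (length, 4) array; unknown bases stay all-zero."""
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq.upper()):
        j = ALPHABET.find(base)
        if j >= 0:
            x[i, j] = 1.0
    return x

def scan(x, filt):
    """Slide a (width, 4) filter along the sequence: one score per valid start position."""
    width = filt.shape[0]
    n = x.shape[0] - width + 1
    return np.array([np.sum(x[i:i + width] * filt) for i in range(n)])

def positional_weight(n_positions, basis_coef):
    """Smooth positional effect (assumed form): weighted Gaussian bumps over
    relative position in [0, 1], so sequences of any length share coefficients."""
    pos = np.linspace(0.0, 1.0, n_positions)
    centers = np.linspace(0.0, 1.0, len(basis_coef))
    bandwidth = 1.0 / len(basis_coef)
    basis = np.exp(-0.5 * ((pos[:, None] - centers[None, :]) / bandwidth) ** 2)
    return basis @ np.asarray(basis_coef, dtype=float)

def motif_unit(seq, filt, bias, basis_coef):
    """Filter scores -> exponential activation -> position-weighted sum."""
    scores = scan(one_hot(seq), filt) + bias
    activation = np.exp(scores)
    weights = positional_weight(len(scores), basis_coef)
    return float(np.sum(activation * weights))

def huber(residual, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    r = np.abs(residual)
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))

# Toy usage with made-up parameters: a filter that rewards the 3-mer "ACG",
# applied to two sequences of different lengths.
filt = one_hot("ACG")
coef = [1.0, 0.5, 0.2]  # hypothetical coefficients favoring the sequence start
for seq in ["ACGTACGT", "TTACGTTTTTTT"]:
    print(seq, motif_unit(seq, filt, bias=-3.0, basis_coef=coef))
```

Defining the positional weight on relative rather than absolute position is one simple way to accommodate variable-length sequences. In a trained model the filter, bias, and basis coefficients would be fitted by minimizing the loss over all genes rather than set by hand.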