Assistant Prof. for Computational Biology Julien Gagneur
Fakultät für Informatik
Technische Universität München
The CADD score
Interpreting genetic variants is of tremendous importance in human genetics: among the millions of variants any individual carries, one must prioritize those that could be at the origin of a particular genetic disorder. CADD (Combined Annotation Dependent Depletion) is a genetic variant scoring system that assesses the deleteriousness of single nucleotide variants as well as insertions and deletions, in both coding and non-coding regions. It is one of the most widely used variant scoring tools for prioritizing causal variation in both research and clinical settings. The CADD score (C-score) is obtained by fitting a logistic regression that differentiates approximately 15 million human variants that have become fixed since the split between humans and chimpanzees from approximately 15 million simulated variants. The predictive features are 60 different annotations derived from the Ensembl Variant Effect Predictor, conservation and selection scores, and epigenetic information from the ENCODE and Roadmap projects.
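The classification setup described above can be illustrated with a minimal sketch. It is not the CADD implementation: the data are synthetic, the three features are hypothetical stand-ins for real annotations, the label coding (0 for fixed, 1 for simulated variants) is an arbitrary choice made here, and plain gradient descent is used for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 3 made-up annotation features per variant.
n_per_class, n_features = 500, 3
# Label 0: variants standing in for the "fixed" class.
X_fixed = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, n_features))
# Label 1: variants standing in for the "simulated" class.
X_simulated = rng.normal(loc=1.0, scale=1.0, size=(n_per_class, n_features))

X = np.vstack([X_fixed, X_simulated])
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_regression(X, y, learning_rate=0.1, n_steps=500):
    """Fit weights and bias by full-batch gradient descent on the log-loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_steps):
        p = sigmoid(X @ w + b)            # predicted P(label = 1)
        grad_w = X.T @ (p - y) / len(y)   # gradient of mean log-loss w.r.t. w
        grad_b = np.mean(p - y)           # gradient w.r.t. the bias
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


weights, bias = fit_logistic_regression(X, y)
scores = sigmoid(X @ weights + bias)      # one score in (0, 1) per variant
accuracy = np.mean((scores > 0.5) == y)
print(f"training accuracy on toy data: {accuracy:.2f}")
```

The fitted probability plays the role of a raw score here; any rescaling of that score for reporting is outside the scope of this sketch.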
The model repository Kipoi
While public databases have been developed for easy storage of and access to genomic data, analogous repositories for computational models in genomics have been lacking. Models are implemented in various programming languages and machine learning frameworks, stored in diverse formats, and made available through different channels, such as code repositories and the supplementary material of articles. Even when reliable code is available, replicating large models trained on large datasets can be challenging. Borrowing ideas from model zoos introduced in other application domains, we have developed Kipoi (https://kipoi.org), an API and model repository for genomics. The Kipoi repository currently contains over 2,000 trained models that cover canonical prediction tasks in transcriptional and post-transcriptional gene regulation. The Kipoi model standard enables automated software installation and provides unified interfaces to apply and interpret models. In particular, we have developed a generic plugin that leverages Kipoi to perform variant effect prediction for any Kipoi model that takes DNA sequence as input.
- Features and model complexity. The model underlying CADD is a logistic regression, one of the simplest types of classifier. More complex machine learning models might improve performance. The set of features is also limited. With Kipoi as a backend, we can extend the set of CADD features to include the predictions of machine learning models for regulatory genomics.
- Scalability. The CADD score is based on a logistic regression. We plan to fit CADD with the Adam optimizer (see below). Assuming that CPU time increases linearly with the number of samples and features (Adam is a gradient-based method), and assuming 30 million variants and 5,000 unique features (i.e. current CADD features plus Kipoi features), fitting a single model on the entire data set would take about 25,000 CPU hours. Training such a model on a typical 24-core server would therefore take more than 40 days, even if all data could be kept in memory. Moreover, storing 30 million observations with 5,000 features each in single precision (i.e. four bytes per value) would require about 600 GB of RAM, and double precision would require twice as much, exceeding the typical RAM capacity of a server. Hence, developing a distributed learning scheme for CADD scores and their future extensions is essential.
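The resource estimates in the scalability item can be checked with a few lines of arithmetic. The sketch below takes the 25,000 CPU hours, 30 million variants, 5,000 features, and 24 cores from the text as given; perfect parallel scaling across cores and decimal gigabytes (1 GB = 10^9 bytes) are simplifying assumptions made here.

```python
# Figures taken from the text.
n_variants = 30_000_000
n_features = 5_000
total_cpu_hours = 25_000
n_cores = 24

# Assumption: a dense feature matrix with one value per (variant, feature).
n_values = n_variants * n_features

# Memory footprint in decimal gigabytes (1 GB = 1e9 bytes).
memory_gb_single = n_values * 4 / 1e9   # four bytes per value
memory_gb_double = n_values * 8 / 1e9   # eight bytes per value

# Assumption: perfect scaling across all cores of one server.
wall_clock_days = total_cpu_hours / n_cores / 24

print(f"values in feature matrix: {n_values:.2e}")
print(f"memory, 4 bytes per value: {memory_gb_single:.0f} GB")
print(f"memory, 8 bytes per value: {memory_gb_double:.0f} GB")
print(f"wall-clock time on {n_cores} cores: {wall_clock_days:.1f} days")
```

With these inputs the matrix holds 1.5 x 10^11 values, which corresponds to 600 GB at four bytes per value and 1,200 GB at eight bytes per value, and the training time comes to a little over 43 days.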