About 97% of all mutations in cancer occur in non-coding regions of the genome. These regions are packed with cis-regulatory elements (CREs) such as promoters and enhancers. Non-coding mutations within these CREs can alter the expression of cancer-relevant genes; mutations that do so are classified as non-coding driver mutations. Although anecdotal examples of such driver mutations exist, genome-wide statistical methods aimed at identifying them have yielded only a limited number of discoveries. This is partly because non-coding mutations are inherently more difficult to interpret than coding mutations. The goal of the PERICODE consortium (consisting of labs from NKI, UMCU, Groningen and A-UMC) is to develop a computational algorithm that can predict the impact of non-coding mutations from DNA sequence alone.
Towards this goal, we combine a functional high-throughput assay with a residual neural network to predict and study the effects of non-coding cancer mutations on gene expression. We use a massively parallel reporter assay (MPRA) in which millions of unique DNA fragments of ~300 bp can be tested for their promoter activity. Remarkably, the neural network trained on these MPRA data accurately predicts promoter activity. Because the model can predict promoter activity for any sequence of up to 600 bp, it can predict the effect of any non-coding mutation in the genome on local promoter activity by comparing the predictions for the reference and the mutated sequence. Initial benchmarking of the model against available mutagenesis data indicates that it predicts such effects reasonably well (R of approximately 0.5). We are implementing an integrated wet-lab / dry-lab strategy to further improve and rigorously characterize this performance, and to apply the methodology to a broad diversity of tumor types. This should provide fundamental insights into the overall impact of non-coding mutations in cancer, and enable the discovery of novel variants relevant to specific cancers.
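
As an illustration of how a sequence-based model can score a mutation, the minimal Python sketch below computes the difference between the predicted activity of the mutated and the reference sequence, and correlates predicted with measured effects. It is not the consortium's implementation: `toy_predict` is a hypothetical placeholder (GC fraction) standing in for the trained network, the function names are invented for this example, and the use of the Pearson coefficient for R is an assumption.

```python
import math
from typing import Callable, Sequence


def apply_snv(seq: str, pos: int, alt: str) -> str:
    """Return seq with the base at 0-based position pos replaced by alt."""
    if not 0 <= pos < len(seq):
        raise IndexError("position lies outside the sequence")
    if len(alt) != 1 or alt not in "ACGT":
        raise ValueError("alt must be a single base: A, C, G or T")
    return seq[:pos] + alt + seq[pos + 1:]


def mutation_effect(predict: Callable[[str], float],
                    seq: str, pos: int, alt: str) -> float:
    """Predicted activity of the mutated sequence minus that of the reference."""
    return predict(apply_snv(seq, pos, alt)) - predict(seq)


def toy_predict(seq: str) -> float:
    """Hypothetical stand-in for the trained model: GC fraction of the sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between predicted and measured effects (assumed metric)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least two values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    reference = "ATATGCGC"  # made-up sequence, for illustration only
    print(mutation_effect(toy_predict, reference, 0, "G"))
```

In practice `predict` would wrap the trained network, and the sequence passed to it would be a window around the mutation no longer than the model's maximum input length.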