Decoding how gene expression is regulated is critical to understanding disease. The cell decodes regulatory DNA through a process termed “cis-regulatory logic”, in which proteins called transcription factors (TFs) bind to specific DNA sequences in the genome and act together to set the expression level of adjacent downstream genes. This process is exceedingly complex to model because a large number of parameters are needed to describe it fully (see Rationale; de Boer et al. 2020; Zeitlinger J. 2020).
Understanding cis-regulatory logic in the human genome is an important goal that would provide insight into the origins of many diseases. However, learning models from human data is challenging because of the limited diversity of sequences in the human genome (e.g. extensive repetitive DNA), the vast number of cell types that differ in how they interpret regulatory DNA, limited reporter assay data, and the substantial technical biases present in many omic methods. To overcome these issues, we recently generated high-throughput measurements of the cis-regulatory activity of millions of randomly generated promoters in the single-celled organism yeast (de Boer et al. 2020). Here, the expression level driven by each promoter sequence is measured via a fluorescent reporter gene regulated by that promoter (Sharon et al. 2012). The set of randomly generated promoter sequences is so large that it rivals the complexity of the entire human genome, which gives us unprecedented power to learn the many parameters required to understand gene regulation (see Rationale). Because human and yeast cis-regulatory logic follow similar principles, we hope that the model architectures learned on yeast data can inform how to build models for the human genome.
In this competition, participants will be given expression measurements for millions of randomly generated promoter sequences and will use them to train machine learning models that predict gene expression from sequence. Participants will also be provided with TPU Research Cloud resources to help train their models.
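As a rough illustration of the prediction task, the sketch below shows one common way to turn a promoter sequence into numeric model input (one-hot encoding) and pair it with its measured expression. It is a minimal sketch, not the competition's prescribed pipeline: the function name `one_hot_encode`, the toy sequences, the expression values, and the choice to encode unknown bases as all-zero rows are all assumptions made for illustration.

```python
from typing import List

# Column order of the encoding (an assumption; any fixed order works).
BASES = "ACGT"


def one_hot_encode(seq: str) -> List[List[float]]:
    """Encode a DNA string as a (length x 4) matrix with columns A, C, G, T.

    Any character other than A/C/G/T (for example N) becomes an all-zero row.
    """
    index = {base: i for i, base in enumerate(BASES)}
    encoded = []
    for ch in seq.upper():
        row = [0.0] * len(BASES)
        if ch in index:
            row[index[ch]] = 1.0
        encoded.append(row)
    return encoded


# Hypothetical toy records: (promoter sequence, measured expression).
toy_data = [("ACGTN", 11.0), ("TTGCA", 7.5)]

# Model inputs (one matrix per sequence) and regression targets.
features = [one_hot_encode(seq) for seq, _ in toy_data]
targets = [expr for _, expr in toy_data]
```

A regression model would then be fit to map each encoded matrix in `features` to the corresponding value in `targets`; the model architecture itself is left open here.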