Genomics England uses MongoDB to power the data science behind the 100,000 Genomes Project
Genomics England uses the MongoDB data platform to power the data science that makes the 100,000 Genomes Project possible. Our partnership with MongoDB has reduced the processing time for complex queries from hours to milliseconds, so scientists can discover new insights more quickly.
Genomics England, working with the NHS, is sequencing 100,000 genomes from patients with rare diseases and their families, as well as patients with common cancers. On average, 1,000 genomes are sequenced per week, which amounts to around 10 terabytes of data per day. To manage this immense and sensitive data set, Genomics England uses MongoDB Enterprise Advanced.
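As a rough sanity check on these figures, the short Python sketch below converts the quoted rates into an approximate data volume per genome. It assumes, purely for illustration, that the daily figure holds seven days a week, that all of it is attributable to that week's genomes, and that 1 TB equals 1,000 GB. None of these assumptions comes from Genomics England.

```python
# Back-of-envelope check of the figures quoted above.
# Assumptions (not from the source): 7 days of data per week,
# all data attributable to that week's genomes, 1 TB = 1,000 GB.
GENOMES_PER_WEEK = 1_000   # quoted average
TB_PER_DAY = 10            # quoted approximate daily volume
DAYS_PER_WEEK = 7          # assumption
GB_PER_TB = 1_000          # decimal units, assumption

tb_per_week = TB_PER_DAY * DAYS_PER_WEEK
gb_per_genome = tb_per_week * GB_PER_TB / GENOMES_PER_WEEK

print(f"{tb_per_week} TB per week, about {gb_per_genome:.0f} GB per genome")
```

Under those assumptions the quoted rates work out to roughly 70 GB of data per genome.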
Augusto Rendon, Director of Bioinformatics at Genomics England, said:
MongoDB Enterprise Advanced satisfied these requirements and has been providing Genomics England with data flexibility, performance at scale and security since the project started in 2013.
Ignacio Medina, Head of Computational Biology Lab HPC Service, University of Cambridge, and Head of Bioinformatics Databases at Genomics England, has been building many of the applications that sit on top of MongoDB. He said:
Two important projects that also use MongoDB are CellBase and OpenCGA (Computational Genomics Analysis). CellBase is a data warehouse and open API that stores reference genomic data from public resources such as Ensembl, ClinVar, and UniProt. By relying on MongoDB, CellBase typically runs sophisticated queries in 40 milliseconds or less on average, and complex aggregations in less than one second, down from six hours with the previous filesystem-based querying and storage. Importantly, it can annotate about 20,000 variants per second, which makes it compatible with the throughput requirements of whole genome sequencing data, while also returning a rich set of annotations that helps scientists better understand the data.
OpenCGA aims to provide researchers and clinicians with a high-performance solution for genomic big data processing and analysis, and the platform includes detailed information on genomic material. This means OpenCGA can process highly complex queries based on a huge variety of variables. By using MongoDB, OpenCGA enables researchers to query data in a wide variety of ways using MongoDB’s secondary indexes: compound indexes to query data across related attributes, text search and facets to efficiently navigate and explore data sets, and sparse indexes to access highly variable data structures.
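To make the index types mentioned above more concrete, here is a minimal Python sketch that builds a document in the shape of MongoDB's `createIndexes` database command. It is an illustration only, not OpenCGA's actual schema: the collection name `variants`, the field names, the index names, and the helper function are all hypothetical. Facets are computed at query time rather than defined as an index, so they are not shown.

```python
import json


def build_create_indexes_command(collection, indexes):
    """Return a document shaped like MongoDB's createIndexes command.

    Hypothetical helper for illustration; it only builds the document
    and does not connect to a database.
    """
    return {"createIndexes": collection, "indexes": indexes}


# All field and index names below are invented for this example.
variant_indexes = [
    # Compound index: query across related attributes (key order matters).
    {"name": "chrom_start", "key": {"chromosome": 1, "start": 1}},
    # Text index: keyword search over a free-text field.
    {"name": "annotation_text", "key": {"annotation": "text"}},
    # Sparse index: only documents that contain the field are indexed.
    {
        "name": "clinical_significance_sparse",
        "key": {"clinicalSignificance": 1},
        "sparse": True,
    },
]

command = build_create_indexes_command("variants", variant_indexes)

# With a driver such as PyMongo, a document like this could be sent to
# a server with db.command(command). Here it is only printed.
print(json.dumps(command, indent=2))
```

Building the command as plain data keeps the example runnable without a database server. In practice, a driver's own index-creation helpers would usually be the more convenient route.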
Dev Ittycheria, President and CEO, MongoDB, concluded:
Find out more about MongoDB on their website.