Genomics England is using data platform MongoDB to power the data science that makes the 100,000 Genomes Project possible. Our partnership with MongoDB allows the processing time for complex queries to be reduced from hours to milliseconds, which means scientists can discover new insights more quickly.
Genomics England, working with the NHS, is sequencing 100,000 genomes from patients with rare diseases and their families, as well as patients with common cancer. On average, 1,000 genomes are sequenced per week, which amounts to around 10 terabytes of data per day. To manage this immense and sensitive data set, Genomics England uses MongoDB Enterprise Advanced.
Managing clinical and genomic data at this scale and complexity has presented interesting challenges. That’s why adopting MongoDB has been vital to getting the 100,000 Genomes Project off the ground. It has provided us with great flexibility to store and analyse these complex data sets together. This will ultimately help us to realise the benefits of the Project – delivering better diagnostic approaches for patients and new discoveries for the research community.
Director of Bioinformatics at Genomics England
MongoDB Enterprise Advanced satisfied these requirements and has been providing Genomics England with data flexibility, performance at scale and security since the project started in 2013.
Ignacio Medina, has been building many of the applications that sit on top of MongoDB. He said:
MongoDB is performing beautifully for us. From the beginning of the project it’s been fantastic for our developers to build and iterate quickly. Now that the 100,000 Genomes Project is running at scale, MongoDB is also helping us extend that great experience on to the scientists and clinicians who access the data, making it easier and faster for them to find critical insights in the data.
Head of Computational Biology Lab HPC Service, University of Cambridge, and Head of Bioinformatics Databases at Genomics England
Two of the important projects also utilising MongoDB are Cellbase and OpenCGA (Computational Genomics Analysis). Cellbase is a data warehouse and open API that stores reference genomic data from public resources such as Ensembl, Clinvar, and Uniprot. By relying on MongoDB, Cellbase can typically run sophisticated queries in an average of 40 milliseconds or less, and complex aggregations in less than one second – down from six hours using previous filesystem-based querying and storage. Importantly, it can annotate about 20,000 variants per second, making it compatible with whole genome sequencing data throughput requirements, while also returning a rich set of annotations that helps scientists better understand the data.
OpenCGA aims to provide researchers and clinicians with a high-performance solution for genomic big data processing and analysis, and the platform includes detailed information on genomic material. This means OpenCGA has the ability to process incredibly complex queries based on a huge variety of variables. By using MongoDB, OpenCGA enables researchers to query data in a wide variety of ways, using MongoDB’s secondary indexes – from compound indexes to query data across related attributes, text search facets to efficiently navigate and explore data sets, and sparse indexes to access highly variable data structures.
The 100,000 Genomes Project hits home for me in a very personal way as I recently lost my mother to cancer. I am extremely grateful that so many brilliant people are dedicating their time and energy to this important project. We are honoured that MongoDB is playing an essential role as the underlying data platform to produce data science that is likely to change the lives of millions of people, including someone we may personally know, for the better. This is the kind of project that inspires us to do our best work every day.
President and CEO, MongoDB