The dataset

The 100,000 Genomes Project dataset will be made available to GeCIP researchers and trainees free of charge. Access is via a secure analysis environment hosted within the Genomics England datacentre – the Genomics England ‘Research Environment’. Analytical tools and applications are available within the Research Environment.

The dataset includes de-identified, linked information for each participant:

  • Genome sequence data
  • Variant call files
  • Phenotype/clinical data
  • Hospital Episode Statistics (HES) data

How to access the dataset

To view and work with the data, GeCIP members must first ensure that their institution has signed a Participation Agreement (opens as PDF).

This contract between Genomics England and your institution sets out the institution's obligations and responsibilities with respect to your participation in the Project.

Your institution may have already signed this agreement if others from your institution are already part of the Project. If your institution hasn’t signed, you will be contacted with the necessary information and documentation. Your institution will be asked to confirm your identity before your login details are sent to you by email. View Participation Agreement FAQs for more information.

Viewing the dataset

Access to the 100,000 Genomes dataset is via a remote desktop hosted by the Genomics England datacentre – the Research Environment. After you log in, a desktop opens in a window of your internet browser. It looks much like a normal Linux desktop. The Research Environment is preloaded with tools and applications, plus genomes and associated clinical data.

Data security

All data access is through the Research Environment only. No sequencing or clinical data will be made available for download.

To preserve data security, you cannot copy or paste into or out of the Research Environment, and there is limited internet access within it. Files are moved into and out of the Research Environment via an ‘airlock’ system.

Computing power

Each GeCIP member will have a finite number of central processing unit (CPU) hours per month. The exact allocation is yet to be defined, and may change as the project and datacentre mature. Use of CPU beyond this allocation will require payment.
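
As a rough illustration of how a monthly CPU budget might be tracked, the sketch below uses the common convention that CPU hours equal the number of cores multiplied by wall-clock hours. How Genomics England actually meters usage is not stated here, and the allocation figure, job list, and function names are all hypothetical.

```python
def cpu_hours(cores: int, wall_clock_hours: float) -> float:
    """CPU hours for one job, assuming cores x wall-clock hours (an assumed convention)."""
    if cores < 1 or wall_clock_hours < 0:
        raise ValueError("cores must be >= 1 and wall_clock_hours must be >= 0")
    return cores * wall_clock_hours


def remaining_allocation(monthly_allocation: float,
                         jobs: list[tuple[int, float]]) -> float:
    """CPU hours left this month after the given (cores, wall_clock_hours) jobs."""
    used = sum(cpu_hours(cores, hours) for cores, hours in jobs)
    return monthly_allocation - used


# Hypothetical figures: the real allocation is yet to be defined.
HYPOTHETICAL_MONTHLY_ALLOCATION = 1000.0
jobs_this_month = [(8, 12.0), (16, 2.5)]  # (cores, wall-clock hours) per job

remaining = remaining_allocation(HYPOTHETICAL_MONTHLY_ALLOCATION, jobs_this_month)
if remaining < 0:
    print(f"Over allocation by {-remaining:.1f} CPU hours")
else:
    print(f"{remaining:.1f} CPU hours remaining this month")
```

A negative result would indicate usage beyond the allocation, which is the point at which payment would be required.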