Data types and storage in the 100,000 Genomes Project

Clinical, laboratory and health data flows from a number of NHS, Social Care and research organisations to Genomics England.

First, data flows from the hospital or clinic that recruited the participant. This data is specific to their participation in the 100,000 Genomes Project.

Additional data is collected, stored and analysed. This comes from medical notes and health records held by a participant’s GP, hospital and other sources such as national disease registries. We use this information together with donated samples for scientific or medical purposes, including research into medical conditions. We also collect data from:

• NHS Digital – the national provider of information, data and IT systems for health and social care in England.
• Public Health England.

More details on these are below.

Why do we collect health data?

To analyse genomes properly, we need as much detail as possible about a participant’s medical condition and symptoms. To get this data, we’ll need to send some details about participants (for example, NHS number and date of birth) to the organisations holding this information. This will allow them to find the health data they hold. This data might include test results, scans, medicines being taken, the age at which a person developed particular symptoms and so on. We call this ‘health data’. We carry on collecting it so we can monitor how a condition progresses with time.

This information is important because even small differences in symptoms between individuals might be crucial in finding the relevant change in a genome and in helping to decide the best treatments. This health data comes from the specialist teams and hospitals that are treating the participant.

We also collect as much other general medical data as possible from a participant’s medical records, covering the whole of their life. We do this for two reasons.

First, it’s critically important to find out how things turn out for participants’ health. For instance, did they get better and later die of something else, or did they get much worse very quickly? Knowing this will help us find the gene changes that are associated with poor outcomes, so that we can predict which people need the most powerful treatments.

Second, we collect all sorts of data, even things that at first glance might not seem relevant to a health condition. This is because we don’t yet know what is important. For instance, we collect details about birth and childhood illnesses because these might, or might not, have an influence on a condition. While some information we collect may not be relevant for one individual, it might be very important in other people’s conditions. For instance, we collect information about mental health and disability, which are important symptoms of many of the rare conditions we cover.

Where is the data from?

We get data from many places. The main ones are:

NHS Digital

Hospital Episode Statistics (HES). These come from all NHS trusts in England, including acute hospitals, primary care trusts and mental health trusts. This data is collected during a patient’s time at hospital. It includes details of diagnosis, treatment received and other details about the patient. HES information is stored as a large collection of separate records – one for each period of care.

Patient Reported Outcome Measures (PROMs). This data typically measures health gain in patients undergoing hip or knee replacement, but it is also used to measure a patient’s health or health-related quality of life at a single point in time. This dataset lets us see whether there is any relationship between a gene variant and how well patients do after treatment or surgery.

Mental Health and Learning Disability Set (MHLDS). This is used to analyse patient pathways and to understand in depth how mental health service users interact with acute secondary care. For the project it is a particularly important dataset, as many participants with rare conditions have learning disabilities. It also helps us see whether mental health problems such as depression are an unrecognised symptom of a genetic variant.

Diagnostic Imaging Dataset (DID). This is a central collection of detailed information held by NHS Digital about diagnostic imaging tests carried out on patients, such as X-rays and MRI scans. This data can provide insights into whether particular gene variants are associated with a particular tumour or condition.

Mortality Data. Cause of death is a crucial piece of information. If a patient dies of a different illness, or in an accident, we need to record this, as otherwise it will skew the data. If people die earlier or live longer than expected, this will help us pick out the variants associated with those outcomes. Mortality data will also reveal whether people with one condition die disproportionately from another, such as heart attacks. Approval to link this data with Office for National Statistics (ONS) data, which records deaths, must be granted by both NHS Digital and the ONS.

Public Health England

Participants in the 100,000 Genomes Project cancer programme give their explicit consent to allow their patient records to be linked to the data collected by the National Cancer Registration Service in Public Health England. More information on the data collected by the National Cancer Registration Service and how it is used is available at There are also details of how to opt out of the National Cancer Registration Service at

How is the data stored?

Our research data is de-identified for each and every participant. Their name, date of birth and all other personal details are stripped away. Each person’s data is assigned a unique code which allows the project data team to track it and keep it safe and to re-identify it if we need to return findings to someone at a much later date. The researchers who look at the data can only see de-identified data.
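For readers who want to see the idea in concrete terms, the short Python sketch below illustrates one way de-identification with a re-identification code could work. It is only an illustration under stated assumptions, not a description of the actual Genomics England systems: the field names, the `DeidentificationStore` class and the random code format are all hypothetical.

```python
import secrets
from dataclasses import dataclass, field

# Hypothetical field names; the real data schema is not described here.
IDENTIFYING_FIELDS = {"name", "date_of_birth", "nhs_number", "address"}


@dataclass
class DeidentificationStore:
    """Keeps the code-to-identity lookup separate from research data."""

    _lookup: dict = field(default_factory=dict)

    def deidentify(self, record: dict) -> dict:
        """Strip personal details and attach a unique participant code."""
        code = secrets.token_hex(8)
        while code in self._lookup:  # regenerate in the unlikely case of a clash
            code = secrets.token_hex(8)

        # Personal details stay in the lookup held by the data team.
        self._lookup[code] = {
            k: v for k, v in record.items() if k in IDENTIFYING_FIELDS
        }

        # Everything else goes to researchers, tagged only with the code.
        research = {
            k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS
        }
        research["participant_code"] = code
        return research

    def reidentify(self, code: str) -> dict:
        """Recover personal details, for example to return a finding."""
        return self._lookup[code]
```

In this sketch, researchers would only ever receive the dictionary returned by `deidentify`, while the lookup used by `reidentify` would stay with the project data team.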

The de-identified data is moved to the data centre. Approved researchers can access this data for their work.

Diagram showing an outline of the data storage and analysis systems for Genomics England
