What is de-identified data?
In this explainer episode, we’ve asked Georgia Chan, Senior Data Wrangler at Genomics England, to explain what de-identified data is.
You can also find a series of short videos explaining some of the common terms you might encounter about genomics on our YouTube channel.
If you’ve got any questions, or have any other topics you’d like us to explain, let us know on [email protected].
You can download the transcript or read it below.
Florence: What do we mean by de-identified data?
My name is Florence Cornish, and today I'm here with Georgia Chan. Georgia is Senior Data Wrangler here at Genomics England, which just means that she cleans up and adds structure to complicated data so that it becomes usable, and she is going to be telling us much more about the topic of de-identified data.
Georgia, I think it would be a good place to start by talking about the National Genomic Research Library, which is the library that we at Genomics England store data in. So maybe you could explain more about that and what kind of data is in there.
Georgia: Sure. Thanks Florence. So, we have genomic data.
Genomic data is information that comes from a person's DNA. It helps us understand how the body works and why disease happens. This can include whole genome sequencing data, variants found in genes, small differences that make each of us unique, and information about how genes function or how they differ between people.
Genomic data does not include a person's name or who they are. It's biological information, not identity, and it's used to understand health and disease. It's really important to note that, by its nature, genomic information is incredibly rich. We all have millions of common genetic variants, but your whole genome is unique to you. So although genomic data alone can't directly identify you, it still counts as personal data under data protection law.
We also have clinical data. Clinical data provides real-world context for the genomic data. It shows what's happening in someone's health. This can include diagnosis of a disease or a symptom, treatments that have been received, and health outcomes over time, such as remission or progression. This clinical data helps researchers see how genetic differences relate to symptoms, treatment response, and long-term outcomes.
So, we have both of these kinds of data. Genomic data on its own can be hard to interpret, and clinical data on its own only tells part of the story. Together, they allow researchers to better understand how diseases develop, help them discover new or more targeted treatments, and help them improve diagnosis, care, and outcomes.
And this is why both of these types of data are used together in the National Genomic Research Library.
Florence: And so, both of these data types, both clinical and genomic, we say that they are de-identified. But what exactly does that mean?
Georgia: Yes, good question. De-identified data means that information which directly identifies a person has been changed or removed from a health record before researchers can access it.
And in practice, it means that researchers cannot see who the person is. The data cannot be used to contact individuals, and a person's identity is protected by design, which means that necessary safeguards are embedded into every stage of a service or process. So, researchers work with the data, but not with people's identities.
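As a rough illustration of the idea Georgia describes, here is a minimal Python sketch of de-identification. It is not Genomics England's actual process: the field names, the `DIRECT_IDENTIFIERS` set, the `de_identify` function, and the `P-` ID format are all assumptions made up for this example. It removes directly identifying fields from a record, replaces them with a random participant ID, and keeps the link in a separate table that researchers would never see.

```python
import secrets

# Hypothetical field names, chosen only for illustration.
DIRECT_IDENTIFIERS = {"name", "nhs_number", "address", "date_of_birth"}


def de_identify(record, link_table):
    """Return a researcher-facing copy of `record` with direct identifiers removed.

    The removed identifiers are stored in `link_table` under a random
    participant ID, so that only whoever holds the table can relink them.
    """
    participant_id = "P-" + secrets.token_hex(8)  # random, carries no meaning
    link_table[participant_id] = {
        key: value for key, value in record.items() if key in DIRECT_IDENTIFIERS
    }
    research_view = {
        key: value for key, value in record.items() if key not in DIRECT_IDENTIFIERS
    }
    research_view["participant_id"] = participant_id
    return research_view


link_table = {}  # held separately from the research data
record = {
    "name": "Example Person",
    "nhs_number": "000 000 0000",
    "diagnosis": "rare condition",
    "treatment": "drug A",
}
research_view = de_identify(record, link_table)
```

In this sketch, `research_view` contains the diagnosis and treatment plus the random ID, while the name and NHS number exist only in `link_table`.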
Florence: Could you tell me a little bit more about why it's so important to de-identify data in this way?
Georgia: Sure. De-identification creates a safe middle ground. It means that data can be used to improve healthcare whilst people's privacy and trust are respected. Without de-identification, every new research question would require individual contact, and large-scale, long-term research would be extremely difficult.
With de-identification, we reduce the risk of someone being identified. We prevent inappropriate use of data, and we ensure that data is used only for approved research.
And it's important to note also that it sits alongside a list of other safeguards that help ensure data is used responsibly, such as a secure Research Environment, strict access controls, and independent ethical and governance approvals. And all of those safeguards are provided in Genomics England's Research Environment.
Florence: I think a common question that people might have, or a question that I definitely had when I first heard the term, is how de-identified data is different from anonymous data.
Georgia: Yes, it is a good question. So, anonymised data cannot be linked back to an individual and is no longer considered personal data, whereas de-identified data has identifiers hidden from researchers, but it can still be relinked by a trusted, authorised organisation if needed.
So, in healthcare research, de-identification is often preferred because it allows long-term follow up. It also allows updates as new health information becomes available, and also allows corrections or withdrawals when they occur and when they're appropriate.
Florence: So say a researcher did find something in the data that they might want to feed back, how can we re-identify that participant? What does that process look like?
Georgia: Researchers cannot re-identify participants themselves. At Genomics England, if researchers do make a new discovery that could help an individual, for example, a possible diagnosis for a rare condition, we have an in-house clinical team who can link back to that individual's details and work with their NHS clinicians to establish if this new insight can be fed back.
So if something clinically important is discovered, it is reported through a formal governance process, and then a trusted, authorised team, not the researchers, re-identifies the participant. This ensures that researchers never know who the participant is and individuals remain protected, whilst important findings can still benefit patients. And this would only happen when it's ethically approved and clinically appropriate.
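The relinking step Georgia describes can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not a description of any real system: the `TrustedLinkService` class, the `"clinical_team"` role name, and the `ethically_approved` flag are hypothetical stand-ins for the formal governance process. The point it shows is that the link between a participant ID and an identity sits behind a check that researchers cannot pass.

```python
class TrustedLinkService:
    """Hypothetical holder of the link between participant IDs and identities."""

    def __init__(self):
        self._links = {}  # participant ID -> identifying details

    def register(self, participant_id, identity):
        self._links[participant_id] = identity

    def relink(self, participant_id, requester_role, ethically_approved):
        # Assumption: only an in-house clinical team may relink, and only
        # after approval. Real governance is far richer than two checks.
        if requester_role != "clinical_team" or not ethically_approved:
            raise PermissionError("re-identification not permitted")
        return self._links[participant_id]


service = TrustedLinkService()
service.register("P-0001", {"name": "Example Person"})

# An approved clinical team request succeeds.
identity = service.relink("P-0001", "clinical_team", ethically_approved=True)

# A researcher's request is refused, even with approval flagged.
try:
    service.relink("P-0001", "researcher", ethically_approved=True)
    researcher_blocked = False
except PermissionError:
    researcher_blocked = True
```

Here the researcher only ever handles the ID `P-0001`; the identity is returned solely to the role that is allowed to act on it.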
Florence: Great. Well, I think we'll finish there. Thank you so much, Georgia, for taking the time to talk us through the meaning of de-identified data and why it is so important to protect participants.
Georgia: Thank you, Flo. And let's remember that de-identified data isn't about hiding information. It's about using it responsibly.
Florence: Absolutely. If you want to hear more explainer episodes like this, you can find them on our website at www.genomicsengland.co.uk or wherever you get your podcasts. Thank you for listening.