
Diversity and genetic ancestry effects in the Cancer Cohort of the 100,000 Genomes Project

By T. Nguyen, Karoline Kuchenbaecker, Matt Silver, Sam Tallman, Yoonsu Cho, Alona Sosinsky, John Ambrose, Loukas Moutsianas, Maxine Mackintosh


As part of an ongoing initiative to improve health equity in personalised patient care in England, we are reviewing the 100,000 Genomes Project for potential biases between groups with different ancestries.

The 100,000 Genomes Project was launched in 2013 by Genomics England in close partnership with the NHS. It aimed to study the potential benefits of whole genome sequencing for patients with cancer and rare diseases and to provide a resource for research into these conditions. Using sequencing, clinicians can identify genetic variants in the genome which may provide therapeutic and diagnostic insights. The 100,000 Genomes Project developed bioinformatics pipelines to automate the detection and prioritisation of genetic variants for use in clinical care.

A previous blog explored the diagnosis of rare diseases in the context of genetic ancestry. Here, we present results from the 100,000 Genomes Project Cancer Cohort. In the first part, we explore how representative the Cancer Cohort is in terms of cancer rates for different ethnicities in England. In the second part, we investigate the impact of patient ancestry on the prioritisation of genetic variants in the cancer bioinformatics pipeline.

Key findings

In this analysis, we found that ethnicity in the Cancer Cohort is largely representative of England when accounting for different cancer rates across ethnic groups. However, small sample sizes in some cancer types limited our ability to detect differences. Exceptions are breast and prostate cancers, where Black (both cancer types) and Asian (breast cancer only) ethnicities were recruited in higher proportions.

We found that more inherited (germline) genetic variants were prioritised as potentially relevant to cancer for individuals of non-European ancestry compared to those of European ancestry. On the other hand, fewer actionable tumour mutations were prioritised for individuals of non-European ancestry.

This highlights that further genomic research in people of diverse ancestries is key to ensuring equitable outcomes from the Genomics England cancer pipeline. Details on these findings are presented below.

How representative is the 100,000 Genomes Project Cancer Cohort?

To assess how well the Cancer Cohort represents the ethnic composition of cancer cases in England, we compared it with recent cancer incidence rates from Public Health England (PHE) data published by Delon et al. We used ethnicity for these comparisons since PHE releases cancer incidence by ethnicity. Although ancestry and ethnicity are often used interchangeably, genetic ancestry refers to genetic history and can only be inferred through genetic analysis, whereas ethnicity typically refers to an individual’s cultural identity and is self-reported.

The Delon et al. study found differences in cancer rates for multiple cancers between Black and White, and Asian and White ethnicities (Figure 1A & Figure 1B, pink data points). For example, lung cancer is less frequent in Black women than in White women, as indicated by the pink point falling below a ratio of 1 in Figure 1A. Prostate cancer is more common in Black men than White men with the pink point above 1. Incidence rates for many cancers are lower in people of self-reported Asian ethnicity (Figure 1B, pink data points).

Figure 1. Cancer incidence rate ratios in England vs 100,000 Genomes Project. A. Cancer incidence rate ratios of Black to White ethnicity. B. Cancer incidence rate ratios of Asian to White ethnicity. Pink dots indicate incidence rate ratios (IRR) for England from PHE data as reported by Delon et al. Blue dots are the ratios for the 100,000 Genomes Project Cancer Cohort. Error bars show 95% confidence intervals for the age-adjusted IRR. *** indicates a statistically significant difference between PHE and 100,000 Genomes Project Cancer Cohort IRR at alpha=0.05 after correction for multiple tests across cancer types. Categories with low counts not displayed.

What does that mean for the Genomics England Cancer Cohort? For it to be representative, we would not expect it to have equal ethnicity proportions across cancers. Instead, the pattern should follow the national incidence rates.

In most cases, we find no evidence that ethnicity in the Cancer Cohort is different from what we would expect based on the national statistics. However, we do find a higher proportion of Black vs White men with prostate cancer compared to the PHE average, and of both Black and Asian women with breast cancer (Figure 1A & 1B, blue data points). Note that minority ethnicities are present in low numbers for many cancer types (Figure 2B), which reduces the statistical power to detect differences.
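To make the ratios in Figure 1 concrete, here is a minimal sketch of how an incidence rate ratio and its 95% confidence interval can be computed from case counts. It is a crude (not age-adjusted) estimate with a standard log-scale Wald interval, so it is simpler than the age-adjusted IRRs reported above. The function name and all counts are hypothetical and are not taken from PHE or the Cancer Cohort.

```python
import math


def incidence_rate_ratio(cases_a, person_years_a, cases_b, person_years_b, z=1.96):
    """Return (irr, lower, upper) for group A relative to group B.

    Crude estimate only: no age adjustment is applied. The interval is a
    Wald interval computed on the log scale.
    """
    if min(cases_a, cases_b) <= 0:
        raise ValueError("both groups need at least one case")
    irr = (cases_a / person_years_a) / (cases_b / person_years_b)
    se_log = math.sqrt(1 / cases_a + 1 / cases_b)
    lower = irr * math.exp(-z * se_log)
    upper = irr * math.exp(z * se_log)
    return irr, lower, upper


# Hypothetical counts chosen purely for illustration.
irr, lower, upper = incidence_rate_ratio(30, 100_000, 60, 100_000)
print(f"IRR = {irr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

A ratio below 1 with an interval that excludes 1 corresponds to a data point, and its error bar, sitting entirely below the line at 1 in Figure 1.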

Does ancestry affect the genomic testing results?

For the next analysis, we focussed on genetically-inferred ancestry groups (hereafter ‘ancestry groups’ unless otherwise stated), as we wanted to see if genetic differences linked to ancestry influence the identification of genetic variants relevant to patient clinical care.

The ancestry group for each participant was assigned according to their genetic similarity to reference data from five super-populations defined by the 1000 Genomes Project, namely ‘European’, ‘African’, ‘Admixed American’ (hereafter ‘Americas’), ‘East Asian’ and ‘South Asian’. Defined in this way, the largest ancestry group in the Cancer Cohort is European (87.8%) and the second largest is ‘Unassigned’ (4.2% of the cohort). We note that ancestry group assignments depend on the choice of reference populations and on the threshold used to indicate similarity to the reference. The benefits of using reference data on an expanded number of sub-populations are discussed in a separate blog.
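The role of the threshold can be illustrated with a small sketch. It assumes each participant already has a similarity score between 0 and 1 for each super-population. The scores, the 0.8 cut-off and the function name are hypothetical and are not the values or code used by Genomics England.

```python
SUPER_POPULATIONS = ("European", "African", "Americas", "East Asian", "South Asian")


def assign_ancestry(similarity, threshold=0.8):
    """Return the best-matching super-population, or "Unassigned".

    `similarity` maps super-population names to scores in [0, 1].
    The default threshold is an illustrative assumption.
    """
    best = max(SUPER_POPULATIONS, key=lambda pop: similarity.get(pop, 0.0))
    if similarity.get(best, 0.0) >= threshold:
        return best
    return "Unassigned"


# Hypothetical participant whose scores are split between two groups.
example = {"European": 0.55, "South Asian": 0.40, "African": 0.05}
print(assign_ancestry(example))        # best score falls below the 0.8 cut-off
print(assign_ancestry(example, 0.5))   # a looser cut-off assigns a group
```

The same participant moves between ‘Unassigned’ and a named group as the threshold changes, which is why the size of the ‘Unassigned’ group depends on this choice.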

Participant numbers broken down by cancer type, ancestry group, self-reported ethnicity and sex are shown in Figure 2. We found that more males than females were recruited for most cancer types affecting both sexes, consistent with more men being diagnosed with cancer than women in the UK.

We also found that patients of European ancestry tend to be older than patients of other ancestries, which is in line with the general pattern found in England.

Figure 2. Key variables by cancer type. A. Proportion of individuals assigned to each genetically-inferred ancestral super-population group in the 100,000 Genomes Project Cancer Cohort. B. Proportion of individuals for each cancer type by self-reported ethnicity using ethnic groups defined by the UK Office for National Statistics. C. Sex distribution. Patients were assigned a sex if their self-reported sex matched their chromosomes, otherwise they were assigned Indeterminate sex. D. Total number of patients. To protect anonymity, ancestry, ethnicity, or sex sub-groups containing fewer than 5 patients for a cancer type are excluded from percentage calculations. Cancers with fewer than 100 patients are not displayed.
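The small-count rule described in the Figure 2 caption can be sketched as follows. This is one possible reading, in which suppressed sub-groups are dropped from both the numerator and the denominator. That reading, the function name and the counts are assumptions for illustration only.

```python
def subgroup_percentages(counts, min_count=5):
    """Percentages per sub-group after dropping sub-groups below min_count.

    Assumption: suppressed sub-groups are removed from the denominator too.
    """
    kept = {group: n for group, n in counts.items() if n >= min_count}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {group: 100 * n / total for group, n in kept.items()}


# Hypothetical counts for a single cancer type.
counts = {"European": 180, "South Asian": 16, "African": 4}
pcts = subgroup_percentages(counts)
for group, pct in pcts.items():
    print(f"{group}: {pct:.1f}%")
```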

Using data from the 100,000 Genomes Project, Genomics England has developed automated processes known as bioinformatics pipelines that sift through millions of genetic variants in a patient’s genome to pick out the few with potential clinical relevance. These are known as the prioritised variants. Clinicians can then tailor care to individual patients by reviewing the list of prioritised variants output by the pipelines.

We posed the question: Does this list of prioritised variants depend on a patient’s ancestry?

Some genetic variants linked to cancer are inherited from birth and are present in all cells. These variants can make it more likely that a cancer occurs at some point in life. These are known as germline variants. Well known examples of genes with germline variants are BRCA1 and BRCA2 (Dite et al.).

It is important to distinguish these from somatic variants, which are only present in the cancer cells and may be responsible for their abnormal behaviour. Somatic variants can help diagnose cancers, predict disease progression, and inform which particular drug may be most effective for the individual patient. By analysing the different types of somatic variants, clinicians can therefore offer more personalised cancer care.

Ancestry effects on prioritised germline variants

The cancer bioinformatics pipeline classifies clinically relevant germline variants into tiers.

  • Tier 1 indicates likely pathogenic variants in genes strongly linked to the patient cancer type.
  • Tier 3 indicates protein-altering/truncating variants in genes associated with any cancer type.

We found evidence that patients with non-European ancestry have more Tier 1 and Tier 3 variants than patients with European ancestry (Figure 3). For example, patients with African ancestry had a median number of 3 Tier 3 variants compared to 1 for patients with European ancestry. This difference remained significant when accounting for sex and cancer type.

The precise reasons for the varying numbers of prioritised germline variants observed for different ancestries are not yet established. They could be linked to insufficient filtering out of benign variants that are common in diverse ancestry groups but rare in European ancestries. This was suggested by a similar analysis conducted on rare disease participants from the 100,000 Genomes Project.

The observed differences in Tier 1 variants across ancestries could affect clinical management. It is likely that the higher number for non-European ancestry groups represents variants of unknown significance (VUS), potentially making it harder for clinicians to identify actual pathogenic variants.

In summary, our analysis suggests that the results of cancer germline variant prioritisation depend on patient ancestry. The prioritisation criteria may need to be revisited to ensure an appropriate balance between detecting true pathogenic variants and incorrectly prioritising variants of unknown significance or non-pathogenic variants.

Figure 3. Effect of ancestry on the number of Tier 1 and Tier 3 germline variants in the Cancer Cohort. A. Percentage of patients with at least one Tier 1 variant by ancestry. B. Distribution of Tier 3 variants by ancestry. The median number of variants is marked with a vertical black bar. *** indicates a significant difference between the number of variants for a given ancestry compared to individuals of European ancestry (model accounting for sex and cancer type, adjusted for multiple comparisons). Ancestries with low counts not included.

Does ancestry affect the results for genetic testing of somatic variants?

We next considered whether a patient’s ancestry affects the number of identified somatic variants present in the tumour that are prioritised as clinically relevant. The cancer bioinformatics pipeline classifies these into domains as follows:

  • Domain 1: Variants in genes affecting diagnosis, prognosis or treatment of the patient’s cancer.
  • Domain 2: Variants in genes implicated in any cancer.
  • Domain 3: Potentially protein-altering variants found in any other genes.

We found evidence that patients of non-European ancestry tend to have fewer Domain 1 variants, and fewer variants from all domains combined (Domain 1, 2 and 3), compared to those of European ancestry (Figure 4). We also found that men had fewer somatic variants on average than women. Our analysis accounted for this, and for the effect of cancer type on domain variant numbers.

Figure 4. Effect of ancestry on the number of domain variants in the Cancer Cohort. A. Distribution of the number of Domain 1 variants by ancestry. The median number of variants is marked with a vertical black bar. B. Distribution of the number of all domains (Domains 1, 2 and 3) variants by ancestry. *** indicates a significant difference between the number of variants for a given ancestry compared to individuals of European ancestry (model accounting for sex and cancer type, adjusted for multiple comparisons). Although the median total Domain 1 somatic variants is similar in each ancestry, once sex and cancer type are accounted for, patients in each non-European ancestry have fewer Domain 1 somatic variants than patients of European ancestry. Ancestries with low counts not included.

It is possible that differences in the numbers of domain variants are at least partially due to ancestry differences in tumour subtypes, since higher numbers of somatic variants were found in tumours of higher grade (abnormality), and there was an association between ancestry and tumour grade in some cancer types (Figure 5). For example, sarcoma tumour grade was significantly lower in individuals of South Asian ancestry compared to the European ancestry group in the Cancer Cohort, after controlling for age, social deprivation, and sex. Also, breast cancer tumour grade was higher in women of African ancestry compared to women of European ancestry (Figure 5), in line with previous findings (Gathani et al.).

Figure 5. Tumour grade distributions by ancestry and cancer type. Samples that lack grade information are excluded, as are ancestry and cancer type combinations with fewer than 20 samples, and cancer types with fewer than 200 samples. Cancers with fewer than 3 ancestries satisfying these thresholds are not displayed. *** indicates a statistically significant difference at alpha=0.05 after correction for multiple tests.

We note that tumour grade is only available for ~60% of tumour samples available in the 100,000 Genomes Project Cancer Cohort. Missing data makes it difficult to assess the mediating role of tumour features on domain variants, particularly in the case where missing data is not randomly distributed across ancestries.

Further research is therefore required to better understand the mechanisms that lead to the observed differences in the number of domain variants. Factors such as access to care and comorbidities may also influence these relationships.

Conclusion and next steps

We found that for most cancer types, the distribution of ethnicities in the Cancer Cohort was similar to that reported for England as a whole. Exceptions are breast and prostate cancers, where Black (both cancer types) and Asian (breast cancer only) ethnicities were recruited in higher proportions into the Cancer Cohort.

Reassuringly, this implies that there is currently no evidence of underrepresentation of minority ethnic groups in the Cancer Cohort that formed the pilot for the incorporation of whole-genome sequencing into routine cancer healthcare in the NHS. This positions the 100,000 Genomes Project Cancer Cohort as a unique resource given that other cancer cohorts have been severely affected by underrepresentation (Liu et al.). However, we note that our analysis had limited power to detect differences for many cancer-ethnicity group combinations where sample numbers were small.

We found evidence that more inherited germline variants potentially relevant to cancer were prioritised in individuals of non-European ancestry. This may be linked to insufficient filtering of likely benign variants which could potentially impact clinical management. In contrast, we found that fewer somatic variants were prioritised by the cancer bioinformatics pipeline in individuals of non-European ancestry. This could be linked to differences in tumour subtypes or other factors related to access to care or comorbidities, but we did not have sufficient data to assess this.

The cancer bioinformatics pipeline is constantly evolving and has undergone several improvements since the 100,000 Genomes Project, incorporating updates to external resources such as clinical databases. Recently, the pipeline has been modified to process data from individuals with cancer from the Genomic Medicine Service (GMS) in England.

The GMS aims to expand the use of genome sequencing as a routine diagnostic in cancer care in England. Continuing research into the effects of genetic ancestry on the performance of the bioinformatics pipelines is therefore crucial to ensure equitable health care for England’s diverse population.

Statistical methods

For preliminary analysis, likelihood ratio tests (LRTs) were used to test for an overall effect of ancestry on numbers of prioritised variants by comparing baseline models adjusted for sex and cancer type to the same models with an additional categorical covariate for ancestry group. Significant LRTs (p<0.05) were followed up with regression modelling to determine the effects for individual ancestries. In each case, significant ancestry effects were those with coefficients with p<0.05, with Bonferroni adjustment for multiple comparisons.
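The two-step procedure above can be sketched with the standard library alone. The sketch takes the log-likelihoods of two already-fitted nested models as inputs, so it does not fit anything itself. The helper names, log-likelihoods and p-values are hypothetical, and the chi-square tail is computed with closed-form series that are valid for whole-number degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution (integer df >= 1)."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r < df/2} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd df: normal-tail part plus a finite series.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2 * x / math.pi) * math.exp(-half)
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return total


def likelihood_ratio_test(loglik_base, loglik_full, extra_params):
    """Return (statistic, p_value) for two nested models."""
    stat = max(0.0, 2 * (loglik_full - loglik_base))
    return stat, chi2_sf(stat, extra_params)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical log-likelihoods; ancestry adds 3 parameters in this example.
stat, p_value = likelihood_ratio_test(-1250.0, -1243.0, extra_params=3)
print(f"LRT statistic = {stat:.1f}, p = {p_value:.4f}")

# Hypothetical per-ancestry coefficient p-values, adjusted for 3 comparisons.
adjusted = bonferroni([0.01, 0.04, 0.5])
print(adjusted)
```

A categorical ancestry covariate with k groups adds k - 1 parameters to the baseline model, which is the value to pass as `extra_params`.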

Logistic regression was used for Tier 1 models (0 vs 1 or more Tier 1 variants). Negative binomial models were used for all other variant types (Tier 3, Domain 1 and all domains). All covariates were modelled as fixed effects except for cancer type which was modelled as a random effect.
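Negative binomial models are often chosen for count outcomes whose variance exceeds their mean. The post does not state its reason, its parameterisation or the software used. As an illustration only, the sketch below evaluates the log-probability of a count under the common ‘NB2’ form, where the variance is mu + alpha * mu². It does not fit a regression model, and the function name and parameter values are hypothetical.

```python
import math


def neg_binomial_logpmf(y, mu, alpha):
    """log P(Y = y) under NB2 with mean mu > 0 and dispersion alpha > 0.

    Under this parameterisation the variance is mu + alpha * mu**2.
    """
    r = 1.0 / alpha
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )


# Hypothetical mean and dispersion for a variant count.
mu, alpha = 2.0, 0.5
probs = [math.exp(neg_binomial_logpmf(y, mu, alpha)) for y in range(200)]
mean = sum(y * p for y, p in enumerate(probs))
variance = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
print(f"mean = {mean:.3f}, variance = {variance:.3f}")
```

With these values the variance is twice the mean, whereas a Poisson model would force the two to be equal.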

Additional covariates for sensitivity analyses included age, genetic diversity (het/hom ratio), total germline variants (tiered variant analysis only); and tumour grade and total somatic variants (domain variant analysis only).

All models excluded patients with indeterminate sex, haematological cancers, and/or cancers with low sample numbers (n<5). The Admixed American ancestry group was excluded from all analyses due to low sample numbers in this dataset.

Data and software availability

Code used for this analysis can be found in GitLab. The data used in this analysis are in the version 15 release of the National Genomic Research Library.

Please email [email protected] with any queries.


Delon, C., Brown, K.F., Payne, N.W.S. et al. Differences in cancer incidence by broad ethnic group in England, 2013–2017. Br J Cancer 126, 1765–1773 (2022).

Dite, G.S. et al. Familial risks, early-onset breast cancer, and BRCA1 and BRCA2 germline mutations. J Natl Cancer Inst 95, 448–457 (2003).

Gathani, T., Reeves, G., Broggio, J. et al. Ethnicity and the tumour characteristics of invasive breast cancer in over 116,500 women in England. Br J Cancer 125, 611–617 (2021).

Liu, Y.L. et al. Disparities in cancer genetics care by race/ethnicity among pan-cancer patients with pathogenic germline variants. Cancer 128, 3870–3879 (2022).
