Population Genomics GeCIP domain


Project Summary 

Genome sequences from the 100,000 Genomes Project provide an unprecedented genetic resource for the British population. The population genomics GeCIP domain brings together leading UK researchers and collaborators from around the world. They will use the data to better understand how genetic variation has arisen in England. They will also generate information to facilitate other genetic analyses.


The researchers aim to improve understanding of the history of genetic mutation, recombination and genetic drift that has given rise to the genomes in the current population.  The results will inform them about the ancestry of samples, help fill in missing data for disease-oriented genetic studies, and improve the understanding of the recurrence risk for genetic disease caused by new mutations.  They will also combine geographic, epidemiological and genetic data to investigate the importance of these different factors in causing disease.

Below are the current subdomains for this domain. You can find the full details of the research proposed by this domain in the Population Genomics Detailed Research Plan.

Genetic population resources
Jonathan Marchini
Richard Durbin
The GEL dataset will constitute the largest human genetic variation resource ever collected in the UK, and possibly the world. Over the last 10 years, resources such as these (HapMap (1), 1000 Genomes (2), UK10K (3)) have been widely used by the whole human genetics community for genotype imputation into genome-wide association studies (GWAS), characterization of population structure and population genetics. We will bring together experts in genome sequence informatics and statistical genetics to generate a set of derived data sets and analysis tools from the 100,000 Genomics England (GEL) genome sequences. The outputs will have high value for studies of human genetics and human disease using both the GEL subjects themselves and third-party data. The top-level aims are:
1. Phase the genome sequences to generate the world’s largest haplotype reference panel, empowering very low frequency imputation for future genome-wide association studies, via imputation server resources.
2. Characterize the fine-scale genetic structure of the English population at an unprecedented level, providing knowledge for population structure matching and adjustment for disease studies. To succeed, we will pool expertise across the aims to build new methods that can handle the unprecedented scale of the dataset, and for the first time combine phasing and population structure analysis in one single step.
3. Develop and apply reference-free assembly/alignment methods to the primary read data to characterise more divergent variation and (in the longer term) build a graph-based deep variation reference sequence resource for the English population.

Next generation linkage disequilibrium maps
Sarah Ennis
Genetic maps are an essential tool for mapping disease genes. These are maps of our genome that capture information on recombination and other factors that must be considered when trying to accurately identify the genomic location of genetic changes causing disease. The first such maps were linkage maps and these were the tools successfully exploited in all linkage studies of families to uncover genes underlying single gene disorders. Following the Human Genome Project, a new age of linkage disequilibrium (LD) maps uses SNP data to massively improve the resolution of genetic maps. LD maps are centrally important for the design and interpretation of genetic associations with disease. In the era of high throughput genomics, it is now possible to create Next Generation LD Maps that are optimally resolved and serve as essential tools for the mapping and interpretation of genomic variation impacting health.
The University of Southampton Genetic Epidemiology & Genomic Informatics Group has long-standing expertise in genetic map development and application (4). The group is already working on one of the largest human whole genome sequence datasets (n = 500) of healthy elderly people (http://www.scripps.org/research__areas-of-research__genome-and-genomic-medicine-research__wellderly-study) to generate these maps. The wealth of data to be generated by the 100,000 Genomes Project will advance LD map development by virtue of the large sample sizes of multi-ethnic groups across different disease areas. The maps created from the whole genome data of patients contributing to the 100,000 Genomes Project can be used for high resolution association mapping, refining previously identified gene regions, detecting novel genomic rearrangements in cancer and understanding relationships between LD and disease.
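As a rough illustration of what "linkage disequilibrium" measures, the sketch below computes two standard pairwise statistics, D and r², for a pair of biallelic SNPs from phased haplotypes. It is not the group's LD map construction method; the function name `ld_stats` and the 0/1 allele coding are assumptions made for this example only.

```python
def ld_stats(haplotypes):
    """Return (D, r2) for two biallelic SNPs.

    haplotypes: list of (a, b) tuples, one per phased haplotype,
    with alleles coded 0 (reference) or 1 (alternate) at SNP A and SNP B.
    This coding is an assumption of the sketch.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("no haplotypes supplied")
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n  # 1-1 haplotype frequency
    d = p_ab - p_a * p_b                             # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r2 is undefined when either SNP is monomorphic")
    return d, d * d / denom


# Toy data: the two SNPs are always inherited together.
linked = [(1, 1)] * 4 + [(0, 0)] * 4
print(ld_stats(linked))
```

In this toy input the alleles at the two sites always co-occur, so r² reaches its maximum; when the alleles are combined independently, D and r² are both zero.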

Germline mutation
Matthew Hurles
Aylwyn Scally
New (de novo) mutations observed in children are a major cause of rare genetic diseases. The focus of this subdomain is on characterizing the rate, spectra and timing of germline de novo mutations (DNMs) observed in the rare disease families sequenced by GEL, and using this information to: (i) better understand the underlying mutation processes, (ii) develop better methods for identifying DNMs, and (iii) use improved understanding of parental mosaicism and germline cellular genealogies to derive better recurrence risk estimates for genetic disease.
It has been shown that at the population level, the main factor increasing the number of DNMs seen in a child is paternal age (5). However, this is only true for some classes of DNMs (e.g. base substitutions) and not others (e.g. deletions caused by non-allelic homologous recombination) (6). It is likely that other factors, including both genetic variation and environmental exposures, also influence locus-specific and genome-wide mutation rates (as has been observed for somatic mutation processes). The generation by GEL of deep whole genome sequence data on thousands of parent-offspring trios presents a major opportunity to characterize these factors with unprecedented power.
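To make the idea of a DNM concrete, the sketch below flags candidate de novo sites in a parent-offspring trio using the simplest possible rule: the child carries one alternate allele that neither parent carries. This is an illustrative filter only, not the subdomain's method; real callers also weigh read depth, genotype likelihoods and parental mosaicism. The function name `candidate_dnms` and the genotype coding (alternate allele count 0, 1 or 2) are assumptions of the example.

```python
def candidate_dnms(trio_genotypes):
    """Return site IDs where the child is heterozygous and both parents are homozygous reference.

    trio_genotypes: iterable of (site_id, child, father, mother) tuples,
    with each genotype given as an alternate allele count (0, 1 or 2).
    This is a naive genotype-only rule, an assumption of this sketch.
    """
    return [
        site
        for site, child, father, mother in trio_genotypes
        if child == 1 and father == 0 and mother == 0
    ]


# Hypothetical toy sites, not real data.
sites = [
    ("site1", 1, 0, 0),  # candidate DNM: allele absent from both parents
    ("site2", 1, 1, 0),  # inherited from the father
    ("site3", 0, 0, 0),  # no variant in the child
    ("site4", 2, 1, 1),  # inherited from both parents
]
print(candidate_dnms(sites))
```

A filter this simple produces many false positives on real sequence data, which is one reason the subdomain lists better DNM identification methods among its goals.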

Population diversity
Jean-Baptiste Cazier
The 100K initiative, which collects both sequence and clinical data from across the country, provides a unique opportunity to address several classic genetic epidemiological issues. Such phenotypically complete, and therefore complex, data will allow us to disentangle genetic background from other geographic, socioeconomic or cultural factors to properly address questions related to Population Diversity. Whether we are looking at Cancer or Rare Diseases, there is some impact from the genetic background; it can lie at the Caucasian/non-Caucasian boundary (7), or within the diversity of the white British population as demonstrated by the People of the British Isles study (8). While the West Midlands Genomic Medicine Centre is ideally placed for access to a region of strong ethnic diversity, we are also interested in the subtler differences that exist within the "indigenous" British population (9). In the case of cancer, genetic background can lead to a diversity of incidence, as well as progression and response to treatment. However, the same can also be said of "environmental" factors, and having access to such longitudinal clinical data from across the country should prove useful in disentangling these interrelated effects. We would also aim to associate somatic changes with such "environmental" factors. Interestingly, similar effects could also be true for rare diseases, where a specific "isolated" ethnic group might have increased consanguinity and therefore risk. Rare dominantly acting variants are often ‘founders’, which can also lead to marked local differences in frequencies. For recessive diseases, associated socioeconomic and cultural factors can play a very important role in the incidence, reporting, prevention, monitoring and progression of the disease, independently or not of the ethnicity itself.
Spatial data (locational metadata associated with disease and genetics data) requires specific data handling techniques to preserve the anonymity of individuals and groups (such as ethnic communities), because location can be indicative of identity and disclosing it could breach ethical requirements. We therefore intend to work on three main branches, across all the GEL proposed diseases:
- Genetics: Study of the genetic impact on incidence, progression and response to treatment.
- Environmental impact on incidence, reporting, progression and response to treatment. In "Environment" we would include diet, smoking, culture and broader socioeconomic and geographic factors such as deprivation, education level, spatial attributes including access to healthcare facilities, rural/urban location, the north-south divide, the toxicity of the area, etc.
- Ethical impact: how ethnicity impacts the geographical reporting of disease; ethics of spatial data handling and locational privacy; spatial clustering and identity/anonymity.
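
One widely used way to reduce the re-identification risk of locational data is to report counts per area and withhold any count below a minimum size. The sketch below illustrates that idea only; it is not a technique the domain has stated it will use, and the function name `suppress_small_cells`, the area labels and the threshold `k=5` are all assumptions made for the example.

```python
from collections import Counter


def suppress_small_cells(area_codes, k=5):
    """Count records per area and withhold counts smaller than k.

    area_codes: iterable of area labels, one per record (hypothetical labels).
    k: minimum reportable count; the default of 5 is an assumption of this
       sketch, not a recommended policy.
    Returns a dict mapping each area to its count, or None if suppressed.
    """
    counts = Counter(area_codes)
    return {area: (n if n >= k else None) for area, n in counts.items()}


# Hypothetical records: six cases in area "A", two in area "B".
records = ["A"] * 6 + ["B"] * 2
print(suppress_small_cells(records))
```

Suppression of this kind trades spatial resolution for privacy, which is the tension between spatial clustering and anonymity named in the third branch above.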


1) International HapMap Consortium et al. A second generation human haplotype map of over 3.1 million SNPs. Nature 449, 851–861 (2007).
2) 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
3) UK10K Consortium et al. The UK10K project identifies rare variants in health and disease. Nature (2015). doi:10.1038/nature14962
4) Maniatis, N., Collins, A., Xu, C-F., McCarthy, L. C., Hewett, D. R., Tapper, W., Ennis, S., Ke, X. & Morton, N. E. The first linkage disequilibrium (LD) maps: delineation of hot and cold blocks by diplotype analysis. Proc Natl Acad Sci USA 99(4), 2228–2233 (2002).
5) Kong, A., Frigge, M. L., Masson, G., Besenbacher, S., Sulem, P. et al. Rate of de novo mutations and the importance of father's age to disease risk. Nature 488, 471–475 (2012).
6) MacArthur, J. A., Spector, T. D., Lindsay, S. J., Mangino, M., Gill, R. et al. The rate of nonallelic homologous recombination in males is highly variable, correlated between monozygotic twins and independent of age. PLoS Genetics 10, e1004195 (2014).
7) Amundadottir et al. A common variant associated with prostate cancer in European and African populations. Nature Genetics 38(6), 652–658 (2006).
8) Winney, B. et al. People of the British Isles: preliminary analysis of genotypes and surnames in a UK-control population. Eur J Hum Genet 20(2), 203–210 (2012).
9) Davies, J. L.*, Cazier, J-B.* et al. Gene-ancestry interactions in genome-wide association data. PLoS One 7(12) (2012).
