TCGAancestry

Global ancestry estimates for the TCGA panCancer cohort from ADMIXTURE

Summary

The Cancer Genome Atlas (TCGA) is a vital resource in molecular cancer research. Opportunities to conduct cancer health disparities research with this resource are currently limited by incomplete capture of self-reported race. Moreover, self-reported measures have known limitations, such as binning mixed-race individuals into a single racial group that may not reflect their genetic make-up and, in turn, their risk. We therefore estimated global ancestry for all available TCGA samples relative to the standardized populations from 1000 Genomes.

Samples

For all available sample types (primary solid tumor, blood derived normal, or other), genotypes were downloaded from TCGA’s Legacy Archive. In total, 22,963 samples from 11,127 TCGA participants over 30 cancers were included.

Supervised Analysis

ADMIXTURE software was used to estimate ancestral proportions from each of the five 1000 Genomes global super populations. Phase 3 samples from 1000 Genomes (n = 2504) were used as reference.

Super populations:

African (AFR)
Admixed American (AMR)
East Asian (EAS)
European (EUR)
South Asian (SAS)
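
For readers unfamiliar with supervised mode: as the ADMIXTURE manual describes it, a supervised run reads a `.pop` file with one line per individual, in the same order as the genotype file, giving a population label for reference individuals and `-` for individuals whose ancestry is to be estimated. The sketch below only illustrates that layout. The sample IDs and the helper `build_pop_lines` are hypothetical, and the stepByStepSupervised file in this repo remains the reference for what was actually run.

```python
# Illustrative sketch only: sample IDs and labels below are invented.
SUPER_POPULATIONS = {"AFR", "AMR", "EAS", "EUR", "SAS"}

# Hypothetical mapping of 1000 Genomes reference samples to super populations.
REFERENCE_LABELS = {
    "ref_sample_1": "AFR",
    "ref_sample_2": "EUR",
    "ref_sample_3": "EAS",
}


def build_pop_lines(fam_ids, reference_labels):
    """Return one .pop line per individual, in the order of the .fam file.

    Reference individuals get their super population label; all other
    individuals (here, the TCGA samples) get "-" so their ancestry is estimated.
    """
    lines = []
    for sample_id in fam_ids:
        label = reference_labels.get(sample_id, "-")
        if label != "-" and label not in SUPER_POPULATIONS:
            raise ValueError(f"unexpected population label: {label}")
        lines.append(label)
    return lines


# Hypothetical individual order, as it might appear in a merged .fam file.
fam_ids = ["ref_sample_1", "tcga_sample_1", "ref_sample_2", "ref_sample_3", "tcga_sample_2"]
pop_lines = build_pop_lines(fam_ids, REFERENCE_LABELS)
print("\n".join(pop_lines))
```

The order of lines matters more than the IDs themselves, since the `.pop` file carries no sample identifiers.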

Main Data

admixture_calls.txt
- ID - TCGA ID
- POP - dominant super population
- EUR:AFR - ADMIXTURE global ancestry estimates for 5 super populations
- tissue - tissue type
admixture_calls_se.txt
- ID - TCGA ID
- EUR:AFR - standard errors from 200 bootstrapped replicates
- tissue - tissue type
admixture_calls_by_chr.txt and admixture_calls_se_by_chr.txt
- Contain the same information as admixture_calls.txt and admixture_calls_se.txt but also include the chromosome for each set of results
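
As a rough illustration of how the columns described above relate to one another, the following sketch parses a table laid out like admixture_calls.txt and recomputes the dominant super population from the five ancestry estimates. It assumes a tab-delimited file with a header row and one column per super population code; the delimiter, the column order, and the rows shown are assumptions made for the example, not values taken from the data.

```python
import csv
import io

# Assumed column names for the five ancestry estimates (order is not important here).
ANCESTRY_COLUMNS = ["EUR", "EAS", "AMR", "SAS", "AFR"]

# Made-up rows in the assumed layout; these are not real TCGA results.
EXAMPLE = (
    "ID\tPOP\tEUR\tEAS\tAMR\tSAS\tAFR\ttissue\n"
    "example_1\tEUR\t0.90\t0.02\t0.05\t0.02\t0.01\tblood derived normal\n"
    "example_2\tAFR\t0.15\t0.01\t0.03\t0.01\t0.80\tprimary solid tumor\n"
)


def read_calls(handle):
    """Parse rows into dicts with numeric ancestry proportions."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        proportions = {pop: float(record[pop]) for pop in ANCESTRY_COLUMNS}
        rows.append(
            {
                "ID": record["ID"],
                "POP": record["POP"],
                "tissue": record["tissue"],
                "proportions": proportions,
            }
        )
    return rows


def dominant_population(proportions):
    """Return the super population with the largest estimated proportion."""
    return max(proportions, key=proportions.get)


calls = read_calls(io.StringIO(EXAMPLE))
for row in calls:
    total = sum(row["proportions"].values())
    print(row["ID"], dominant_population(row["proportions"]), round(total, 6))
```

To try this on the real file, `read_calls` could be given an open file handle instead of the in-memory example, after checking the actual delimiter and header names.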

Additional Data Resources

entropy.txt
- ID - TCGA ID, entropy - Shannon's entropy, tissue - tissue type
supervised_snp_list.txt
- Approximately 700,000 variants that overlapped between TCGA and 1000 Genomes used for ancestry estimation
- X1 - chromosome, X2 - SNP name, X3 - Position, X4 - base-pair coordinate, X5 - allele 1 (usually minor), X6 - allele 2 (usually major)
blood_derived_normal_pca.txt, primary_solid_tumor_pca.txt, and other_tissues_pca.txt
- First 20 PCs by tissue type (analysis performed in plink)
- ID - TCGA ID, tissue - tissue type the PCA was performed in, PC1:PC20 - first 20 PCs in order
Data are also available at OSF
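
For context on the entropy column in entropy.txt, Shannon's entropy of a set of ancestry proportions is low when one super population dominates and highest when the proportions are equal. The sketch below is a minimal version of that calculation. It assumes the natural logarithm and treats zero proportions as contributing nothing; the log base used to produce entropy.txt is not stated here, so values may differ by a constant factor.

```python
import math


def shannon_entropy(proportions):
    """Shannon's entropy H = -sum(p * ln(p)) over non-zero proportions.

    The natural log is an assumption; another base rescales the result.
    """
    return sum(-p * math.log(p) for p in proportions if p > 0)


# Made-up ancestry proportions for illustration.
single_ancestry = [1.0, 0.0, 0.0, 0.0, 0.0]
evenly_admixed = [0.2, 0.2, 0.2, 0.2, 0.2]
mostly_one = [0.90, 0.02, 0.05, 0.02, 0.01]

for name, props in [
    ("single", single_ancestry),
    ("even", evenly_admixed),
    ("mostly one", mostly_one),
]:
    print(name, shannon_entropy(props))
```

With five super populations, the entropy under this definition ranges from 0 (a single ancestry) to ln(5) (equal proportions).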

Code

stepByStep
- contains step by step instructions for downloading/cleaning files and running ADMIXTURE
stepByStepSupervised
- contains instructions for incorporating 1000 Genomes data and performing the supervised analysis

Collaborators

Jordan Creed

Travis Gerke

Contact Information

Any questions or comments concerning the data or processes described in this repo can be directed to Jordan Creed (Jordan.H.Creed@moffitt.org) or Travis Gerke (Travis.Gerke@moffitt.org).
