GerkeLab / TCGAancestry

Admixture estimates for the TCGA panCancer cohort

TCGAancestry

Global ancestry estimates for the TCGA panCancer cohort from ADMIXTURE

Summary

The Cancer Genome Atlas (TCGA) is a vital resource in molecular cancer research. Opportunities to conduct cancer health disparities research with this resource are currently limited by incomplete data capture for self-reported race. Moreover, self-reported measures have known limitations, such as binning mixed-race individuals into a single racial group that may not reflect their genetic makeup and, in turn, their risk. We therefore estimated global ancestry for all available TCGA samples relative to the standardized reference populations from 1000 Genomes.

Samples

Genotypes for all available sample types (primary solid tumor, blood derived normal, or other) were downloaded from TCGA’s Legacy Archive. In total, 22,963 samples from 11,127 TCGA participants, covering more than 30 cancer types, were included.

Supervised Analysis

The ADMIXTURE software was used to estimate the proportion of ancestry from each of the five 1000 Genomes global super populations. Phase 3 samples from 1000 Genomes (n = 2504) served as the reference.

Super populations:

  • African (AFR)
  • Admixed American (AMR)
  • East Asian (EAS)
  • European (EUR)
  • South Asian (SAS)
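
As a rough illustration of how a supervised run can be set up: ADMIXTURE's supervised mode (`admixture --supervised <file>.bed K`) reads a `.pop` file with one line per individual, in the same order as the PLINK `.fam` file. Reference individuals carry a population label and individuals to be estimated carry `-`. The sketch below builds those lines. The function name `build_pop_lines`, the sample IDs, and the label mapping are hypothetical and are not taken from this repo's scripts; see stepByStepSupervised for the actual procedure.

```python
def build_pop_lines(fam_lines, reference_labels):
    """Return one .pop line per .fam line, in the same order.

    fam_lines        : lines of a PLINK .fam file (whitespace-delimited)
    reference_labels : dict mapping reference sample ID -> super population
                       label (e.g. "AFR"); samples not in the dict get "-",
                       which asks ADMIXTURE to estimate their ancestry.
    """
    pop_lines = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 2:
            raise ValueError(f"malformed .fam line: {line!r}")
        sample_id = fields[1]  # .fam column 2 is the individual ID
        pop_lines.append(reference_labels.get(sample_id, "-"))
    return pop_lines


if __name__ == "__main__":
    # Hypothetical IDs for illustration only.
    fam = [
        "0 REF_SAMPLE_1 0 0 0 -9",
        "0 REF_SAMPLE_2 0 0 0 -9",
        "0 TCGA_SAMPLE_1 0 0 0 -9",
    ]
    labels = {"REF_SAMPLE_1": "AFR", "REF_SAMPLE_2": "EUR"}
    print("\n".join(build_pop_lines(fam, labels)))
```

With five reference labels (AFR, AMR, EAS, EUR, SAS), the matching run would use K = 5.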

Main Data

Additional Data Resources

  • entropy.txt
    • ID - TCGA ID, entropy - Shannon's entropy, tissue - tissue type
  • supervised_snp_list.txt
    • The approximately 700,000 variants shared between TCGA and 1000 Genomes that were used for ancestry estimation
    • X1 - chromosome, X2 - SNP name, X3 - position (genetic distance, following the PLINK .bim column layout), X4 - base-pair coordinate, X5 - allele 1 (usually minor), X6 - allele 2 (usually major)
  • blood_derived_normal_pca.txt, primary_solid_tumor_pca.txt, and other_tissues_pca.txt
    • First 20 principal components (PCs) by tissue type (analysis performed in PLINK)
    • ID - TCGA ID, tissue - tissue type in which the PCA was performed, PC1:PC20 - first 20 PCs in order
  • Data are also available at OSF
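
For readers unfamiliar with the entropy column in entropy.txt, here is a minimal sketch of Shannon's entropy applied to a vector of ancestry proportions. It assumes the entropy is computed over the five super population proportions for a sample; the logarithm base used for entropy.txt is not stated here, so `base` is left as a parameter. The function name `shannon_entropy` is hypothetical.

```python
import math


def shannon_entropy(proportions, base=2.0):
    """Shannon entropy H = -sum(p * log(p)) of a vector of proportions.

    Inputs are renormalized to sum to 1, and zero proportions contribute
    nothing (the usual 0 * log 0 = 0 convention).
    """
    if any(p < 0 for p in proportions):
        raise ValueError("proportions must be non-negative")
    total = sum(proportions)
    if total <= 0:
        raise ValueError("proportions must sum to a positive value")
    entropy = 0.0
    for p in proportions:
        q = p / total
        if q > 0:
            entropy -= q * math.log(q, base)
    return entropy


if __name__ == "__main__":
    # Illustrative values only, ordered AFR, AMR, EAS, EUR, SAS.
    single_ancestry = [1.0, 0.0, 0.0, 0.0, 0.0]
    evenly_admixed = [0.2, 0.2, 0.2, 0.2, 0.2]
    print(shannon_entropy(single_ancestry))
    print(shannon_entropy(evenly_admixed))
```

Under this definition, a sample assigned entirely to one super population has entropy 0, and the value grows as ancestry is spread more evenly across populations, reaching log(5) in the chosen base for an even five-way split.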

Code

  • stepByStep
    • Step-by-step instructions for downloading and cleaning files and running ADMIXTURE
  • stepByStepSupervised
    • Instructions for incorporating 1000 Genomes data and performing the supervised analysis

Collaborators

Jordan Creed

Travis Gerke

Contact Information

Questions or comments about the data or processes described in this repo can be directed to Jordan Creed (Jordan.H.Creed@moffitt.org) or Travis Gerke (Travis.Gerke@moffitt.org).
