NovembreLab / eems-merge

Data Processing

The exact pipeline used is available on GitHub under GITHUB_LINK. As some of the data sets are subject to restrictions, they cannot be made available.

Thus, this repo contains all the code used to merge the data sets, and below (and in the paper) we describe how the data was obtained. We can, however, provide meta-data, as this is either part of publications or publicly available.

How to run this

  1. obtain all the genetic data files listed below, and copy them into the correct folders (see below)
  2. obtain genotype chip info files
  3. download Snakemake; the workflow was run using version 3.5.2 and does not work on some more recent versions
  4. make sure plink is in the path
  5. install numpy, pandas, scipy for python
  6. install dplyr, yaml, rworldmap, data.table for R
  7. run snakemake pgs/gvar3.indiv_meta
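
Before starting the workflow, it can help to confirm that the tools named in the steps above are installed. The following is a minimal sketch, not part of the repo: the function name `missing_prerequisites` is made up for illustration, and the R packages from step 6 are not checked.

```python
import importlib.util
import shutil

# Python packages named in step 5 of the list above.
PYTHON_PACKAGES = ("numpy", "pandas", "scipy")
# Command-line tools from steps 3 and 4.
EXECUTABLES = ("snakemake", "plink")


def missing_prerequisites(packages=PYTHON_PACKAGES, executables=EXECUTABLES):
    """Return the names of packages and executables that cannot be found."""
    missing = []
    for name in packages:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    for name in executables:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    if problems:
        print("missing:", ", ".join(problems))
    else:
        print("all prerequisites found; run: snakemake pgs/gvar3.indiv_meta")
```

The check only tests for presence. It does not verify that the installed Snakemake version is one the workflow runs on.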

Data Sources

Human Origins Data set (Lazaridis et al. 2014)

The full data set was obtained from David Reich with permission for demographic analyses. Sampling location information was obtained from table S9.4 of Lazaridis et al. 2014. We used the population information in the vdata subset of all ascertainment panels, except for the analysis where we assess ascertainment bias. The utility convert from admixtools (Patterson et al. 2012) was used to convert the data into plink format.

Estonian BioCenter data

The data generated by the Estonian Biocenter (REFS) was kindly provided in plink format by Mait Metspalu on 10/30/15, along with location information where it was available. This data set contained 1282568 SNP. Of those, 6770 SNP had non-unique ids and were therefore removed.
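
The removal of SNP with non-unique ids can be sketched as follows. This is an illustration rather than the code used in the pipeline: it assumes that every copy of a repeated id is dropped (the text above does not say whether one copy was kept), and the function name and example ids are hypothetical. The input lines follow the plink .bim layout, where the second column is the SNP id.

```python
from collections import Counter


def unique_snp_ids(bim_lines):
    """Return the SNP ids that occur exactly once, in file order.

    Each line follows the plink .bim layout:
    chromosome, SNP id, genetic distance, position, allele 1, allele 2.
    """
    ids = [line.split()[1] for line in bim_lines if line.strip()]
    counts = Counter(ids)
    # Assumption: every copy of a repeated id is dropped, not just later ones.
    return [snp for snp in ids if counts[snp] == 1]


# Hypothetical example records, not taken from the real data.
example = [
    "1 rs1 0 1000 A G",
    "1 rs2 0 2000 C T",
    "1 rs2 0 2500 C T",
    "2 rs3 0 3000 G A",
]
kept = unique_snp_ids(example)
print(kept)  # ['rs1', 'rs3']
```

The resulting ids could be written to a text file, one per line, and passed to plink with --extract.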

HUGO Panasian SNP consortium

The data was downloaded on 6/24/15 from www.biotec.or.th/PASNP. Location metadata was obtained on the same day from the map on the same website, and individuals were matched to populations using the individual identifiers. All individuals with the same tag were assigned the median of all locations from that tag. The data was first lifted onto hg19 (with 5 out of 54794 SNP being removed), and then reformatted into binary plink format.
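
As a rough sketch of the location assignment described above, the snippet below takes the median separately for latitude and longitude within each tag. Whether the original analysis used a coordinate-wise median is an assumption, and the tags and coordinates in the example are invented.

```python
from collections import defaultdict
from statistics import median


def median_location_by_tag(records):
    """Map each population tag to its median latitude and longitude.

    `records` is an iterable of (tag, latitude, longitude) tuples.
    """
    lats = defaultdict(list)
    lons = defaultdict(list)
    for tag, lat, lon in records:
        lats[tag].append(lat)
        lons[tag].append(lon)
    # Assumption: the median is taken separately for each coordinate.
    return {tag: (median(lats[tag]), median(lons[tag])) for tag in lats}


# Hypothetical tags and coordinates, for illustration only.
records = [
    ("POP-A", 18.0, 99.0),
    ("POP-A", 19.0, 100.0),
    ("POP-A", 20.0, 98.0),
    ("POP-B", -7.0, 110.0),
]
locations = median_location_by_tag(records)
print(locations)  # {'POP-A': (19.0, 99.0), 'POP-B': (-7.0, 110.0)}
```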

Xing et al. data

This data set was downloaded on 6/20/15 from http://jorde-lab.genetics.utah.edu/pub/affy6_xing2010/. Sampling locations were kindly provided by Jinchuan Xing. We used version 32 of the annotation file obtained from affymetrix.com to map SNP onto hg19, remove strand-ambiguous SNP and to flip SNP that were on the minus-strand.
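
The strand handling described above, which is also applied to several of the data sets below, might look like the following sketch. It assumes that alleles are coded as A, C, G or T and that the strand has already been read from the annotation file as "+" or "-". All names and example SNP are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1, allele2):
    """True for A/T and C/G SNP, whose alleles look the same on both strands."""
    return COMPLEMENT.get(allele1) == allele2


def flip_alleles(allele1, allele2):
    """Return the alleles as read from the opposite strand."""
    return COMPLEMENT[allele1], COMPLEMENT[allele2]


def to_plus_strand(snps):
    """Drop strand-ambiguous SNP and flip minus-strand SNP.

    `snps` is an iterable of (snp_id, allele1, allele2, strand) tuples,
    with strand given as "+" or "-".
    """
    cleaned = []
    for snp_id, a1, a2, strand in snps:
        if is_strand_ambiguous(a1, a2):
            continue  # A/T and C/G SNP cannot be oriented reliably
        if strand == "-":
            a1, a2 = flip_alleles(a1, a2)
        cleaned.append((snp_id, a1, a2))
    return cleaned


# Hypothetical SNP, for illustration only.
snps = [
    ("snp1", "A", "G", "+"),
    ("snp2", "A", "T", "+"),
    ("snp3", "C", "T", "-"),
    ("snp4", "G", "C", "-"),
]
cleaned = to_plus_strand(snps)
print(cleaned)  # [('snp1', 'A', 'G'), ('snp3', 'G', 'A')]
```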

POPRES data

POPRES data was obtained under dbGaP accession XXXX to John Novembre, and we used the data as processed in Novembre et al. (2008). We used version 32 of the annotation file obtained from www.affymetrix.com ("Mapping250K_Nsp.na32.annot.csv" and "Mapping250K_Sty.na32.annot.csv") to filter SNP that did not map onto hg19, and we removed strand-ambiguous AT and GC polymorphisms. Following Novembre et al., we only retained individuals for which all grandparents were from the same country, and split up the Swiss sample according to language groups.

reich2011 data

The data were obtained on 7/14/15 from Mark Stoneking with permission for demographic analyses. After merging the three different source files, SNP not mapping to hg19 using the annotation file "GenomeWideSNP_6.na32.annot.csv" were removed, as were AT and GC SNPs. Sampling locations were extracted from Figure 1 of Reich et al. (2011).

Paschou et al. (2013)

Data was obtained on 8/13/15 in binary plink format from http://drineas.org/Maritime_Route/RAW_DATA/PLINK_FILES/MARITIME_ROUTE.zip. Sampling location information was obtained from Supplementary Table 3 in Paschou et al. (2013). SNP not mapping to hg19 using the annotation file "GenomeWideSNP_6.na32.annot.csv" were removed, as were AT and GC SNPs.

Jeong et al. (2017), Bigham et al. (2011) and Xu et al. (2012) data

This data was obtained from Choongwon Jeong and Anna Di Rienzo. We used the same filtering as in the Jeong et al. (2017) study, but only added the samples originating from these three studies with permissions from the respective authors.

Combining Meta-information

All sources with the exception of the Estonian Biocenter data provided (approximate) sampling coordinates. However, the level of accuracy varied between sources, with some providing specific ethnicities, some (such as POPRES) only providing country information and others just providing city- or state-level information. For POPRES-derived data, and most countries, we followed Novembre et al. (2008) and assigned individuals to the country's center point, with the exception of Sweden and Finland, which were assigned their capitals.

For the Estonian BioCentre data, sampling location data was highly heterogeneous. Samples that could not be confidently assigned to a region with an approx. 100 km radius were excluded. For populations with samples from multiple studies, the most accurate source location was used. For locations covered with different accuracy, only the most accurate samples were retained. For example, we excluded all Spanish individuals from POPRES (only country-level data), as Human Origins provided samples from eleven different regions in Spain.
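
The rule of retaining only the most accurate samples for a location can be illustrated with the sketch below. The two precision levels, their ranking, and the sample ids are assumptions made for this example, and the actual curation involved manual judgment that a few lines of code do not capture.

```python
# Hypothetical ranking of location precision; lower is more precise.
PRECISION_RANK = {"region": 0, "country": 1}


def most_precise_samples(samples):
    """Keep, for each population, only samples with the most precise location.

    `samples` is a list of (sample_id, population, level) tuples, where
    level is a key of PRECISION_RANK.
    """
    best = {}
    for _, population, level in samples:
        rank = PRECISION_RANK[level]
        best[population] = min(rank, best.get(population, rank))
    return [s for s in samples if PRECISION_RANK[s[2]] == best[s[1]]]


# Hypothetical sample ids, mirroring the Spanish example above.
samples = [
    ("popres_1", "Spain", "country"),
    ("ho_1", "Spain", "region"),
    ("popres_2", "France", "country"),
]
retained = most_precise_samples(samples)
print(retained)  # [('ho_1', 'Spain', 'region'), ('popres_2', 'France', 'country')]
```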

Merging

All genetic data was merged using plink. We excluded all sites that were not biallelic or where alleles were ambiguously labeled in different source files. This resulted in a file with 1.9M SNP in a total of 8698 individuals, but with only 19.8% average genotype availability and no SNP genotyped in all individuals. To remove closely related individuals, we first created an LD-pruned set of SNP using the --indep-pairwise 1000 1000 .1 flag in plink. Then, we calculated a relationship matrix using the --make-grm-bin flag and removed individuals with a relationship larger than 0.6, which reduced the sample to 8062 individuals.
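
The relatedness filter described above can be sketched in two parts: the plink calls, built here as argument lists, and the removal of related individuals. The file prefixes are hypothetical, and the greedy removal strategy is an assumption, since the text does not state which member of a related pair was dropped. Reading the binary relationship matrix is left out; the sketch takes pairwise values as a dictionary.

```python
from collections import Counter


def plink_commands(bfile, out):
    """Build the two plink calls described above as argument lists."""
    prune = ["plink", "--bfile", bfile,
             "--indep-pairwise", "1000", "1000", ".1", "--out", out]
    # --indep-pairwise writes the retained SNP ids to <out>.prune.in
    grm = ["plink", "--bfile", bfile, "--extract", out + ".prune.in",
           "--make-grm-bin", "--out", out]
    return prune, grm


def drop_related(individuals, relationships, threshold=0.6):
    """Greedily remove individuals until no pair exceeds the threshold.

    `relationships` maps frozenset({id_a, id_b}) to a relationship value.
    """
    related = [set(pair) for pair, value in relationships.items()
               if value > threshold]
    removed = set()
    while True:
        # Pairs in which both members are still present.
        active = [pair for pair in related if not (pair & removed)]
        if not active:
            break
        counts = Counter(ind for pair in active for ind in pair)
        # Assumption: drop the individual involved in the most related
        # pairs; ties go to the alphabetically first id.
        worst = max(sorted(counts), key=lambda ind: counts[ind])
        removed.add(worst)
    return [ind for ind in individuals if ind not in removed]


# Hypothetical ids and relationship values, for illustration only.
relationships = {
    frozenset({"a", "b"}): 0.9,
    frozenset({"b", "c"}): 0.7,
    frozenset({"c", "d"}): 0.1,
}
unrelated = drop_related(["a", "b", "c", "d"], relationships)
print(unrelated)  # ['a', 'c', 'd']
```

The argument lists returned by `plink_commands` could be passed to `subprocess.run` one after the other.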

Files not present in repo (and how to obtain them)

raw genetic data (also the bim and fam files, or the map file for ped files)

These are the files that are required to start the pipeline:

  • raw/paschou.zip #downloaded archive
  • raw/MARITIME_ROUTE.bed #extracted
  • raw/POPRES_Genotypes_QC1_v2.bed #popres data from John Novembre
  • raw/reich2011/Australia.bed #stoneking/reich SEAsia data
  • raw/reich2011/Denisova-SEAsia-Oceania.bed #stoneking/reich SEAsia data
  • raw/reich2011/Stoneking.Data.tar #stoneking/reich SEAsia data
  • raw/reich2011/STONEKING.malaysia.ped #stoneking/reich SEAsia data
  • raw/verdu2014/allAutosomes_82-nativeAmericans_illuminaHuman610_unphased_passedQC_SNPs_dbGaP.ped #verdu data from dbgap, not used in paper
  • raw/hugo/Genotypes_All.txt #downloaded HUGO genotypes
  • raw/affy6_344_raw_genotype_xing #downloaded xing et al data
  • raw/xing_sample_pop.txt #individual/pop data for xing et al
  • raw/EuropeAllData/vdata.ind/snp/pop #reich format Lazaridis et al. data
  • raw/Data_for_Ben.bed #estonian biocentre data from mait
  • qatari/NWAfrica_HM3_Qat.bed (African data)
  • qatari/qatari.bed (qatari data)
  • qatari/hg37.bed (lifted african data)
  • tib/HGDP_Tibetan_Merged_160509.bed #obtained from Choongwon Jeong

private meta data

  • sources/POPRES_Phenotypes.txt : obtained from John Novembre through data from 2008 paper

chip info

These files are required to annotate SNP correctly and for the automated processing. They were obtained from the manufacturer's website.

  • chip/GenomeWideSNP_6.na32.annot.csv
  • chip/Mapping250K_Nsp.na32.annot.csv
  • chip/Mapping250K_Sty.na32.annot.csv

intermediate data files

Intermediate data files after basic cleaning, in plink format (also bim and fam files named similarly). They are generated automatically by the pipeline.

  • data/Data_for_Ben.bed #estonian biocentre data from Mait Metspalu
  • data/hugo.bed #hugo data
  • data/MARITIME_ROUTE.bed #Paschou et al data
  • data/POPRES_Genotypes_QC1_v2.bed #popres data
  • data/reich2011.bed
  • data/vdata.bed #Lazaridis full data
  • data/verdu.bed #verdu et al 2014 data (not used in paper)
  • data/xing.bed #xing et al 2010 data

All temporary merging files and the merged genotype data

  • merged/*bed
  • merged/*bim
  • merged/*fam

Data files present in repo

liftover for hugo data

  • supplementary/lifted.xbed
  • supplementary/unlifted.xbed

list of duplicated labels across studies, used to merge and exclude samples

  • duplicate_dict.txt

location sources:

additional source data (partially processed):

  • regions/estonian_bibtex.csv
  • regions/estonian_studies.csv
  • regions/location2.csv
  • regions/location_coords2.csv
  • regions/location_coords.csv
  • regions/location_full.csv
  • regions/location_hugo.csv
  • regions/locations_deduplicated.csv
  • regions/location_simplified.csv
  • regions/Stoneking.pops.csv

Tibetan metadata:

  • tib/tib.plink
  • tib/tibetan.indiv_* #used here
  • tib/tibetan.pop_* #used here
  • tib/tib_tibetan.csv
  • tib/HGDP_Tibetan_Merged_160509_tibetan.indiv* #all tibetan (from Jeong et al 2017)
  • tib/HGDP_Tibetan_Merged_160509_tibetan.pop* #all tibetan (from Jeong et al 2017)
  • tib/HGDP_Tibetan_Merged_160509.indiv* #all data from Jeong et al 2017
  • tib/HGDP_Tibetan_Merged_160509.pop* #all data from Jeong et al 2017

North African / Qatari data

  • qatari/codes.txt
  • qatari/flip.txt
  • qatari/keep_snp.txt

merging update files:

  • pgs/gvar3.names
  • pgs/update_pos.csv
  • pgs/merge.csv
