Data Processing

The exact pipeline used is available on GitHub under GITHUB_LINK. Because some of the data sets are subject to access restrictions, they cannot be made available.

This repo therefore contains all the code used to merge the data sets, and the sections below (and the paper) describe how each data set was obtained. We can provide the metadata, however, as it is either part of publications or publicly available.

How to run this

obtain all the genetic data files listed below and copy them into the correct folders (see below)
obtain the genotype chip info files
install Snakemake; the workflow was run using version 3.5.2 and does not work with some more recent versions
make sure plink is in the path
install numpy, pandas and scipy for Python
install dplyr, yaml, rworldmap and data.table for R
run snakemake pgs/gvar3.indiv_meta
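
Before launching the workflow, it can help to check that the prerequisites from the steps above are in place. The following is a minimal sketch, not part of the pipeline: the function names are hypothetical, and the presence of an Rscript executable is an assumption about how R is invoked.

```python
import importlib.util
import shutil

# Tool and module names come from the steps above.
# "Rscript" is an assumption about how the R code is run.
REQUIRED_EXECUTABLES = ["plink", "snakemake", "Rscript"]
REQUIRED_PYTHON_MODULES = ["numpy", "pandas", "scipy"]


def missing_executables(names):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def missing_modules(names):
    """Return the Python modules from `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for name in missing_executables(REQUIRED_EXECUTABLES):
        print(f"missing executable: {name}")
    for name in missing_modules(REQUIRED_PYTHON_MODULES):
        print(f"missing Python module: {name}")
```

The check only reports whether a tool is present; it does not verify the Snakemake version, which still has to match the one noted above.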

Data Sources

Human Origins Data set (Lazaridis et al. 2014)

The full data set was obtained from David Reich with permission for demographic analyses. Sampling location information was obtained from Table S9.4 of Lazaridis et al. 2014. We used the population information in the vdata subset of all ascertainment panels, except for the analysis where we assess ascertainment bias. The utility convertf from admixtools (Patterson et al. 2012) was used to convert the data into plink format.

Estonian BioCenter data

The data generated by the Estonian Biocenter (REFS) was kindly provided in plink format by Mait Metspalu on 10/30/15, along with location information where it was available. This data set contained 1282568 SNPs. Of those, 6770 SNPs had non-unique IDs and were therefore removed.
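
The duplicate-ID filter described above can be sketched as follows. This is an illustration rather than the pipeline's own code: the function name and the example records are made up, and the sketch flags every record that shares an ID, which is an assumption since the text does not say whether one copy was kept.

```python
from collections import Counter


def nonunique_snp_ids(bim_lines):
    """Return the set of SNP IDs that occur more than once.

    Each line is a plink .bim record:
    chromosome, SNP ID, genetic distance, position, allele 1, allele 2.
    """
    counts = Counter(line.split()[1] for line in bim_lines if line.strip())
    return {snp_id for snp_id, n in counts.items() if n > 1}


# Made-up records for illustration only.
example_bim = [
    "1 rs1 0 1000 A G",
    "1 rs2 0 2000 C T",
    "2 rs1 0 3000 A C",  # same ID as the first record
]
to_exclude = nonunique_snp_ids(example_bim)
print(sorted(to_exclude))
```

The resulting IDs could be written one per line to a text file and passed to plink's --exclude option.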

HUGO Panasian SNP consortium

The data was downloaded on 6/24/15 from www.biotec.or.th/PASNP. Location metadata was obtained on the same day from the map on the same website, and individuals were matched to populations using the individual identifiers. All individuals with the same tag were assigned the median of all locations from that tag. The data was first lifted onto hg19 (with 5 out of 54794 SNPs being removed), and then reformatted into binary plink format.
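
The per-tag location assignment can be sketched as below. The function name, tags and coordinates are hypothetical, and taking the median separately for latitude and longitude is an assumption, since the text does not specify how the median of two-dimensional locations was computed.

```python
from collections import defaultdict
from statistics import median


def median_location_by_tag(records):
    """Map each tag to (median latitude, median longitude) of its locations.

    `records` is an iterable of (tag, latitude, longitude) tuples.
    """
    lats, lons = defaultdict(list), defaultdict(list)
    for tag, lat, lon in records:
        lats[tag].append(lat)
        lons[tag].append(lon)
    return {tag: (median(lats[tag]), median(lons[tag])) for tag in lats}


# Made-up tags and coordinates for illustration only.
example_records = [
    ("TAG1", 1.0, 100.0),
    ("TAG1", 2.0, 102.0),
    ("TAG1", 10.0, 101.0),
    ("TAG2", 5.0, 50.0),
]
locations = median_location_by_tag(example_records)
print(locations["TAG1"])
```

Every individual carrying a given tag would then receive that tag's coordinates.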

Xing et al. data

This data set was downloaded on 6/20/15 from http://jorde-lab.genetics.utah.edu/pub/affy6_xing2010/. Sampling locations were kindly provided by Jinchuan Xing. We used version 32 of the annotation file obtained from affymetrix.com to map SNPs onto hg19, to remove strand-ambiguous SNPs and to flip SNPs that were on the minus strand.
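
The two strand operations used here (and for the other Affymetrix-based data sets below) can be sketched as follows. The function names are hypothetical, and the strand argument stands in for whatever strand column the annotation file provides.

```python
# Watson-Crick complement of each base.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele1, allele2):
    """True for A/T and C/G SNPs, whose alleles are each other's complement.

    For these SNPs the allele pair looks identical on both strands,
    so the strand cannot be inferred from the alleles alone.
    """
    return COMPLEMENT.get(allele1) == allele2


def flip_to_plus(allele1, allele2, strand):
    """Return the alleles on the plus strand, complementing if strand is '-'."""
    if strand == "-":
        return COMPLEMENT[allele1], COMPLEMENT[allele2]
    return allele1, allele2


print(is_strand_ambiguous("A", "T"))
print(flip_to_plus("A", "G", "-"))
```

Strand-ambiguous SNPs are dropped rather than flipped because a wrong strand assignment would go undetected in the merged data.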

POPRES data

POPRES data was obtained under dbGaP accession XXXX to John Novembre, and we used the data as processed in Novembre et al. (2008). We used version 32 of the annotation file obtained from www.affymetrix.com ("Mapping250K_Nsp.na32.annot.csv" and "Mapping250K_Sty.na32.annot.csv") to filter SNPs that did not map onto hg19, and we removed strand-ambiguous AT and GC polymorphisms. Following Novembre et al., we only retained individuals for whom all grandparents were from the same country, and split up the Swiss sample according to language groups.

reich2011 data

The data were obtained on 7/14/15 from Mark Stoneking with permission for demographic analyses. After merging the three different source files, SNPs not mapping to hg19 using the annotation file "GenomeWideSNP_6.na32.annot.csv" were removed, as were AT and GC SNPs. Sampling locations were extracted from Figure 1 of Reich et al. (2011).

Paschou et al. (2013)

Data was obtained on 8/13/15 in binary plink format from http://drineas.org/Maritime_Route/RAW_DATA/PLINK_FILES/MARITIME_ROUTE.zip. Sampling location information was obtained from Supplementary Table 3 in Paschou et al. (2013). SNPs not mapping to hg19 using the annotation file "GenomeWideSNP_6.na32.annot.csv" were removed, as were AT and GC SNPs.

Jeong et al. (2017), Bigham et al. (2011) and Xu et al. (2012) data

This data was obtained from Choongwon Jeong and Anna Di Rienzo. We used the same filtering as in the Jeong et al. (2017) study, but only added the samples originating from these three studies with permissions from the respective authors.

Combining Meta-information

All sources with the exception of the Estonian Biocenter data provided (approximate) sampling coordinates. However, the level of accuracy varied between sources, with some providing specific ethnicities, some (such as POPRES) only providing country information and others just providing city- or state-level information. For POPRES-derived data, and for most countries, we followed Novembre et al. (2008) and assigned individuals to the country's center point, with the exception of Sweden and Finland, which were assigned the location of their capitals.

For the Estonian BioCentre data, sampling location data was highly heterogeneous. Samples that could not be confidently assigned to a region with an approx. 100 km radius were excluded. For populations with samples from multiple studies, the most accurate source location was used. For locations covered with different accuracy, only the most accurate samples were retained. (For example, we excluded all Spanish individuals from POPRES, which has only country-level data, as Human Origins provided samples from eleven different regions in Spain.)

Merging

All genetic data was merged using plink. We excluded all sites that were not biallelic or where alleles were ambiguously labeled in different source files. This resulted in a file with 1.9M SNPs in a total of 8698 individuals, but with only 19.8% average genotype availability and with no SNP genotyped in all individuals. To remove closely related individuals, we first created an LD-pruned set of SNPs using the --indep-pairwise 1000 1000 .1 flag in plink. Then, we calculated a relationship matrix using the --make-grm-bin flag, and removed individuals with a relationship larger than 0.6, which reduced the number of individuals to 8062.
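
The relatedness filter can be sketched as below, operating on pairwise relationship values such as those in the matrix computed above. The text does not say which member of a related pair was removed, so the greedy rule here (drop the individual involved in the most related pairs first) is an assumption, and the function name and example IDs are hypothetical.

```python
def remove_related(pairs, threshold=0.6):
    """Greedily pick individuals to drop until no pair exceeds `threshold`.

    `pairs` is an iterable of (id_a, id_b, relationship) tuples.
    Returns the set of individual IDs to remove.
    """
    related = [(a, b) for a, b, r in pairs if r > threshold and a != b]
    removed = set()
    while True:
        # Count how many unresolved related pairs each individual is part of.
        degree = {}
        for a, b in related:
            if a in removed or b in removed:
                continue
            degree[a] = degree.get(a, 0) + 1
            degree[b] = degree.get(b, 0) + 1
        if not degree:
            return removed
        # Drop the most connected individual; ties go to the smallest ID.
        worst = max(sorted(degree), key=degree.get)
        removed.add(worst)


# Made-up relationship values for illustration only.
example_pairs = [
    ("i1", "i2", 0.9),
    ("i2", "i3", 0.7),
    ("i3", "i4", 0.1),
]
to_remove = remove_related(example_pairs)
print(sorted(to_remove))
```

Dropping the most connected individual first tends to keep more samples than removing one member of every pair independently, but other selection rules would satisfy the same threshold.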

Files not present in repo (and how to obtain them)

raw genetic data (also bim and fam files, or map for ped files)

These are the files that are required to start the pipeline:

raw/paschou.zip #downloaded archive
raw/MARITIME_ROUTE.bed #extracted
raw/POPRES_Genotypes_QC1_v2.bed #popres data from John Novembre
raw/reich2011/Australia.bed #stoneking/reich SEAsia data
raw/reich2011/Denisova-SEAsia-Oceania.bed #stoneking/reich SEAsia data
raw/reich2011/Stoneking.Data.tar #stoneking/reich SEAsia data
raw/reich2011/STONEKING.malaysia.ped #stoneking/reich SEAsia data
raw/verdu2014/allAutosomes_82-nativeAmericans_illuminaHuman610_unphased_passedQC_SNPs_dbGaP.ped #verdu data from dbgap, not used in paper
raw/hugo/Genotypes_All.txt #downloaded HUGO genotypes
raw/affy6_344_raw_genotype_xing #downloaded xing et al data
raw/xing_sample_pop.txt #individual/pop data for xing et al
raw/EuropeAllData/vdata.ind/snp/pop #reich format Lazaridis et al. data
raw/Data_for_Ben.bed #estonian biocentre data from mait
qatari/NWAfrica_HM3_Qat.bed #African data
qatari/qatari.bed #Qatari data
qatari/hg37.bed #lifted African data
tib/HGDP_Tibetan_Merged_160509.bed #obtained from Choongwon Jeong

private meta data

sources/POPRES_Phenotypes.txt : obtained from John Novembre through data from 2008 paper

chip info

These files are required to annotate SNPs correctly and for the automated processing. They were obtained from the manufacturer's website.

chip/GenomeWideSNP_6.na32.annot.csv
chip/Mapping250K_Nsp.na32.annot.csv
chip/Mapping250K_Sty.na32.annot.csv

intermediate data files

Intermediate data files after basic cleaning, in plink format (with similarly named bim and fam files). They are generated automatically by the pipeline.

data/Data_for_Ben.bed #estonian biocentre data from Mait Metspalu
data/hugo.bed #hugo data
data/MARITIME_ROUTE.bed #Paschou et al data
data/POPRES_Genotypes_QC1_v2.bed #popres data
data/reich2011.bed
data/vdata.bed #Lazaridis full data
data/verdu.bed #verdu et al 2014 data (not used in paper)
data/xing.bed #xing et al 2010 data

All temporary merging files and the merged genotype data

merged/*bed
merged/*bim
merged/*fam

Data files present in repo

liftover for hugo data

supplementary/lifted.xbed
supplementary/unlifted.xbed

list of duplicated labels across studies, used to merge and exclude samples

duplicate_dict.txt

location sources:

sources/Data_for_Ben_Meta.xlsx: obtained from Mait Metspalu in November 2015 (email)
sources/Stoneking.pops.txt : From Stoneking.Data.tar, obtained from Mark Stoneking
sources/HGDP_SampleInformation.txt: obtained from wget -O HGDP_SampleInformation.txt http://web.stanford.edu/group/rosenberglab/data/rosenberg2006ahg/SampleInformation.txt
sources/human_origins.csv : Table S9.4, Email from David Reich through John Novembre
sources/POPRES_TS3.csv: Table S3 from paper
sources/PASNP_Map.htm : from the website http://www4a.biotec.or.th/PASNP/PASNP_Map
sources/hugo_meta.csv : processed version
sources/Pop_Positions_Xing_2010.csv: from Jinchuan Xing by email
sources/botigue2013.pdf: paper for Botigue2013 data
sources/1000g_loc.csv: from http://www.1000genomes.org/category/frequently-asked-questions/population
sources/journal.pgen.*png: Table 1 of the Verdu et al. paper as an image
sources/paschou_locations.csv: Table S3 from paper

additional source data (partially processed):

regions/estonian_bibtex.csv
regions/estonian_studies.csv
regions/location2.csv
regions/location_coords2.csv
regions/location_coords.csv
regions/location_full.csv
regions/location_hugo.csv
regions/locations_deduplicated.csv
regions/location_simplified.csv
regions/Stoneking.pops.csv

Tibetan metadata:

tib/tib.plink
tib/tibetan.indiv_* #used here
tib/tibetan.pop_* #used here
tib/tib_tibetan.csv
tib/HGDP_Tibetan_Merged_160509_tibetan.indiv* #all Tibetan (for Jeong et al 2017)
tib/HGDP_Tibetan_Merged_160509_tibetan.pop* #all Tibetan (for Jeong et al 2017)
tib/HGDP_Tibetan_Merged_160509.indiv* #all data from Jeong et al 2017
tib/HGDP_Tibetan_Merged_160509.pop* #all data from Jeong et al 2017

Northafrican / Qatari data

qatari/codes.txt
qatari/flip.txt
qatari/keep_snp.txt

merging update files:

pgs/gvar3.names
pgs/update_pos.csv
pgs/merge.csv

NovembreLab / eems-merge