warthmann / erlich-sample-ID-files

Identify samples from Oxford Nanopore data: prepare files needed by TeamErlich/personal-identification-pipeline


Here we provide scripts that take a multi-sample VCF file and prepare the input files needed by the TeamErlich/personal-identification-pipeline.

TeamErlich published a software pipeline to quickly (re-)identify samples from Oxford Nanopore data given a sample database: https://github.com/TeamErlich/personal-identification-pipeline

Sophie Zaaijer, Assaf Gordon, Daniel Speyer, Robert Piccone, Simon Cornelis Groen, Yaniv Erlich (2017). Rapid re-identification of human samples using portable DNA sequencing. eLife 6:e27798. https://doi.org/10.7554/eLife.27798

Input files for Arabidopsis

Our use case is Arabidopsis thaliana, using the 1,135 sequenced accessions from the 1001genomes project.

1001 Genomes Consortium (2016). 1,135 Genomes Reveal the Global Pattern of Polymorphism in Arabidopsis thaliana. Cell 166(2):481-491. https://doi.org/10.1016/j.cell.2016.05.063

The 1001genomes project provides a multi-sample VCF file that contains variants for 1,135 Arabidopsis accessions, called against the TAIR10 genome.

TAIR10 genome directory at TAIR


file: TAIR10_chr_all.fas.gz

VCF from 1001 genomes project


file: 1001genomes_snp-short-indel_only_ACGTN.vcf.gz

We preprocess the VCF file with bcftools to split multiallelic variants into multiple lines, one per alternate allele:

bcftools norm -m - <vcf file>
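
To illustrate what splitting a multiallelic site means, here is a minimal Python sketch. It is not part of this repository and it is not how bcftools is implemented. The function name split_multiallelic is hypothetical, and recoding alleles that point at a different ALT to the reference (0) is an assumption of this sketch; the exact bcftools behaviour may depend on version and options.

```python
def split_multiallelic(chrom, pos, ref, alts, genotype):
    """Split one multiallelic site into one biallelic record per ALT allele.

    genotype is a list of allele indices for a single sample
    (0 = REF, 1..n = ALT, None = missing call).
    Assumption of this sketch: alleles that point at a different ALT
    are recoded to 0 (reference) in each split record.
    """
    records = []
    for i, alt in enumerate(alts, start=1):
        recoded = [None if a is None else (1 if a == i else 0) for a in genotype]
        records.append((chrom, pos, ref, alt, recoded))
    return records


# A site with two ALT alleles and a sample carrying one copy of each
for record in split_multiallelic("1", 1000, "A", ["C", "T"], [1, 2]):
    print(record)
```

Each returned record has exactly one ALT allele, which keeps the per-sample conversion in the next step simple.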

Parse the resulting file for the genotypes and split it into individual .txt files in 23andme format, one for each sample. The task is parallelised with GNU parallel. Adjust the number of threads in the script!

bash produce23andmefile.parallel.sh <file with list of sample names> <output directory> <input vcf/bcf file>
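
For orientation, the conversion of a single record can be sketched in Python as below. This is an illustration only and may differ from what produce23andmefile.parallel.sh actually does. The function name vcf_to_23andme, the "." placeholder ID, and the choice to skip anything that is not a single-base SNP are assumptions of this sketch. The output follows the usual 23andme layout of four tab-separated columns: ID, chromosome, position, genotype.

```python
def vcf_to_23andme(vcf_line, sample_index, snp_id="."):
    """Convert one biallelic VCF data line into a 23andme-style row for one sample.

    Returns None for records that are not single-base SNPs (assumption of
    this sketch). Missing calls are written as "--".
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, _vcf_id, ref, alt = fields[:5]
    if len(ref) != 1 or len(alt) != 1 or not set(ref + alt) <= set("ACGTN"):
        return None
    format_keys = fields[8].split(":")
    sample_values = fields[9 + sample_index].split(":")
    # Treat phased (|) and unphased (/) genotypes the same way
    gt = sample_values[format_keys.index("GT")].replace("|", "/").split("/")
    if "." in gt:
        genotype = "--"
    else:
        alleles = [ref, alt]
        genotype = "".join(alleles[int(a)] for a in gt)
    return "\t".join([snp_id, chrom, pos, genotype])


# Hypothetical VCF line with three samples
line = "1\t55\t.\tC\tT\t40\tPASS\t.\tGT:GQ\t1|1:40\t0|0:35\t./.:."
for index in range(3):
    print(vcf_to_23andme(line, index))
```

The placeholder ID is replaced by a real one in the following step.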

Build a list of all variants, assign unique IDs, and augment the 23andme files by adding the ID to each variant. Adjust paths and filenames in the script!

python3 generate-dict-and-SNP-IDs.py 
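
The idea behind this step can be sketched as follows. This is a simplified illustration, not the code in generate-dict-and-SNP-IDs.py. The function name assign_variant_ids, the "AT" ID prefix, the sample names, and keying variants by chromosome and position only are all assumptions made for the example.

```python
def assign_variant_ids(sample_rows, prefix="AT"):
    """Assign one unique ID per variant position and add it to every sample row.

    sample_rows maps a sample name to a list of (chrom, pos, genotype) tuples.
    Returns (ids, augmented): ids maps (chrom, pos) to its ID, and augmented
    maps each sample name to rows of (id, chrom, pos, genotype).
    """
    # Union of all positions seen in any sample, in a stable sorted order
    positions = sorted({(c, p) for rows in sample_rows.values() for c, p, _ in rows})
    ids = {key: f"{prefix}{n:08d}" for n, key in enumerate(positions, start=1)}
    augmented = {
        name: [(ids[(c, p)], c, p, g) for c, p, g in rows]
        for name, rows in sample_rows.items()
    }
    return ids, augmented


# Two hypothetical samples sharing one position
rows = {
    "acc_A": [("1", 55, "TT"), ("2", 10, "GG")],
    "acc_B": [("1", 55, "CC"), ("1", 90, "AA")],
}
ids, augmented = assign_variant_ids(rows)
print(ids)
print(augmented["acc_A"])
```

Because the IDs come from one shared dictionary, the same variant gets the same ID in every sample file, which is what allows samples to be compared against each other.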

Fast5 files for testing

Sequencing data from a whole-genome ONT (rapid) library of Arabidopsis thaliana, sequenced on a Flongle:

wget https://bss1innov1nafa1poc1.blob.core.windows.net/sample-container/Data-for-github/At-WGS.tar.xz


License:GNU General Public License v3.0


Languages: Python 73.2%, Shell 26.8%