NCBench continuous small variants benchmarking workflow.

A Snakemake workflow for benchmarking callsets of small genomic variants against popular benchmark datasets such as Genome in a Bottle or CHM-eval. A detailed description of the workflow, including the insights and design decisions behind it, can be found at https://doi.org/10.12688/f1000research.140344.1.

Contributing callsets

Download raw data:

Germline:

dataset	link
NA12878 Agilent (75M and 200M reads):
NA12878 Twist (restricted access but you can ask for it via the zenodo interface):
CHM:

Somatic:

dataset	SRA ID tumor fastq	SRA ID normal fastq
SEQC2 WES	SRR7890918	SRR7890919
SEQC2 WGS	SRR7890893	SRR7890943
SEQC2 FFPE	SRR7890933	SRR7890951
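
As a rough illustration of fetching the somatic reads, the following sketch builds one fasterq-dump command per sample from the SRA run accessions listed above. It assumes that sra-tools is installed and that fasterq-dump is the tool you want to use; the README does not prescribe a download method. The function name and the output directory layout are hypothetical.

```python
import shlex

# SRA run accessions copied from the somatic dataset table.
SEQC2_RUNS = {
    "SEQC2 WES": {"tumor": "SRR7890918", "normal": "SRR7890919"},
    "SEQC2 WGS": {"tumor": "SRR7890893", "normal": "SRR7890943"},
    "SEQC2 FFPE": {"tumor": "SRR7890933", "normal": "SRR7890951"},
}


def fasterq_dump_commands(dataset, outdir="raw"):
    """Return one fasterq-dump command line per sample of the given dataset.

    The output layout (<outdir>/tumor, <outdir>/normal) is an assumption
    made for this sketch, not something ncbench requires.
    """
    runs = SEQC2_RUNS[dataset]
    return [
        shlex.join(
            ["fasterq-dump", accession, "--split-files", "--outdir", f"{outdir}/{sample}"]
        )
        for sample, accession in runs.items()
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for command in fasterq_dump_commands("SEQC2 WES"):
        print(command)
```

Printing the commands rather than executing them keeps the sketch free of side effects; paste them into a shell once they look right.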

Run your pipeline on it.
Upload results (VCF or BCF) to Zenodo.

Create a pull request that adds your results to the config file under variant-calls, following this structure:

my-callset: # choose a descriptive name for your callset
 labels:
   site: # name of your institute, group, department etc.
   pipeline: # name of the pipeline
   trimming: # tool used to trim reads
   read-mapping: # used read mapper
   base-quality-recalibration: # base recalibration method (remove if unused)
   realignment: # realignment method (remove if unused)
   variant-detection: # variant callers (provide comma-separated list if multiple ones are used)
   genotyping: # genotyper/event-typer used
   url: # URL of used pipeline
   # add any additional relevant attributes (they will appear in the false positive and false negative tables of the online report)
 subcategory: # category of callsets to include this one (see other entries in the config file and align with them if possible)
 zenodo:
   deposition: # zenodo record id (e.g. 7734975)
   filename: # name of vcf/bcf/vcf.gz file in the zenodo record
 benchmark: # benchmark to use (one of giab-NA12878-agilent-200M, giab-NA12878-agilent-75M, giab-NA12878-twist, and more, see https://github.com/snakemake-workflows/dna-seq-benchmark/blob/main/workflow/resources/presets.yaml)
 rename-contigs: resources/rename-contigs/ucsc-to-ensembl.txt # rename contigs from UCSC (prefixed with chr) to Ensembl style (remove if your contigs are already in Ensembl style)
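
Before opening a pull request, it can help to sanity-check the entry. The sketch below is a hypothetical helper, not part of ncbench: it takes the entry as a plain dictionary (for example, as loaded from the YAML config) and reports missing fields. Which labels count as expected is an assumption derived from the template above, where only base-quality-recalibration and realignment are marked as removable. The values in the example entry are placeholders.

```python
# Labels from the template that are not marked "remove if unused".
# Treating them as expected is an assumption of this sketch.
EXPECTED_LABELS = (
    "site", "pipeline", "trimming", "read-mapping",
    "variant-detection", "genotyping", "url",
)


def check_callset_entry(entry):
    """Return a list of problems found in a callset entry (empty if none)."""
    problems = []
    for key in ("labels", "subcategory", "zenodo", "benchmark"):
        if not entry.get(key):
            problems.append(f"missing or empty key: {key}")
    labels = entry.get("labels") or {}
    for label in EXPECTED_LABELS:
        if not labels.get(label):
            problems.append(f"missing or empty label: {label}")
    zenodo = entry.get("zenodo") or {}
    for key in ("deposition", "filename"):
        if not zenodo.get(key):
            problems.append(f"missing or empty zenodo key: {key}")
    filename = str(zenodo.get("filename") or "")
    if filename and not filename.endswith((".vcf", ".vcf.gz", ".bcf")):
        problems.append(f"unexpected file extension: {filename}")
    return problems


# Placeholder entry; every value here is made up for illustration.
example_entry = {
    "labels": {
        "site": "my-institute",
        "pipeline": "my-pipeline",
        "trimming": "my-trimmer",
        "read-mapping": "my-mapper",
        "variant-detection": "caller-a,caller-b",
        "genotyping": "my-genotyper",
        "url": "https://example.org/my-pipeline",
    },
    "subcategory": "my-subcategory",
    "zenodo": {"deposition": 7734975, "filename": "calls.vcf.gz"},
    "benchmark": "giab-NA12878-agilent-200M",
}

if __name__ == "__main__":
    for problem in check_callset_entry(example_entry):
        print(problem)
```

The check only covers the shape of the entry; whether the benchmark name is valid is still decided by the presets file linked in the template.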

The pull request will be executed automatically with the ncbench workflow, and you can download the resulting report with the assessment of your callset as an artifact from the GitHub Actions CI interface.
Once the pull request has been reviewed and merged, your results will appear in the online report at https://ncbench.github.io.
If your callset receives an update, update your Zenodo record and create a new pull request that updates the Zenodo record ID in your config entry.

Checking out results

The latest results for all contributed callsets are shown at https://ncbench.github.io.

Running ncbench locally

To run ncbench locally, follow these steps:

Install Mamba and Snakemake.
Clone this git repository.

Adapt the configuration to your needs (e.g. add your own callset, and optionally remove all other callsets if you are only interested in your own). When adding your own callset, you can refer either to a Zenodo record or, which is probably more useful in the local case, to a local path. The following is a minimal entry for evaluating a local callset, to be added to the variant-calls section of config/config.yaml in your local clone:

my-callset: # choose a descriptive name for your callset
 path: # path to vcf/bcf/vcf.gz file containing your variant calls (both SNVs and indels, sorted by coordinate)
 benchmark: # benchmark to use (one of giab-NA12878-agilent-200M, giab-NA12878-agilent-75M, giab-NA12878-twist, and more, see https://github.com/snakemake-workflows/dna-seq-benchmark/blob/main/workflow/resources/presets.yaml)
 rename-contigs: resources/rename-contigs/ucsc-to-ensembl.txt # rename contigs from UCSC (prefixed with chr) to Ensembl style (remove if your contigs are already in Ensembl style)
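
Whether the rename-contigs line is needed depends on how the contigs in your callset are named. As a rough aid, the following sketch inspects the ##contig lines of a VCF header and reports whether any contig carries the UCSC chr prefix. The function names are hypothetical, the sketch handles plain and gzip/bgzip-compressed VCF only (not BCF, which is binary), and a header without ##contig lines yields no answer either way.

```python
import gzip
import re

# Matches header lines such as: ##contig=<ID=chr1,length=248956422>
CONTIG_ID = re.compile(r"^##contig=<ID=([^,>]+)")


def contig_names(header_lines):
    """Collect contig IDs declared in the VCF header."""
    names = []
    for line in header_lines:
        if not line.startswith("#"):
            break  # the header ends at the first variant record
        match = CONTIG_ID.match(line)
        if match:
            names.append(match.group(1))
    return names


def needs_ucsc_to_ensembl_rename(header_lines):
    """Return True if any declared contig uses the UCSC 'chr' prefix."""
    return any(name.startswith("chr") for name in contig_names(header_lines))


def read_vcf_lines(path):
    """Yield text lines from a plain or gzip/bgzip-compressed VCF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        yield from handle
```

For a local file, `needs_ucsc_to_ensembl_rename(read_vcf_lines("calls.vcf.gz"))` would give the answer, where `calls.vcf.gz` stands in for your own path.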

Run the workflow, first in dry-run mode with snakemake -n --sdm conda and then for real with snakemake --sdm conda --cores N, where N is your desired number of cores. You can also run it on cluster or cloud middleware; the Snakemake documentation provides the details.
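
If you prefer to script the two invocations, a minimal sketch could look like the following. It only wraps the two commands given above in subprocess calls; the function names and the workdir parameter are hypothetical, and it assumes snakemake is on your PATH and that you run it from within your local clone.

```python
import subprocess


def snakemake_commands(cores):
    """Build the dry-run and real-run command lines described in the text."""
    dry_run = ["snakemake", "-n", "--sdm", "conda"]
    real_run = ["snakemake", "--sdm", "conda", "--cores", str(cores)]
    return dry_run, real_run


def run_workflow(cores, workdir="."):
    """Run the dry-run first; start the real run only if it succeeds."""
    dry_run, real_run = snakemake_commands(cores)
    # check=True raises CalledProcessError on a non-zero exit status,
    # so a failing dry-run prevents the real run from starting.
    subprocess.run(dry_run, cwd=workdir, check=True)
    subprocess.run(real_run, cwd=workdir, check=True)
```

Calling `run_workflow(8)` would use eight cores. Chaining the calls this way means configuration mistakes surface in the cheap dry-run before any compute is spent.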
