UpalabdhaD / vcfstats

Powerful statistics for VCF files

Home Page:https://pwwang.github.io/vcfstats/

vcfstats - powerful statistics for VCF files

Documentation | CHANGELOG

Motivation

Several tools can plot statistics of VCF files, including bcftools and jvarkit. However, none of them can:

  1. plot specific metrics
  2. customize the plots
  3. focus on variants with certain filters

The R package vcfR can do some of the above. However, it has to load the entire VCF into memory, which makes it a poor fit for large VCF files.

Installation

pip install -U vcfstats

Or run with docker or singularity:

docker run --rm justold/vcfstats:latest vcfstats
# or
singularity run docker://justold/vcfstats:latest vcfstats

Gallery

Number of variants on each chromosome

vcfstats --vcf examples/sample.vcf \
	--outdir examples/ \
	--formula 'COUNT(1) ~ CONTIG' \
	--title 'Number of variants on each chromosome' \
	--config examples/config.toml

Number of variants on each chromosome
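
To make the formula concrete: `COUNT(1) ~ CONTIG` tallies one per record, grouped by contig. The following is a minimal standard-library sketch of that aggregation, not vcfstats' own implementation; the function name `count_by_contig` and the inline records are hypothetical and exist only for illustration.

```python
from collections import Counter
from typing import Iterable


def count_by_contig(lines: Iterable[str]) -> Counter:
    """Tally VCF data records per contig (the first tab-separated column)."""
    counts: Counter = Counter()
    for line in lines:
        # Skip blank lines, meta-information lines (##) and the header (#CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        contig = line.split("\t", 1)[0]
        counts[contig] += 1
    return counts


# Hypothetical records, for illustration only.
SAMPLE_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tT\t50\tPASS\t.",
    "1\t200\t.\tG\tC\t50\tPASS\t.",
    "2\t150\t.\tC\tT\t50\tq10\t.",
])

contig_counts = count_by_contig(SAMPLE_VCF.splitlines())
print(dict(contig_counts))
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of the in-memory list, so records are processed one at a time rather than loaded all at once.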

Changing labels and ticks

vcfstats uses plotnine for plotting. See the plotnine documentation for the functions you can pass to --ggs to modify the plots.

vcfstats --vcf examples/sample.vcf \
	--outdir examples/ \
	--formula 'COUNT(1) ~ CONTIG' \
	--title 'Number of variants on each chromosome (modified)' \
	--config examples/config.toml \
	--ggs 'scale_x_discrete(name ="Chromosome", \
		limits=["1","2","3","4","5","6","7","8","9","10","X"]); \
		ylab("# Variants")'

Number of variants on each chromosome (modified)

Number of variants on the first 5 chromosomes

vcfstats --vcf examples/sample.vcf \
	--outdir examples/ \
	--formula 'COUNT(1) ~ CONTIG[1,2,3,4,5]' \
	--title 'Number of variants on each chromosome (first 5)' \
	--config examples/config.toml
# or
vcfstats --vcf examples/sample.vcf \
	--outdir examples/ \
	--formula 'COUNT(1) ~ CONTIG[1-5]' \
	--title 'Number of variants on each chromosome (first 5)' \
	--config examples/config.toml
# or
# requires the VCF file to be tabix-indexed
vcfstats --vcf examples/sample.vcf \
	--outdir examples/ \
	--formula 'COUNT(1) ~ CONTIG' \
	--title 'Number of variants on each chromosome (first 5)' \
	--config examples/config.toml -r 1 2 3 4 5

Number of variants on each chromosome (first 5)

Number of substitutions of SNPs

vcfstats --vcf examples/sample.vcf \
	--outdir examples/ \
	--formula 'COUNT(1, VARTYPE[snp]) ~ SUBST[A>T,A>G,A>C,T>A,T>G,T>C,G>A,G>T,G>C,C>A,C>T,C>G]' \
	--title 'Number of substitutions of SNPs' \
	--config examples/config.toml

Number of substitutions of SNPs
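
As an illustration of what this formula counts, the sketch below tallies `REF>ALT` substitutions for single-base changes only. It is a simplified stand-in, not vcfstats' code: the helper `count_snp_substitutions`, the rule used to decide what counts as a SNP, and the example pairs are all assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, Tuple

BASES = {"A", "C", "G", "T"}


def count_snp_substitutions(records: Iterable[Tuple[str, str]]) -> Counter:
    """Count REF>ALT substitutions, keeping only single-base changes."""
    counts: Counter = Counter()
    for ref, alt in records:
        # Assumption: a SNP is a record whose REF and ALT are each one base.
        if ref in BASES and alt in BASES and ref != alt:
            counts[f"{ref}>{alt}"] += 1
    return counts


# Hypothetical (REF, ALT) pairs; the insertion A -> AT is skipped.
pairs = [("A", "T"), ("A", "T"), ("G", "C"), ("A", "AT"), ("C", "T")]
subst_counts = count_snp_substitutions(pairs)
print(dict(subst_counts))
```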

Only SNPs that pass all filters

vcfstats --vcf examples/sample.vcf \
	--outdir examples/ \
	--formula 'COUNT(1, VARTYPE[snp]) ~ SUBST[A>T,A>G,A>C,T>A,T>G,T>C,G>A,G>T,G>C,C>A,C>T,C>G]' \
	--title 'Number of substitutions of SNPs (passed)' \
	--config examples/config.toml \
	--passed

Number of substitutions of SNPs (passed)

Alternative allele frequency on each chromosome

# using a dark theme
vcfstats --vcf examples/sample.vcf \
	--outdir examples/ \
	--formula 'AAF ~ CONTIG' \
	--title 'Allele frequency on each chromosome' \
	--config examples/config.toml --ggs 'theme_dark()'

Allele frequency on each chromosome

Using boxplot

vcfstats --vcf examples/sample.vcf \
	--outdir examples/ \
	--formula 'AAF ~ CONTIG' \
	--title 'Allele frequency on each chromosome (boxplot)' \
	--config examples/config.toml \
	--figtype boxplot

Allele frequency on each chromosome

Using density plot/histogram to investigate the distribution

The distribution can be plotted as a density plot or as a histogram.

vcfstats --vcf examples/sample.vcf \
	--outdir examples/ \
	--formula 'AAF ~ CONTIG[1,2]' \
	--title 'Allele frequency on chromosome 1,2' \
	--config examples/config.toml \
	--figtype density

Allele frequency on chromosome 1,2

Overall distribution of allele frequency

vcfstats --vcf examples/sample.vcf \
	--outdir examples/ \
	--formula 'AAF ~ 1' \
	--title 'Overall allele frequency distribution' \
	--config examples/config.toml

Overall allele frequency distribution

Excluding some low/high frequency variants

vcfstats --vcf examples/sample.vcf \
	--outdir examples/ \
	--formula 'AAF[0.05, 0.95] ~ 1' \
	--title 'Overall allele frequency distribution (0.05-0.95)' \
	--config examples/config.toml

Overall allele frequency distribution
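
The range filter above can be pictured as follows. This sketch computes an alternate allele frequency from genotype strings and keeps the sites that fall inside a range. It rests on assumptions: AAF is taken as non-reference alleles divided by called alleles, the bounds are treated as inclusive, and the names `alt_allele_frequency`, `within` and `sites` are hypothetical. vcfstats may define these details differently.

```python
import re
from typing import Dict, List, Optional


def alt_allele_frequency(genotypes: List[str]) -> Optional[float]:
    """Fraction of called alleles that are non-reference (None if no calls)."""
    alt = total = 0
    for gt in genotypes:
        # GT alleles are separated by "/" (unphased) or "|" (phased).
        for allele in re.split(r"[/|]", gt):
            if allele == ".":
                continue  # missing call, not counted
            total += 1
            if allele != "0":
                alt += 1
    return alt / total if total else None


def within(value: Optional[float], low: float, high: float) -> bool:
    """Assumed inclusive range check; sites without calls never match."""
    return value is not None and low <= value <= high


# Hypothetical per-site genotypes for three samples.
sites: Dict[str, List[str]] = {
    "site_a": ["0/0", "0/1", "1/1"],
    "site_b": ["0/0", "0/0", "0/0"],
    "site_c": ["0/1", "./.", "0/0"],
}
kept = [name for name, gts in sites.items()
        if within(alt_allele_frequency(gts), 0.05, 0.95)]
print(kept)
```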

Counting types of variants on each chromosome

vcfstats --vcf examples/sample.vcf \
	--outdir examples/ \
	--formula 'COUNT(1, group=VARTYPE) ~ CHROM' \
	--title 'Types of variants on each chromosome' \
	--config examples/config.toml
# or simply use the shorthand formula:
#	--formula 'VARTYPE ~ CHROM'

Types of variants on each chromosome

Using a pie chart if there is only one chromosome

vcfstats --vcf examples/sample.vcf \
	--outdir examples/ \
	--formula 'COUNT(1, group=VARTYPE) ~ CHROM[1]' \
	--title 'Types of variants on chromosome 1' \
	--config examples/config.toml \
	--figtype pie
# or simply use the shorthand formula:
#	--formula 'VARTYPE ~ CHROM[1]'

Types of variants on chromosome 1

Counting variant types on whole genome

vcfstats --vcf examples/sample.vcf \
	--outdir examples/ \
	--formula 'COUNT(1, group=VARTYPE) ~ 1' \
	--title 'Types of variants on whole genome' \
	--config examples/config.toml
# or simply use the shorthand formula:
#	--formula 'VARTYPE ~ 1'

Types of variants on whole genome

Counting types of mutant genotypes (HET, HOM_ALT) for sample 1 on each chromosome

vcfstats --vcf examples/sample.vcf \
	--outdir examples/ \
	--formula 'COUNT(1, group=GTTYPEs[HET,HOM_ALT]{0}) ~ CHROM' \
	--title 'Mutant genotypes on each chromosome (sample 1)' \
	--config examples/config.toml
# or simply use the shorthand formula:
#	--formula 'GTTYPEs[HET,HOM_ALT]{0} ~ CHROM'

Mutant genotypes on each chromosome
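
For readers unfamiliar with genotype types, the sketch below classifies a GT string and counts HET and HOM_ALT calls per chromosome for the first sample (index 0, matching `{0}` in the formula). The classification rules, the helper names and the records are assumptions for illustration; this is not how vcfstats itself derives genotype types.

```python
import re
from collections import Counter
from typing import List, Tuple


def genotype_type(gt: str) -> str:
    """Classify a GT string such as '0/1' or '1|1'."""
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return "UNKNOWN"  # at least one allele is missing
    if len(set(alleles)) > 1:
        return "HET"  # assumption: any two different alleles count as HET
    return "HOM_REF" if alleles[0] == "0" else "HOM_ALT"


def count_mutant_genotypes(records: List[Tuple[str, List[str]]],
                           sample_index: int = 0) -> Counter:
    """Count (chrom, genotype type) pairs, keeping HET and HOM_ALT only."""
    counts: Counter = Counter()
    for chrom, genotypes in records:
        gt_type = genotype_type(genotypes[sample_index])
        if gt_type in ("HET", "HOM_ALT"):
            counts[(chrom, gt_type)] += 1
    return counts


# Hypothetical (chrom, per-sample GT) records.
records = [
    ("1", ["0/1", "0/0"]),
    ("1", ["1/1", "0/1"]),
    ("2", ["0/0", "1/1"]),
    ("2", ["0|1", "./."]),
]
mutant_counts = count_mutant_genotypes(records)
print(dict(mutant_counts))
```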

Exploration of mean(genotype quality) and mean(depth) on each chromosome for sample 1

vcfstats --vcf examples/sample.vcf \
	--outdir examples/ \
	--formula 'MEAN(GQs{0}) ~ MEAN(DEPTHs{0}, group=CHROM)' \
	--title 'GQ vs depth (sample 1)' \
	--config examples/config.toml

GQ vs depth (sample 1)
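
Here both sides of the formula are aggregated per chromosome, so each chromosome contributes one point to the plot. A rough sketch of that grouping, assuming per-record (chromosome, GQ, depth) values for a single sample; `mean_by_chrom` and the numbers are hypothetical and not taken from vcfstats:

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def mean_by_chrom(
    rows: Iterable[Tuple[str, float, float]]
) -> Dict[str, Tuple[float, float]]:
    """Return {chrom: (mean GQ, mean depth)} for (chrom, gq, depth) rows."""
    gqs: Dict[str, List[float]] = defaultdict(list)
    depths: Dict[str, List[float]] = defaultdict(list)
    for chrom, gq, depth in rows:
        gqs[chrom].append(gq)
        depths[chrom].append(depth)
    # One (x, y) pair per chromosome: mean depth and mean genotype quality.
    return {chrom: (mean(gqs[chrom]), mean(depths[chrom])) for chrom in gqs}


# Hypothetical values for one sample.
rows = [("1", 30, 10), ("1", 50, 20), ("2", 40, 12)]
points = mean_by_chrom(rows)
print(points)
```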

Exploration of depths for samples 1 and 2

vcfstats --vcf examples/sample.vcf \
	--outdir examples/ \
	--formula 'DEPTHs{0} ~ DEPTHs{1}' \
	--title 'Depths between sample 1 and 2' \
	--config examples/config.toml

Depths between sample 1 and 2

See more examples:

pwwang#15 (comment)
