Jingwen7 / VAT-Variant-Annotation-Tool

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

VAT: Variant Annotation Tool

VAT: Variant Annotation Tool

VAT is a variant annotation tool that parses VCF files and fetches variant information from Ensembl Variant Effect Predictor (VEP).
Report Bug · Request Feature

Table of Contents

Introduction

VAT is a variant annotation tool that parses VCF files, filters variants that failed filters, annotates the following information and outputs annotation to a tsv file.

  1. Depth of sequence coverage at the site of variation.
  2. Number of reads supporting the variant.
  3. Percentage of reads supporting the variant versus those supporting reference reads.
  4. Gene of the variant, type of variation (substitution, insertion, deleteion, CNV, etc.) and their effect (missense, silent, intergenic, etc.). The API documentation is available here: https://rest.ensembl.org/#VEP
  5. The minor allele and minor allele frequency of the variant if available.
  6. Additional annotations: genotype, rsid, minor_allele_freq, minor_allele, clin_sig, transcript_id, gene_id, impact, gene_symbol, biotype, polyphen_prediction, and sift_prediction. NOTE: there might be multiple transcript_id, gene_id, gene_symbol, polyphen_prediction, sift_prediction for each variant.

Getting Started

To install VAT, python (>= 3.10), vcfpy, requests, json5 and jsonschema are required.

Prerequisites and Installation

  1. Clone the repo
git clone https://github.com/Jingwen7/VAT-Variant-Annotation-Tool.git
cd VAT-Variant-Annotation-Tool/
  1. Create conda env and install the prerequisites
conda create --name vat python=3.10
conda activate vat
pip install -r requirements.txt

Usage

python3 vat.py [-h] -i [INPUT_FILE_PATH] -o [OUTPUT_FILE_PATH]

Example

python3 vat.py -i data/test_vcf_data.vcf -o data/out.tsv

When running vat.py, you shall see the following message:

start fetching VEP API with 15 processors....
finish fetching VEP API....

Input format

A typical input vcf file is provided: https://github.com/Jingwen7/VAT-Variant-Annotation-Tool/blob/main/data/test_vcf_data.vcf

Output format

VAT outputs the annotation of variants to a tsv file. Each row stores the annotation of a variant.

The following shows an example of a tsv file

CHROM	POS	REF	ALT	GT	TYPE	total_read_depth	ref_read_depth	alt_read_depth	Ratio_supporting_reads_alt_vs_ref	rsid	most_severe_consequence	minor_allele_freq	minor_allele	clin_sig	transcript_id	gene_id	gene_symbol	impact	biotype	polyphen_prediction	sift_prediction
1	11090916	C	A	1/1	123	1	122	122.0	COSV68896102	missense_variant	NA	NA	NA	ENST00000400897,ENST00000607145	ENSG00000009724,ENSG00000271895	MASP2,RP4-635E18.8	MODERATE,MODIFIER	protein_coding,antisense	possibly_damaging,NA	deleterious,NA
1	11087524	G	A	1/1	156	2	154	77.0	rs1782455	synonymous_variant	0.3125	G	['benign']	ENST00000240185,ENST00000315091,ENST00000400897,ENST00000439080,ENST00000473869,ENST00000477447,ENST00000480464,ENST00000496840,ENST00000607145	ENSG00000120948,ENSG00000120948,ENSG00000009724,ENSG00000120948,ENSG00000120948,ENSG00000120948,ENSG00000120948,ENSG00000120948,ENSG00000271895	TARDBP,TARDBP,MASP2,TARDBP,TARDBP,TARDBP,TARDBP,TARDBP,RP4-635E18.8	MODIFIER,MODIFIER,LOW,MODIFIER,MODIFIER,MODIFIER,MODIFIER,MODIFIER,MODIFIER	protein_coding,protein_coding,protein_coding,protein_coding,nonsense_mediated_decay,nonsense_mediated_decay,processed_transcript,nonsense_mediated_decay,antisense	NA,NA,NA,NA,NA,NA,NA,NA,NA	NA,NA,NA,NA,NA,NA,NA,NA,NA

The following shows the decription of each column in output tsv file.

Column Name Description
CHROM -
POS -
REF -
ALT -
TYPE Type of the variant (ins, del, snp)
GT Genotype of the variant
total_read_depth Total number of reads at this site
ref_read_depth Total number of reads containing REF allele
alt_read_depth Total number of reads contains ALT alleles
Ratio_supporting_reads_alt_vs_ref Ratio of number of reads supporting ALT versus those supporting REF allele.
rsid Unique label of snp
most_severe_consequence -
minor_allele_freq -
minor_allele -
clin_sig Interpretations of variant's significance to disease (https://www.ncbi.nlm.nih.gov/clinvar/)
transcript_id -
gene_id -
gene_symbol -
impact Simple assessment of the putative impact of the variant based on SnpEff
biotype Gene biotype
polyphen_prediction Potential pathogenicity of a variant
sift_prediction Prediction of whether an amino acid substitution is likely to affect protein function

About


Languages

Language:Python 100.0%