sephraim / convert_MacArthur_ClinVar_TSV_to_VCF

Download the latest release of the MacArthur Lab's version of ClinVar in TSV format and convert it to a VCF file

Convert MacArthur ClinVar TSV to VCF

This script (1) downloads the latest release of the MacArthur Lab's version of ClinVar in TSV format and (2) converts it to a VCF file.

Output

The output is a VCF file, compressed with bgzip and indexed with tabix when both tools are available. The MacArthur Lab splits and left-aligns the variants in their TSV file, so the output VCF is split and left-aligned as well. There is no need for further normalization with BCFtools.

The INFO column will contain the following tags:

  • CLINVAR_CLNALLELE - Tells you which allele, REF or ALT, is the one to which the annotations (e.g. pathogenic assertion) refer
  • CLINVAR_VID - Variation ID; unique identifier for the set of sequence changes that were interpreted; access online at ncbi.nlm.nih.gov/clinvar/variation/{VID}
  • CLINVAR_HGNC - HGNC gene symbol
  • CLINVAR_CLNSIG - Clinical significance (e.g. "Pathogenic", "Likely pathogenic", etc.)
  • CLINVAR_REVSTAT - Clinical review status
  • CLINVAR_HGVS_C - HGVS cDNA name
  • CLINVAR_HGVS_P - HGVS protein name
  • CLINVAR_SUBMITTERS - The names of clinical review submitters
  • CLINVAR_DISEASE - Variant disease name(s)
  • CLINVAR_PMID - Related PubMed IDs
  • CLINVAR_PATHOGENIC - Has this variant been asserted 'Pathogenic' or 'Likely pathogenic' by any submitter for any phenotype? 1 - Yes, 0 - No
  • CLINVAR_CONFLICTED - Has this variant ever been asserted 'Pathogenic' or 'Likely pathogenic' by any submitter for any phenotype and also been asserted 'Benign' or 'Likely benign' by any submitter for any phenotype? 1 - Yes, 0 - No; Note that having one assertion of pathogenic and one of uncertain significance does not count as conflicted for this column
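
As a rough illustration of how the two flag tags above might be used downstream, the sketch below keeps only records flagged pathogenic and not conflicted. It is not part of this repository: the function name filter_clinvar_pathogenic is hypothetical, and the sketch assumes the flags are written literally as CLINVAR_PATHOGENIC=1 and CLINVAR_CONFLICTED=1 in the INFO column of an uncompressed VCF read from standard input.

```shell
# Hypothetical helper (not part of this repository).
# Assumption: flags appear in INFO as "CLINVAR_PATHOGENIC=1" / "CLINVAR_CONFLICTED=1".
filter_clinvar_pathogenic() {
  awk -F'\t' '
    /^#/ { print; next }           # pass header lines through unchanged
    {
      pathogenic = 0; conflicted = 0
      n = split($8, tags, ";")     # INFO is the 8th VCF column
      for (i = 1; i <= n; i++) {
        if (tags[i] == "CLINVAR_PATHOGENIC=1") pathogenic = 1
        if (tags[i] == "CLINVAR_CONFLICTED=1") conflicted = 1
      }
      if (pathogenic && !conflicted) print
    }'
}

# Example (the file name is a placeholder):
#   gzip -dc clinvar.vcf.gz | filter_clinvar_pathogenic > pathogenic_only.vcf
```

Matching whole KEY=VALUE tokens, rather than grepping for a substring, avoids accidental hits inside other tag values.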

Usage

Simply run:

./get_newest_ClinVar_from_MacArthur.sh

What is info_tag_map.txt?

This is a map file that get_newest_ClinVar_from_MacArthur.sh automatically looks for when converting the original TSV file to a VCF file. It contains 3 tab-separated columns:

  • Column 1: Names of the columns in the original TSV file
  • Column 2: Descriptions to use for the corresponding INFO tag in the output VCF file
  • Column 3: Preferred names of the respective INFO tags in the output VCF file

For example:

 symbol      HGNC gene symbol    HGNC
 hgvs_c      HGVS cDNA name      HGVS_C
 hgvs_p      HGVS protein name   HGVS_P
 all_pmids   PubMed IDs          PMID
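
To make the role of the three columns concrete, here is a minimal sketch of how a map like this could be turned into VCF header lines. It is only an illustration, not the code the script actually runs: the function name map_to_info_headers is hypothetical, and Number=. and Type=String are assumed values that may differ from what the real output uses. The CLINVAR_ prefix follows the naming of the tags listed under Output.

```shell
# Hypothetical helper (not part of this repository).
# Reads a 3-column, tab-separated map on stdin:
#   TSV column name <TAB> description <TAB> preferred INFO tag name
# Assumption: Number=. and Type=String are placeholders chosen for this sketch.
map_to_info_headers() {
  awk -F'\t' 'NF >= 3 {
    printf "##INFO=<ID=CLINVAR_%s,Number=.,Type=String,Description=\"%s\">\n", $3, $2
  }'
}

# Example:
#   map_to_info_headers < info_tag_map.txt
```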

All INFO tags will automatically be prefixed with the CLINVAR_ string when the VCF file is written. If a

Handling illegal VCF characters

The VCF format does not allow spaces, semicolons, or equals signs in any INFO field value. Additionally, commas are reserved for separating allele-specific values. Therefore, these characters are URL-encoded before they are written to the output VCF.

 Illegal VCF character   URL-encoding
 space                   %20
 ,                       %2C
 ;                       %3B
 =                       %3D
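
The substitutions above can be sketched with sed. This is an illustrative stand-in rather than the script's own implementation, and the function name url_encode_info is hypothetical. It encodes only the four listed characters; a literal % is left untouched, so the result is not guaranteed to decode unambiguously.

```shell
# Hypothetical helper (not part of this repository).
# Encodes only the four characters from the table: space, comma, semicolon, equals.
url_encode_info() {
  printf '%s\n' "$1" | sed -e 's/ /%20/g' \
                           -e 's/,/%2C/g' \
                           -e 's/;/%3B/g' \
                           -e 's/=/%3D/g'
}

# Example:
#   url_encode_info 'Likely pathogenic'
```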

Requirements

  • tab2vcf must be in your $PATH (download it here). VCF conversion will be skipped if this is not found.
  • bgzip and tabix must both be in your $PATH (download with the HTSlib package here). If one or both of these are not found, then compression and indexing will be skipped.
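
A quick way to see in advance which steps would be skipped on your machine is to check for the tools yourself. The sketch below uses the POSIX command -v builtin; the helper name have is hypothetical, and the script's own detection logic may differ.

```shell
# Hypothetical helper (not part of this repository):
# succeeds if the named command can be found in $PATH.
have() {
  command -v "$1" >/dev/null 2>&1
}

if have tab2vcf; then
  echo "tab2vcf found: VCF conversion can run"
else
  echo "tab2vcf not found: VCF conversion would be skipped"
fi

if have bgzip && have tabix; then
  echo "bgzip and tabix found: output can be compressed and indexed"
else
  echo "bgzip and/or tabix not found: compression and indexing would be skipped"
fi
```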

Author

Sean Ephraim
