czakarian / svpop

Variant annotation and merging pipeline

SV-Pop Pipeline

SV-Pop is an on-demand toolkit for parsing variants into tables (BED files), annotating variants, and merging across tools and/or samples.

Several annotations are built into SV-Pop, such as gene intersect or nearest gene (RefSeq), regulatory element intersects (ENCODE cCRE, histone marks, or DHS), annotated reference regions (chromosome band, simple repeats, segmental duplications, homopolymer intersects, and dinucleotide intersects), and features run on variant sequences (RepeatMasker, TRF, GC content, insertion sequence mapping location). Many of these annotations are pulled automatically from the UCSC browser, generated by SV-Pop itself (e.g. homopolymer sites), or produced by running an external tool (RepeatMasker, TRF, minimap2).

A single analysis directory can accept variants from many sources (e.g. callers such as PAV, pbsv, or DeepVariant, most VCFs, or any pre-formatted BED table), and each source can have any number of samples. Variants are transformed to a consistent BED+6 format that SV-Pop recognizes (#CHROM, POS, END, ID, SVTYPE, SVLEN, + any number of other fields). Variants can be annotated (e.g. RefSeq intersects), filtered (e.g. confident loci), and merged.
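As a rough illustration of the table layout, the following sketch reads a tab-separated variant table and checks that the six leading columns named above are present. It is not SV-Pop code: the function name `read_variant_bed`, the example records, their ID strings, and the extra `GT` column are all hypothetical.

```python
import csv
import io

# Leading columns of the variant table, as listed in the paragraph above.
REQUIRED_COLUMNS = ["#CHROM", "POS", "END", "ID", "SVTYPE", "SVLEN"]


def read_variant_bed(handle):
    """Read a tab-separated variant table and check its first six columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    if header[:6] != REQUIRED_COLUMNS:
        raise ValueError(
            f"Expected leading columns {REQUIRED_COLUMNS}, found {header[:6]}"
        )
    rows = []
    for row in reader:
        # Coordinates and length become integers; other fields stay as text.
        for key in ("POS", "END", "SVLEN"):
            row[key] = int(row[key])
        rows.append(row)
    return rows


# Hypothetical two-record table with one extra annotation column (GT).
EXAMPLE_BED = (
    "#CHROM\tPOS\tEND\tID\tSVTYPE\tSVLEN\tGT\n"
    "chr1\t1000\t1001\texample-ins-1\tINS\t120\t0|1\n"
    "chr2\t5000\t5300\texample-del-1\tDEL\t300\t1|1\n"
)

variants = read_variant_bed(io.StringIO(EXAMPLE_BED))
print([v["ID"] for v in variants])
```

For a gzipped file on disk, the same function could be given a handle from `gzip.open(path, "rt")`.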

Supported variant types are:

vartype  svtype         description
snv      snv            SNPs, SNVs
indel    ins, del       size < 50 bp
sv       ins, del, inv  size ≥ 50 bp
dup      dup            Duplication
sub      sub            Substitution (ref bases replaced)
rgn      rgn            Any region

There are several combined svtypes:

  1. insdel: ins and del
  2. insdelinv: ins, del, and inv
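The relationship between vartype, svtype, and size can be sketched as below. This is only an illustration of the table and the combined svtypes, not SV-Pop's implementation. The function names are hypothetical, and two behaviors are assumptions: every inversion is treated as an sv regardless of size, and the absolute value of the length is used in case an input reports deletions with a negative length.

```python
# Size boundary between indels and SVs, taken from the table above.
SV_MIN_LENGTH = 50

# Combined svtypes and the svtypes they stand for.
COMBINED_SVTYPES = {
    "insdel": ["ins", "del"],
    "insdelinv": ["ins", "del", "inv"],
}


def vartype_for(svtype, svlen):
    """Return the vartype for a variant given its svtype and length."""
    svtype = svtype.lower()
    if svtype in ("snv", "dup", "sub", "rgn"):
        # For these types the vartype and svtype share the same name.
        return svtype
    if svtype == "inv":
        # Assumption: inversions are always reported under "sv".
        return "sv"
    if svtype in ("ins", "del"):
        # Assumption: abs() guards against negative deletion lengths.
        return "sv" if abs(svlen) >= SV_MIN_LENGTH else "indel"
    raise ValueError(f"Unknown svtype: {svtype}")


def expand_svtype(svtype):
    """Expand a combined svtype into its members; others map to themselves."""
    return COMBINED_SVTYPES.get(svtype, [svtype])


print(vartype_for("ins", 49))      # indel
print(vartype_for("del", 50))      # sv
print(expand_svtype("insdelinv"))  # ['ins', 'del', 'inv']
```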

There are two types of merges, callerset and sampleset. A callerset merge takes multiple callers for a single sample and merges them into a consensus callset. A sampleset merge takes a single source (a single caller or a single callerset) and merges it across multiple samples (e.g. a nonredundant callset across samples). The merging process is flexible and can use a combination of size, position, matching REF/ALT fields (SNVs), and sequence identity (defaults to 80% identical). Merges of merges are possible, for example, a callerset merge followed by a sampleset merge, but they are generally not advisable because they severely complicate analysis. SV-Pop also supports intersecting variant sources, which can be used to provide support for a callset without explicitly merging support into the callset itself. Merges are always done on the same variant type (vartype & svtype).
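To make the idea of matching by size and position concrete, here is a minimal sketch of a pairwise test. It is not SV-Pop's merge algorithm, which is configured separately and can also use REF/ALT and sequence identity. The function name `is_merge_candidate` and both thresholds (`max_offset`, `min_size_ratio`) are hypothetical values chosen only for illustration, and lengths are assumed to be positive.

```python
def is_merge_candidate(a, b, max_offset=200, min_size_ratio=0.8):
    """Decide whether two variant records look like the same event.

    Records are dicts with "#CHROM", "POS", "SVTYPE", and a positive "SVLEN".
    Both thresholds are illustrative assumptions, not SV-Pop defaults.
    """
    # Only variants of the same type on the same chromosome are compared.
    if a["#CHROM"] != b["#CHROM"] or a["SVTYPE"] != b["SVTYPE"]:
        return False

    # Position: start coordinates must be within max_offset bases.
    offset = abs(a["POS"] - b["POS"])

    # Size: the shorter variant must be a large fraction of the longer one.
    size_ratio = min(a["SVLEN"], b["SVLEN"]) / max(a["SVLEN"], b["SVLEN"])

    return offset <= max_offset and size_ratio >= min_size_ratio


call_a = {"#CHROM": "chr1", "POS": 1000, "SVTYPE": "INS", "SVLEN": 100}
call_b = {"#CHROM": "chr1", "POS": 1050, "SVTYPE": "INS", "SVLEN": 90}
call_c = {"#CHROM": "chr1", "POS": 1050, "SVTYPE": "DEL", "SVLEN": 90}

print(is_merge_candidate(call_a, call_b))  # True
print(is_merge_candidate(call_a, call_c))  # False
```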

The variant ID for any source (sample + caller, or merge) is unique. IDs are tracked so that any variant can be traced back to its source. SV-Pop uses this to pull annotations from the original source through a merge. It can also be used by downstream analysis.

Along with the variant BED files, there is also a FASTA file for variant sequences. These sequences are optional except for tasks that explicitly need them (e.g. RepeatMasker on variant sequences or merging by sequence identity).

SV-Pop makes extensive use of wildcards in paths. For example, "{sample}" embedded in a path will be replaced with a sample name. Wildcards are used to locate input and to request annotations for a specific variant source.
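Wildcard substitution of this kind behaves much like Python's own string formatting. The sketch below only illustrates the idea; in SV-Pop the substitution is handled by the pipeline, and the path pattern and the helper name `fill_path` are hypothetical.

```python
def fill_path(pattern, **wildcards):
    """Replace each "{name}" placeholder in a path with its value."""
    return pattern.format(**wildcards)


# Hypothetical input path pattern containing two wildcards.
data_pattern = "/data/calls/{sample}/pbsv_{vartype}.vcf.gz"

print(fill_path(data_pattern, sample="HG00733", vartype="sv"))
# /data/calls/HG00733/pbsv_sv.vcf.gz
```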

All data files in SV-Pop are gzipped to conserve space. Multiple samples, annotations, and merges can use a surprising amount of disk space.

Install

Clone the repository and submodules:
git clone --recursive https://github.com/EichlerLab/svpop.git

Requires Python 3 with these packages:

  1. BioPython
  2. numpy
  3. pandas
  4. pysam
  5. snakemake
  6. intervaltree
  7. matplotlib (optional)

External tools SV-Pop may call:

  1. samtools: Input variant processing, variant sequence FASTA indexes, and alt-mapping insertions.
  2. tabix: Needed if variants are transformed back to VCF (optional).
  3. bcftools: VCFs are parsed with bcftools into tables for SV-Pop.
  4. minimap2: Used for re-mapping insertion sequence (optional).
  5. bedtools: Mainly used for processing annotations from UCSC.
  6. UCSC toolkit: Generating tracks for UCSC (optional).
  7. RepeatMasker: Annotating SV sequences (optional).
  8. TRF: Annotating SV sequences (optional).

Configuration

The main configuration file config/config.json is in JSON format. It contains paths to the reference and defines rules for merging. Input sources are contained in config/samples.tsv.

config/config.json

If there are no merges defined, then the configuration file only needs to define a path to the reference assembly and the UCSC reference name for automatically pulling annotations (default "hg38").

For example:

{
  "reference": "/path/to/hg38.no_alt.fa.gz",
  "ucsc_ref_name": "hg38"
}
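A quick way to sanity-check such a file before starting a run is sketched below. This is not part of SV-Pop: the helper name `load_config` is hypothetical, and the only rules it encodes are the ones stated above (a reference path is needed, and the UCSC name defaults to "hg38").

```python
import json


def load_config(text):
    """Parse configuration JSON and fill in the documented default."""
    config = json.loads(text)
    if "reference" not in config:
        raise KeyError("config is missing the 'reference' path")
    # The UCSC reference name defaults to "hg38" when it is not given.
    config.setdefault("ucsc_ref_name", "hg38")
    return config


example = '{"reference": "/path/to/hg38.no_alt.fa.gz"}'
config = load_config(example)
print(config["ucsc_ref_name"])  # hg38
```

To check a real file, pass the contents of config/config.json to the function, for example `load_config(open("config/config.json").read())`.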

config/samples.tsv

This is a tab-separated-values (TSV) file defining input sources. TSVs can be edited in a text editor (ensure fields are separated by tab characters and not spaces) or in a spreadsheet program such as Excel or LibreOffice. SV-Pop can support any number of input sources, which may be individual callers.

Field    Description
NAME     Input source name
SAMPLE   Sample name or "DEFAULT"
TYPE     Input file type (format or tool)
DATA     Path to variant data
VERSION  Version of the program that generated the data
PARAMS   Additional parameters for the input parser
COMMENT  Free-form comment

NAME: Name can be any identifier, such as "pbsv", "pav-hifi", "ebert2021". This identifier will be used to request variants from a specific source.

SAMPLE: This is typically DEFAULT, which will be used to locate any sample (requires "{sample}" in the DATA path). SV-Pop can also accept a TSV with one line per sample (e.g. NAME = "pbsv" for two lines with a unique sample ID on each line).

TYPE: Determines how the input is parsed.

Supported types are:

  1. bed: DATA is a path to a pre-formatted BED file (BED 6+, #CHROM, POS, END, ID, SVTYPE, SVLEN, + other optional fields). IDs must be unique.
  2. pavbed: DATA is a path to a bed directory in PAV (results/{sample}/bed in a PAV run directory).
  3. pbsv: DATA is a path to pbsv output VCFs. Requires wildcard "{vartype}" in the path which is filled with "sv" for SVs or "dup" for duplication calls.
  4. sniffles: DATA is a path to a Sniffles VCF.
  5. svim: DATA is a path to a SVIM VCF.
  6. svimasm: DATA is a path to a SVIM-asm VCF.
  7. gatk: DATA is a path to a GATK VCF.
    • Retrieves FORMAT fields GT, GQ, DP, and AD
  8. longshot: DATA is a path to a longshot VCF.
    • Retrieves FORMAT fields GT, GQ, and DP
  9. vcf: DATA is a path to an arbitrary VCF.
    • Use PARAMS field keywords "info" and "format" to specify INFO and FORMAT fields to include in the BED file as variant annotations (e.g. "info=AF;format=GT,GQ").
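The example string "info=AF;format=GT,GQ" breaks down into keywords and field lists as sketched below. This shows only how the documented example is structured; it is not SV-Pop's parser, and the function name `parse_params` is hypothetical.

```python
def parse_params(params):
    """Split a string like "info=AF;format=GT,GQ" into keyword -> field list."""
    parsed = {}
    for item in params.split(";"):
        item = item.strip()
        if not item:
            continue  # Tolerate empty segments such as a trailing ";".
        key, _, value = item.partition("=")
        # Each keyword maps to a comma-separated list of field names.
        parsed[key.strip()] = [v.strip() for v in value.split(",") if v.strip()]
    return parsed


print(parse_params("info=AF;format=GT,GQ"))
# {'info': ['AF'], 'format': ['GT', 'GQ']}
```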

DATA: Path to input. See supported types for a description of what DATA should point to.

VERSION: Can be used by parsers to modify how variants are parsed. Currently only used for documentation purposes.

PARAMS: Additional parameters. Currently used by the generic VCF parser to specify which additional fields should be included.
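Because stray spaces in place of tabs are an easy mistake, one option is to generate the table programmatically. The sketch below writes two hypothetical input sources using the fields described above. The source names, data paths, and comments are invented for illustration, and it assumes the file starts with a header line of field names and that unused fields may be left empty.

```python
import csv
import io

FIELDS = ["NAME", "SAMPLE", "TYPE", "DATA", "VERSION", "PARAMS", "COMMENT"]

# Hypothetical sources: a pbsv callset and a generic VCF with extra fields.
ROWS = [
    {
        "NAME": "pbsv", "SAMPLE": "DEFAULT", "TYPE": "pbsv",
        "DATA": "/path/to/pbsv/{sample}/variants_{vartype}.vcf.gz",
        "VERSION": "", "PARAMS": "", "COMMENT": "example pbsv source",
    },
    {
        "NAME": "myvcf", "SAMPLE": "DEFAULT", "TYPE": "vcf",
        "DATA": "/path/to/vcf/{sample}.vcf.gz",
        "VERSION": "", "PARAMS": "info=AF;format=GT,GQ",
        "COMMENT": "example generic VCF source",
    },
]


def samples_tsv(rows):
    """Render input sources as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(samples_tsv(ROWS))
```

The returned text can then be written to config/samples.tsv.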

Running SV-Pop

There are two ways to run SV-Pop.

  1. Run scripts: Execute through rundist and runlocal. This requires a little more configuration, but allows SV-Pop to be quickly executed for a number of projects. This is the recommended way to run SV-Pop.
  2. Snakemake direct: Execute by calling snakemake directly.

Both methods are outlined below.

Always run SV-Pop from a clean working directory containing only config, rundist, and runlocal. Do not execute from the install directory.

Run scripts

To execute via run scripts, go to the run directory and link rundist and runlocal from the install directory.

Example:

ln -s /path/to/svpop/rundist ./
ln -s /path/to/svpop/runlocal ./

rundist is used to distribute jobs over a cluster, and runlocal is used to run the pipeline in the current session. Configuring rundist requires some knowledge of how jobs are distributed on your cluster.

Both scripts set up some control variables and pass control over to a user-defined script. rundist will search for config/rundist.sh, and runlocal will search for config/runlocal.sh. Each script first searches the run directory, then the install directory. Once the user-defined script is found, it is called to run Snakemake. Generally, you would add your run scripts to the install config directory so they can be used for any number of projects. Some runs may need custom resources not typical for other projects, and for those, you can override the script in the SV-Pop install directory with one in the run directory.

Examples for what your config/rundist.sh and config/runlocal.sh might look like are in comments at the bottom of rundist and runlocal.

After the symbolic links are created and your config/rundist.sh and/or config/runlocal.sh are in place, SV-Pop is run:

./rundist 20 results/variant/...

./runlocal 20 results/variant/...

Where "results/variant/..." is the path to the desired output file.

The first number (20 in the example) is the number of concurrent jobs. Everything else is passed to Snakemake as a target, and there may be multiple targets.

See TARGETS.md for a list of target output files.

Snakemake direct

Examples in this section assume the shell variable SVPOP is set to the install directory (the directory containing Snakefile).

To run a single sample, request any output file from Snakemake.

For example:

snakemake -s ${SVPOP}/Snakefile results/variant/...

Running targets

There is no uniform endpoint for SV-Pop. Desired output files are requested by running a Snakemake command, and the output is generated. That output may be a set of variant calls, merged variants for multiple samples and/or callers, annotations, or plots.

Example:

./rundist 20 results/variant/caller/pbsv/HG00733/all/all/bed/sv_ins.bed.gz

This file requests variants from caller pbsv in BED format for HG00733.

Wildcards

The above path contains definitions for several wildcards:

results/variant/{sourcetype}/{sourcename}/{sample}/{filter}/{svset}/bed/{vartype}_{svtype}.bed.gz
  1. sourcetype: Type of input. "caller" (single caller/sample), "sampleset" (merge across samples), or "callerset" (merge across callers).
  2. sourcename: Name of the source.
    • sourcetype caller: Matches NAME in config/samples.tsv
    • sourcetype sampleset or callerset: Configured sampleset or callerset merge (defined in config/config.json)
  3. sample: Name of the sample
    • sampleset: Name of a list of samples defined in config/samples.tsv in section samplelist.
  4. filter: An early filter applied before any annotation or merging. These filters are BED files of regions that should not be included (any variant intersect is dropped). "all" indicates no filtering.
  5. svset: A powerful filter for subsetting variants (e.g. "notr" for no tandem repeats). Can be used to filter by annotations inside the variant BED (e.g. variant length) or outside (e.g. intersection with annotated regions). This is a customizable feature. "all" indicates no filtering.
  6. vartype: Type of variant (snv, sv, indel, dup, rgn, or sub).
  7. svtype: Subtype of variant (snv, ins, del, inv, dup, rgn, sub).
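Since targets are plain paths, they can be assembled or taken apart with a few lines of code when requesting many files. The sketch below builds a target from the wildcards listed above and splits one back into its parts. The helper names are hypothetical, and the regular expression assumes that no wildcard value contains a "/" and that vartype and svtype contain no "_".

```python
import re

TARGET_PATTERN = (
    "results/variant/{sourcetype}/{sourcename}/{sample}/"
    "{filter}/{svset}/bed/{vartype}_{svtype}.bed.gz"
)

TARGET_REGEX = re.compile(
    r"results/variant/(?P<sourcetype>[^/]+)/(?P<sourcename>[^/]+)/"
    r"(?P<sample>[^/]+)/(?P<filter>[^/]+)/(?P<svset>[^/]+)/bed/"
    r"(?P<vartype>[^/_]+)_(?P<svtype>[^/_.]+)\.bed\.gz"
)


def variant_target(sourcetype, sourcename, sample, vartype, svtype,
                   filter_name="all", svset="all"):
    """Build a variant BED target path; "all" means no filtering."""
    return TARGET_PATTERN.format(
        sourcetype=sourcetype, sourcename=sourcename, sample=sample,
        filter=filter_name, svset=svset, vartype=vartype, svtype=svtype,
    )


def parse_variant_target(path):
    """Split a variant BED target path into its wildcard values."""
    match = TARGET_REGEX.fullmatch(path)
    if match is None:
        raise ValueError(f"Not a variant BED target: {path}")
    return match.groupdict()


target = variant_target("caller", "pbsv", "HG00733", "sv", "ins")
print(target)
# results/variant/caller/pbsv/HG00733/all/all/bed/sv_ins.bed.gz
print(parse_variant_target(target)["sample"])  # HG00733
```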

Annotations will have custom parameter wildcards that determine their behavior.

Built-in filters

The "{filter}" wildcard is typically "all" or "lc" (an hg38 filter).

  1. hg38
    1. lc: Drops low-confidence regions determined by Audano 2019 (PMID 30661756) on CLR data. May be outdated for modern technology.
    2. lcy: lc and drops chrY.
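The effect of a region filter, where any variant that intersects a filter region is dropped, can be sketched as follows. This is an illustration rather than SV-Pop's code: the function names, regions, and variant records are hypothetical, and coordinates are assumed to be half-open BED intervals.

```python
def overlaps(variant, region):
    """True if a variant shares at least one base with a (chrom, start, end) region."""
    chrom, start, end = region
    return (
        variant["#CHROM"] == chrom
        and variant["POS"] < end
        and variant["END"] > start
    )


def apply_region_filter(variants, regions):
    """Keep only variants that intersect none of the filter regions."""
    return [
        v for v in variants
        if not any(overlaps(v, region) for region in regions)
    ]


# Hypothetical filter regions and variant records.
filter_regions = [("chr1", 900, 1100)]
calls = [
    {"#CHROM": "chr1", "POS": 1000, "END": 1001, "ID": "in-region"},
    {"#CHROM": "chr1", "POS": 5000, "END": 5300, "ID": "outside"},
    {"#CHROM": "chr2", "POS": 1000, "END": 1001, "ID": "other-chrom"},
]

kept = apply_region_filter(calls, filter_regions)
print([v["ID"] for v in kept])  # ['outside', 'other-chrom']
```

The linear scan over regions is fine for a sketch; a real filter over genome-scale data would normally use an interval index or a tool such as bedtools.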

Annotations

The pipeline performs many annotations, including intersecting with UCSC tracks (see below), running TRF and RepeatMasker on inserted or deleted SV sequence, finding the reference mapping location for SV insertions (SV donor sites for duplications), and intersecting with homopolymer runs.

  • UCSC tracks: GRC patches, centromeres, gaps, AGP, chromosome band, RefSeq (CDS, ncRNA, intron, upstream/downstream flank), TRF, segmental duplications (SD), RepeatMasker, CpG islands, ENCODE histone marks, and ORegAnno.

Example:

results/variant/caller/pav/HG00733/all/all/anno/refseq/refseq-count_sv_ins.tsv.gz

This file requests RefSeq intersections for PAV HG00733 SV insertions. The table counts the number of bases affected within coding regions, UTRs, introns, and ncRNA exons.

Merging and intersecting variants

The pipeline can merge variants in two ways:

  1. Callerset: Merge variants from different callers for the same sample.
  2. Sampleset: Merge variants from different samples.

Variants can also be intersected without merging, which determines which variants from two sources are alike and which are different.

See MERGE.md for information on merging and intersecting.

Cite

Cite the current version in new publications (Ebert 2021):

Ebert P, Audano PA, Zhu Q, Rodriguez-Martin B, Porubsky D, Bonder MJ, Sulovari A, Ebler J, Zhou W, Serra Mari R, Yilmaz F, Zhao X, Hsieh P, Lee J, Kumar S, Lin J, Rausch T, Chen Y, Ren J, Santamarina M, Höps W, Ashraf H, Chuang NT, Yang X, Munson KM, Lewis AP, Fairley S, Tallon LJ, Clarke WE, Basile AO, Byrska-Bishop M, Corvelo A, Evani US, Lu TY, Chaisson MJP, Chen J, Li C, Brand H, Wenger AM, Ghareghani M, Harvey WT, Raeder B, Hasenfeld P, Regier AA, Abel HJ, Hall IM, Flicek P, Stegle O, Gerstein MB, Tubio JMC, Mu Z, Li YI, Shi X, Hastie AR, Ye K, Chong Z, Sanders AD, Zody MC, Talkowski ME, Mills RE, Devine SE, Lee C, Korbel JO, Marschall T, Eichler EE. Haplotype-resolved diverse human genomes and integrated analysis of structural variation. Science. 2021 Apr 2;372(6537):eabf7117. doi: 10.1126/science.abf7117. Epub 2021 Feb 25. PMID: 33632895; PMCID: PMC8026704.

The pipeline was originally published in 2019 (Audano 2019):

Audano PA, Sulovari A, Graves-Lindsay TA, Cantsilieris S, Sorensen M, Welch AE, Dougherty ML, Nelson BJ, Shah A, Dutcher SK, Warren WC, Magrini V, McGrath SD, Li YI, Wilson RK, Eichler EE. Characterizing the Major Structural Variant Alleles of the Human Genome. Cell. 2019 Jan 24;176(3):663-675.e19. doi: 10.1016/j.cell.2018.12.019. Epub 2019 Jan 17. PMID: 30661756; PMCID: PMC6438697.

This is NOT the same as "SV-Pop: population-based structural variant analysis and visualization" (Ravenhall et al. 2019. BMC Bioinformatics). It was named before that paper came out.
