hfkroes / variant-prioritization

Pipeline that integrates machine-learning and algorithmic methods to predict potentially splice-altering genetic variants identified in high-throughput sequencing data.

Predicting splice-altering variants



  nextflow run main.nf --i sample.vcf --pcv --gpu --indels spliceai_scores.raw.indel.hg38.vcf.gz --snvs spliceai_scores.raw.snv.hg38.vcf.gz --sdb 2203_hg38 --ref grch38 --fa hg38.fa.gz --t 10 --o results/ -resume

Execution parameters

| Parameter | Type | Description |
|---|---|---|
| --i | file path/expression | Input VCF file path or glob expression |
| --pcv | standalone parameter | Enables usage of precalculated SpliceAI scores |
| --gpu | standalone parameter | Enables SpliceAI GPU annotation |
| --indels | file path | Precomputed SpliceAI indel scores |
| --snvs | file path | Precomputed SpliceAI SNV scores |
| --sdb | folder path | SQUIRLS database folder |
| --fa | file path | FASTA reference file |
| --ref | string | Gene annotation file (either 'grch37', 'grch38' or a custom file) |
| --t | integer | CPU threads to use in multithreading-compatible tasks |
| --o | folder path | Output folder path |

SQUIRLS database folder

The SQUIRLS database can be downloaded from its documentation page. Make sure to download the files that match the genome build being used.

Illumina's SpliceAI precalculated scores

Illumina has made annotations for all possible substitutions, 1-base insertions, and 1-4 base deletions within genes available for download on its online platform. These annotations are free for academic and not-for-profit use; other use requires a commercial license from Illumina, Inc. To use them in this pipeline, run it with the --pcv parameter and pass the score file paths with the --indels and --snvs parameters.
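As a rough illustration, the flags described above could be assembled programmatically before launching the pipeline. The helper name build_command and the file names in the usage note are hypothetical; only the flags themselves come from the parameter table. The rule that both score files must be given together is an assumption of this sketch, not documented pipeline behavior.

```python
from typing import List, Optional


def build_command(vcf: str, out_dir: str, fasta: str, sdb: str, ref: str,
                  snvs: Optional[str] = None, indels: Optional[str] = None,
                  threads: int = 1) -> List[str]:
    """Assemble a `nextflow run` argument list (hypothetical helper)."""
    cmd = ["nextflow", "run", "main.nf",
           "--i", vcf, "--sdb", sdb, "--ref", ref,
           "--fa", fasta, "--t", str(threads), "--o", out_dir]
    # Assumption: --pcv is only added when both score files are supplied.
    if snvs and indels:
        cmd += ["--pcv", "--indels", indels, "--snvs", snvs]
    elif snvs or indels:
        raise ValueError("provide both the SNV and indel score files, or neither")
    return cmd
```

Calling build_command("sample.vcf", "results/", "hg38.fa.gz", "2203_hg38", "grch38", snvs="snv_scores.vcf.gz", indels="indel_scores.vcf.gz") returns a list that can be passed to subprocess.run; without the two score files, the --pcv flag is left out.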

Glob expressions

To input multiple files at once, glob expressions can be used as described in the Nextflow documentation. Glob expressions must be enclosed in quotes; individual file paths do not need to be.
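To check which files a pattern is likely to select before running the pipeline, a small preview like the one below can help. It is only an approximation: it uses Python's fnmatch module, whose wildcard rules are similar to, but not identical with, the glob syntax Nextflow accepts. The function name and the file names are made up for illustration.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def preview_matches(pattern: str, filenames: Iterable[str]) -> List[str]:
    """Return the file names that a simple wildcard pattern would select."""
    # fnmatchcase is case-sensitive on every platform, so results are stable.
    return sorted(name for name in filenames if fnmatchcase(name, pattern))


# Hypothetical directory listing.
files = ["sample1.vcf", "sample2.vcf", "notes.txt", "sample1.vcf.gz"]
print(preview_matches("sample*.vcf", files))
```

The quotes matter on the command line because an unquoted pattern is expanded by the shell before Nextflow sees it, so the pipeline would receive a list of separate paths rather than the expression itself.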

Custom gene annotation files

It's possible to create custom gene annotation files using the files here as a template.

CPU threads

Some tasks in the pipeline support multithreading; the number of CPU threads used for those tasks is set by the --t parameter.


Predictions are annotated in the INFO field of the VCF file with the general format:



SQUIRLS provides a single score per variant: the maximum of the splicing pathogenicity scores it calculates for that variant. More information can be found in the SQUIRLS documentation.


SpliceAI provides a series of scores and related information separated by |, in the order presented in the table below. More information can be found in the SpliceAI GitHub repository.

| Field | Description |
|---|---|
| ALLELE | Alternate allele |
| SYMBOL | Gene symbol |
| DS_AG | Delta score (acceptor gain) |
| DS_AL | Delta score (acceptor loss) |
| DS_DG | Delta score (donor gain) |
| DS_DL | Delta score (donor loss) |
| DP_AG | Delta position (acceptor gain) |
| DP_AL | Delta position (acceptor loss) |
| DP_DG | Delta position (donor gain) |
| DP_DL | Delta position (donor loss) |
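The field order in the table can be turned into a small parser. The sketch below is a minimal example, not the pipeline's own code: the function name is hypothetical, the sample value in the usage note is invented, and it assumes that multiple annotations for one variant are separated by commas and that every score and position is present as a number.

```python
from typing import Dict, List, Union

# Field order as listed in the table above.
FIELDS = ["ALLELE", "SYMBOL", "DS_AG", "DS_AL", "DS_DG", "DS_DL",
          "DP_AG", "DP_AL", "DP_DG", "DP_DL"]

Record = Dict[str, Union[str, float, int]]


def parse_spliceai(value: str) -> List[Record]:
    """Split a SpliceAI annotation string into one dict per annotation."""
    records: List[Record] = []
    for entry in value.split(","):  # assumption: comma-separated annotations
        parts = entry.split("|")
        if len(parts) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
        record: Record = {"ALLELE": parts[0], "SYMBOL": parts[1]}
        for name, raw in zip(FIELDS[2:6], parts[2:6]):
            record[name] = float(raw)  # delta scores
        for name, raw in zip(FIELDS[6:], parts[6:]):
            record[name] = int(raw)    # delta positions
        records.append(record)
    return records
```

For instance, parse_spliceai("T|GENE1|0.01|0.00|0.91|0.08|-28|-46|-2|-31") returns a single record whose DS_DG is 0.91 and whose DP_DG is -2. Missing values (for example a "." placeholder) are not handled here and would raise an error.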


The SplicePrediction is the final annotation of the pipeline for each variant according to the cutoffs and annotation mode utilized. It indicates P for a pathogenic prediction and N for a non-pathogenic prediction.
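The idea of a cutoff-based P/N call can be sketched as follows. This is purely illustrative: the pipeline's actual cutoffs, annotation modes, and the way it combines the two tools are not described here, so the function name, the "either tool exceeds its cutoff" rule, and the cutoff values in the usage note are all assumptions.

```python
def splice_prediction(spliceai_max_delta: float, squirls_score: float,
                      spliceai_cutoff: float, squirls_cutoff: float) -> str:
    """Return 'P' (pathogenic) or 'N' (non-pathogenic) for one variant.

    Assumed rule: call 'P' when either score reaches its cutoff.
    The real pipeline may combine the scores differently.
    """
    if spliceai_max_delta >= spliceai_cutoff or squirls_score >= squirls_cutoff:
        return "P"
    return "N"
```

With placeholder cutoffs of 0.5 for both tools, splice_prediction(0.91, 0.10, 0.5, 0.5) gives "P" and splice_prediction(0.05, 0.10, 0.5, 0.5) gives "N". Here spliceai_max_delta stands for the largest of the four delta scores (DS_AG, DS_AL, DS_DG, DS_DL).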


Jaganathan K, Kyriazopoulou Panagiotopoulou S, McRae JF, et al. Predicting Splicing from Primary Sequence with Deep Learning. Cell. 2019;176(3):535-548.e24. doi:10.1016/j.cell.2018.12.015

Danis D, Jacobsen JOB, Carmody LC, et al. Interpretable prioritization of splice variants in diagnostic next-generation sequencing [published correction appears in Am J Hum Genet. 2021 Nov 4;108(11):2205]. Am J Hum Genet. 2021;108(9):1564-1577. doi:10.1016/j.ajhg.2021.06.014

