AbExp variant effect prediction pipeline

AbExp is a tool to predict aberrant gene underexpression in 48 human tissue based on DNA sequence variants. It was trained on aberrant gene expression calls from the GTEx dataset.

This repository contains a bioinformatics software pipeline for calculating AbExp variant effect predictions, taking vcf files as input. The preprint to this method can be found on BioRxiv.

Minimum resource requirements

disk space: 1.0-1.4TB for cache files (hg19 + hg38)
- LOFTEE: 25GB
- VEP v108: 113GB
- CADD v1.6: 854GB
- SpliceAI-RocksDB (optional): 349GB
RAM: 64GB
GPU supporting CUDA for SpliceAI annotation

Setup

install conda and mamba on your system

To download the VEP cache (if not existing yet):

VEP_CACHE_PATH="<your cache path here>"
VEP_VERSION=108

mamba env create -f scripts/veff/vep_env.v108.yaml --name vep_v108
conda activate vep_v108

bash misc/install_vep_cache/install_cache_for_version.sh $VEP_VERSION $VEP_CACHE_PATH

conda deactivate

To download the CADD cache (if not existing yet):

CADD_CACHE_PATH="<your cache path here>"

bash misc/download_CADD_v1.6.sh $CADD_CACHE_PATH

To download LOFTEE data and scripts:

LOFTEE_DIR="<your path here>"

bash misc/install_vep_cache/download_loftee.sh $LOFTEE_DIR

configure system_config.yaml:
- specify paths to the VEP cache, CADD cache, LOFTEE data and LOFTEE source code as defined in steps 2-4
- (optional) Disable downloading the SpliceAI-RocksDB cache for pre-computed SpliceAI annotations
- (optional) Change file paths of automatically downloaded annotations to shared location
- (optional) Any options defined in defaults.yaml can be overwritten in this file if necessary
run mamba env create -f envs/abexp-veff-py.yaml
activate the created environment: conda activate abexp-veff-py

Usage

Edit config.yaml to your needs:
- Specify vcf_input_dir. All .vcf|.vcf.gz|.vcf.bgz|.bcf files in this folder will be annotated.
- Specify vcf_is_normalized: True if all variants are left-normalized and biallelic (bcftools norm -cs -m). Otherwise, the pipeline will normalize the variants before annotation.
- Specify the output_dir
- Specify fasta_file and gtf_file corresponding to the human_genome_version. The gtf_file needs to contain Ensembl gene and transcript identifiers. Therefore, it is highly recommended to use the Gencode genome annotations.
An example is pre-configured and can be used to test the pipeline.
Run snakemake --use-conda -c all. All rules are annotated with resource requirements s.t. snakemake can submit jobs to HPC clusters or cloud environments. It is highly recommended to use snakemake with some batch submission system, e.g. SLURM. For further information, please visit the Snakemake documentation.
The resulting variant effect predictions will be stored in <output_dir>/predict/abexp_v1.0/<input_vcf_file>.parquet. It should contain the following columns:
- 'chrom': chromosome of the variant
- 'start': start position of the variant (0-based)
- 'end': end position of the variant (1-based)
- 'ref': reference allele
- 'alt': alternate allele
- 'gene': the gene affected by the variant
- 'tissue': GTEx tissue, e.g. "Artery - Tibial"
- 'tissue_type', GTEx tissue type, e.g. "Blood Vessel"
- 'abexp_v1.0': The predicted AbExp score
- a set of features used to predict the AbExp score

Development setup

Advanced users who want to edit this pipeline can use the following steps to convert the python scripts back to Jupyter notebooks:

Make sure that the jupytext command is available, e.g. via mamba install jupytext
run find scripts/ -iname "*[.py.py|.R.R]" -exec jupytext --sync {} \; to convert all percent scripts to jupyter notebooks Jupyter will then automatically synchronize the percent scripts with the corresponding notebook files.

gagneurlab / AbExp

AbExp variant effect prediction pipeline

Minimum resource requirements

Setup

Usage

Development setup

About

Languages