Docker image for VARPP-RuleFit

Overview

This repository contains the scripts needed to build a Docker image that runs VARPP-RuleFit and uses the resulting model to predict on a patient .vcf file.

Install

Docker must be installed on the machine you pull the image to. To install the image, pull it from Docker Hub as follows:

docker pull greenlantern/varpp-predict-utils
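Since the pull only works when Docker is present, a small guard can make the failure mode clearer. The following is a minimal sketch; the helper names `have_cmd` and `pull_varpp_image` are made up for this illustration and are not part of the image.

```shell
#!/bin/sh
# Return success if the named command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Pull the image only when the docker CLI is present;
# otherwise print a hint and return a non-zero status.
# (Hypothetical helper, not shipped with the image.)
pull_varpp_image() {
    if have_cmd docker; then
        docker pull greenlantern/varpp-predict-utils
    else
        echo "docker not found on PATH; install Docker first" >&2
        return 1
    fi
}
```

Calling `pull_varpp_image` after sourcing the snippet performs the same pull as the command above.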

Run the model

Necessary data

Please note that this image is only one of several components necessary to run the model. The image alone is enough to create a custom VARPP-Rule model, but predicting with it requires extra data. The Docker image is part of a DataLad setup that requires:

Positional data files for genetic variants across all genes available in the GTEx and HCL data sets.
For prediction with pre-computed models, the VARPP-Rule results files for all HPO terms tested in the associated publication.
The container that prepares the patient variant file, in case it is annotated with hg19 instead of hg38 reference genome information.

Execute the model

To get help on creating a custom VARPP-RuleFit model for predicting pathogenic variants of a patient, run the following:

docker run greenlantern/varpp-predict-utils varppRule.R --help
Usage: /root/varpp-prediction/varppRule.R [options]

 VARPP-RuleFit.

Options:
	-g CHARACTER, --gene_list=CHARACTER
		A list of potentially pathogenic genes associated with the patient.

	-t CHARACTER, --type=CHARACTER
		Run VARPP-RuleFit with either 'hcl', 'gtex' or 'custom' data

	-p CHARACTER, --user_pathogenic=CHARACTER
		A file with custom pathogenic data in the same format as the HCL or GTEx data.

	-b CHARACTER, --user_benign=CHARACTER
		A file with custom benign data in the same format as the HCL or GTEx data.

	-o CHARACTER, --output_path=CHARACTER
		Specify the path to the results. Defaults to the 'results' in the current directory.

	-c CHARACTER, --cores=CHARACTER
		Maximum number of cores available. This will be distributed across the model

	-n CHARACTER, --ntree=CHARACTER
		Number of trees in the model, defaults to 2000.

	-h, --help
		Show this help message and exit

Example:

  ./varppRule.R -g <gene list> -t <type: GTEx, HCL, custom> -p <User provided pathogenic data> -b <User provided benign data> -o <output/path> -c <Maximum number of cores available> -n <Number of trees, defaults to 2000>

The help output above lists all options available for running the program.
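Because training can take a long time, it may help to check the `-t/--type` related arguments on the host before starting the container. The sketch below uses the three values named in the help text ('hcl', 'gtex', 'custom'). It assumes that 'custom' needs both the `-p` and `-b` files, which the help text suggests but does not state explicitly, and `check_type_args` is a hypothetical helper name.

```shell
#!/bin/sh
# Preflight check for the -t/--type argument of varppRule.R.
# $1: model type, $2: pathogenic data file, $3: benign data file.
# Assumption: 'custom' requires both data files to exist.
check_type_args() {
    model_type=$1
    pathogenic=${2:-}
    benign=${3:-}
    case $model_type in
        hcl|gtex)
            return 0 ;;
        custom)
            if [ -f "$pathogenic" ] && [ -f "$benign" ]; then
                return 0
            fi
            echo "type 'custom' needs existing -p and -b files" >&2
            return 1 ;;
        *)
            echo "unknown type: $model_type" >&2
            return 1 ;;
    esac
}
```

For example, `check_type_args custom pathogenic.csv benign.csv && echo ok` prints `ok` only when both files exist; the file names here are placeholders.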

Predict with the model

To predict with the model created in the `Run the model` step, use the container's predict.R script:

docker run greenlantern/varpp-predict-utils predict.R --help
varpp prediction script.

Options:
	-v CHARACTER, --patient_vcf=CHARACTER
		A patient derived .vcf file.

	-d CHARACTER, --data=CHARACTER
		Model data to predict with. Either GTEx or HCL.

	-p CHARACTER, --hpo=CHARACTER
		HPO term ID to predict with. This needs to be in the list of HPO terms that were pre-modelled.

	-c CHARACTER, --custom_model=CHARACTER
		Path to a pre-trained VARPP-RuleFit model in the form of a .rds object If this option is chosen,-p/--hpo option will be ignored.

	-o CHARACTER, --output_path=CHARACTER
		Specify the path to the results. Defaults to the 'results' in the current directory.

	-h, --help
		Show this help message and exit

Example:

  ./predict.R -v <patient.vcf> -d <data: GTEx or HCL> -p <HPO term ID> -o <output/path>
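As a rough illustration of how the `predict.R` options fit together with a mounted working directory, the sketch below wraps the call in a function with a dry-run switch. It assumes the same `/mnt` mount used elsewhere in this README and that the extra data listed under `Necessary data` is already in place. The names `run`, `predict_hpo` and `DRY_RUN` are hypothetical.

```shell
#!/bin/sh
# Print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Predict on a patient VCF with a pre-computed model for one HPO term.
# $1: VCF path relative to the current directory
# $2: model data, GTEx or HCL
# $3: HPO term ID
# Assumption: the container reads and writes under the /mnt mount.
predict_hpo() {
    run docker run -v "$(pwd)":/mnt greenlantern/varpp-predict-utils:latest \
        predict.R -v "$1" -d "$2" -p "$3"
}
```

Running `DRY_RUN=1 predict_hpo patientVcf/patient.vcf GTEx HP:0001948` prints the assembled command so it can be inspected before anything is executed.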

Perform both: Create model and predict

The most complete way of running VARPP-RuleFit is the varppRuleAndPredict.R script, which creates the model and predicts in one step:

docker run greenlantern/varpp-predict-utils varppRuleAndPredict.R --help

Usage: /root/varpp-prediction/varppRuleAndPredict.R [options]

 VARPP-RuleFit.

Options:
	-v CHARACTER, --patient_vcf=CHARACTER
		A patient derived .vcf file.

	-g CHARACTER, --gene_list=CHARACTER
		A list of potentially pathogenic genes associated with the patient.

	-t CHARACTER, --type=CHARACTER
		Run VARPP-RuleFit with either 'hcl', 'gtex' or 'custom' data

	-p CHARACTER, --user_pathogenic=CHARACTER
		A file with custom pathogenic data in the same format as the HCL or GTEx data.

	-b CHARACTER, --user_benign=CHARACTER
		A file with custom benign data in the same format as the HCL or GTEx data.

	-o CHARACTER, --output_path=CHARACTER
		Specify the path to the results. Defaults to the 'results' in the current directory.

	-c CHARACTER, --cores=CHARACTER
		Maximum number of cores available. This will be distributed across the model

	-n CHARACTER, --ntree=CHARACTER
		Number of trees in the model, defaults to 2000.

	-l CHARACTER, --lasso=CHARACTER
		Number of LASSO bootstrap rounds.

	-m CHARACTER, --max_depth=CHARACTER
		Maximum tree depth, defaults to 3.

	-h, --help
		Show this help message and exit

Example:

  ./varppRuleAndPredict.R -v <path to patient vcf file> -g <gene list> -t <type: GTEx, HCL, custom> -p <User provided pathogenic data> -b <User provided benign data> -o <output/path> -c <Maximum number of cores available> -n <Number of trees, defaults to 2000> -l <Number of LASSO bootstrap rounds> -m <Tree depth>

Example

The following command takes the gene list for HP:0001948 (Alkalosis):

docker run -v "$(pwd)":/mnt greenlantern/varpp-predict-utils:latest varppRuleAndPredict.R \
               -v patientVcf/patient.vcf \
               -g ASL,CLCNKB,KCNJ10,SLC12A3,SCNN1B,SCNN1G,HSD11B2,AIP,CLCNKA,BSND,SLC12A1,KCNJ1,SARS2,OTC,NR3C1,CACNA1D,ASS1,CA5A,SLC26A3,SLC25A15,USP8,NARS2,IKZF1,CYP17A1,KCNJ5,HLA-B,SCNN1A,CPS1,NOS1,REN \
               -t gtex \
               -c 64 \
               -l 100 \
               -m 3
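Long gene lists like the one above are easier to maintain in a file. The sketch below joins a file with one gene symbol per line into the comma-separated form passed to `-g`. The helper name `gene_list_from_file` and the file layout are assumptions for illustration only.

```shell
#!/bin/sh
# Join gene symbols (one per line, blank lines ignored)
# into a single comma-separated string for the -g option.
gene_list_from_file() {
    grep -v '^[[:space:]]*$' "$1" | paste -sd, -
}
```

The result can then be passed on the command line, for example `-g "$(gene_list_from_file genes.txt)"`, where `genes.txt` is a placeholder file name.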

Results

The console will show an output similar to the following:

Predictions are:
   Chr   Start     End    Gene CADD_raw_rankscore Prediction
1 chr1  941187  941188  SAMD11             22.200  0.9999974
2 chr1  955212  955213   NOC2L              0.267  0.9999916
3 chr1  968669  968670 PLEKHN1              1.024  0.9999995
4 chr1 1006171 1006172   ISG15              3.296  0.9999857
5 chr1 1023443 1023444    AGRN              1.481  0.9999972
6 chr1 1024497 1024498    AGRN              1.584  0.9999973

The results are stored in scratch in the current working directory as:

predictions.csv
VARPP-RuleFit_results.rds
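To look at the highest-scoring variants first, the CSV can be sorted on the host. This is a minimal sketch that assumes `predictions.csv` is comma-separated, has a single header line, and carries the prediction score in its last column, as the console output suggests; the actual file layout is not documented here. The helper name `top_predictions` is hypothetical.

```shell
#!/bin/sh
# Print the header plus the n highest-scoring rows.
# $1: CSV file, $2: number of rows (default 10).
# Assumption: the score is the last comma-separated column.
top_predictions() {
    file=$1
    n=${2:-10}
    # Count the columns in the header to locate the last one.
    ncol=$(head -n 1 "$file" | awk -F, '{ print NF }')
    head -n 1 "$file"
    # -g sorts general numeric values (including scientific notation);
    # it is supported by GNU and BSD sort but is not part of POSIX.
    tail -n +2 "$file" | sort -t, -k"$ncol","$ncol" -g -r | head -n "$n"
}
```

For example, `top_predictions predictions.csv 20` would list the header and the 20 rows with the largest values in the last column.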
