EchoLee0925 / TIGAR

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

TIGAR

TIGAR stands for Transcriptome-Integrated Genetic Association Resource, which is developed using Python and BASH scripts for conducting Transcriptome-Wide Association Studies (TWAS) by training gene expression imputation models by nonparametric Bayesian Dirichlet Process Regression (DPR) and Elastic-Net (PrediXcan) methods with reference transcriptomic panels.

  1. TIGAR can train gene expression imputation models by both nonparametric Bayesian DPR and Elastic-Net (PrediXcan) methods with reference dataset that contain transcriptomic and genetic data of the same samples.
  2. Impute Genetically Regulated gene eXpression (GReX) from Individual-level genetic data.
  3. Conduct TWAS using both Individual-level and Summary-level GWAS data for studying Univariate and Multivariate phenotypes. Both FUSION and S-PrediXcan Z-score test statistics will calculated in the TWAS results.
  4. Conduct VC-TWAS using Individual-level and Summary-level GWAS data using cis-eQTL effect sizes estimated from reference panels for Univariate phenotypes.
  5. Note: please use our most recently updated scripts on Github to conduct TWAS.
  6. Cis-eQTL Effect Sizes (i.e., weights) estimated from GTEx V8 reference data of 48 tissue types by Nonparametric Bayesian DPR method, and Reference LD Files from GTEx V8 EUR samples, are available from Synapse:syn16804296.
  7. Please cite our TIGAR papers:


Software Setup

1. Install BGZIP, TABIX, Python 3.5, and the following Python libraries

Python Environment Setup (using conda or pip)

Using conda
  • Installation:
# create the environment tigarenv
conda create --name tigarenv python=3.5 pandas numpy scipy scikit-learn statsmodels
# deactivate the conda environment
conda deactivate
  • Activate Environment:
# activate the environment
conda activate tigarenv
# set the PYTHONPATH
export PYTHONPATH=${CONDA_PREFIX}/lib/python3.5/site-packages/:$PYTHONPATH

After running TIGAR use conda deactivate to deactivate the conda environment.

Using pip
  • Installation:
# install pip
# install virtualenv
python3 -m pip install --user virtualenv
# cd to preferred install_directory
cd ${install_dir}
# create the virtual environment tigarenv in the current directory
python3 -m virtualenv tigarenv --python=python3.5
# activate the environment
source ${install_dir}/tigarenv/bin/activate
# install the packages
python3 -m pip install numpy==1.15.2 pandas==0.23.4 scikit-learn==0.20.0 scipy==1.1.0 statsmodels==0.9.0
# deactivate the environment
deactivate
  • Activate Environment:
# activate the environment
source ${install_dir}/tigarenv/bin/activate
# set the PYTHONPATH
PYTHONPATH=${install_dir}/tigarenv/lib/python3.5/site-packages/:$PYTHONPATH

After running TIGAR use deactivate to deactivate the environment.

2. Make files executable

  • Make *.sh files in the TIGAR directory executable
TIGAR_dir="/home/jyang/GIT/TIGAR"
chmod 755 ${TIGAR_dir}/*.sh 
  • Make ${TIGAR_dir}/Model_Train_Pred/DPR file executable
chmod 755 ${TIGAR_dir}/Model_Train_Pred/DPR

Input Files

Example input files provided under ./ExampleData/ are generated artificially. All input files are Tab Delimited Text Files.

1. Genotype File

  • --genofile_type vcf
    • --format GT
    • --format DS
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample...
1 100 rs1 C T . PASS . GT:DS 0/0:0.01 ...
  • Sorted by chromosome and base pair position, bgzipped by bgzip, and tabixed by tabix
  • Example tabix command for a VCF file: tabix -f -p vcf *.vcf.gz.
  • First 9 columns are Chromosome number, Base pair position, Variant ID, Reference allele, Alternative allele, Quality score, Filter status, Variant information, Genotype data format
  • Genotype data starts from the 10th column. Each column denotes the genotype data per sample.
  • Example: ./ExampleData/example.vcf.gz

Dosage file

  • --genofile_type dosage
CHROM POS ID REF ALT Sample1 Sample...
1 100 rs** C T 0.01 ...
  • First 5 columns have the same format as the first 5 columns of a VCF file.
  • Dosage genotype data start from the 6th column that are values ranging from 0 to 2, denoting the number of minor alleles. Each column denotes the genotype data per sample.

Usage

--genofile
--genofile_type
--format
  • Train Gene Expression Imputation Models
  • Predict GReX
  • Generate Reference LD Genotype Covariance Files
  • VC-TWAS

2. SampleID File

  • Headerless, single-column file containing sampleIDs to use.
  • Examples:
    • ./ExampleData/sampleID.txt
    • ./ExampleData/test_sampleID.txt

Usage

--train_sampleID

  • Train Gene Expression Imputation Models

--test_sampleID

  • Predict GReX
  • VC-TWAS

--sampleID

  • Generate Reference LD Genotype Covariance Files

3. Gene Expression File

CHROM GeneStart GeneEnd TargetID GeneName Sample1 Sample...
1 100 200 ENSG0000 X 0.2 ...
  • First 5 columns are Chromosome number, Gene start position, Gene end position, Target gene ID, Gene name (optional, could be the same as Target gene ID).
  • Gene expression data start from the 6th column. Each column denotes the corresponding gene expression value per sample.
  • Example: ./ExampleData/gene_exp.txt

Usage

--gene_exp

  • Train Gene Expression Imputation Models
  • TWAS (with individual-level GWAS data)

4. Gene Annotation File

CHROM GeneStart GeneEnd TargetID GeneName
1 100 200 ENSG0000 X
  • Provides a list of genes for TWAS
  • Same format as the first five columns of the Gene Expression File.
  • Example: ./ExampleData/gene_anno.txt

Usage

--gene_anno

  • Predict GReX
  • TWAS (with summary-level GWAS data)
  • VC-TWAS

5. Weight (eQTL effect size) File

CHROM POS REF ALT TargetID ES
1 100 C T ENSG0000 0.2
  • Contain cis-eQTL effect sizes (i.e., SNP weights) estimated from reference data for TWAS. The output *_eQTLweights.txt file by ./TIGAR_Model_Train.sh can be used here as the variant weight file.
  • First 5 columns must be be: Chromosome number, Base pair position, Reference allele, Alternative allele, and Target gene ID.
  • Must contain a column named ES to denote variant weights (estimated eQTL effect sizes from a reference dataset) with for the gene (TargetID).
  • Each row denotes the variant weight (eQTL effect size) for testing the association of the gene (TargetID) with a phenotype
  • Variants will be matched with those from the Zscore file by their unique CHROM:POS:REF:ALT snpID
  • Example: ./ExampleData/eQTLweights.txt

Usage

--weight

  • Predict GReX
  • TWAS (with summary-level GWAS data)
  • VC-TWAS

6. PED

FAM_ID IND_ID FAT_ID MOT_ID SEX PHENO COV1 COV...
11A 11A X X 1 0.2 0.3 ...
  • Phenotype PED File
  • First five columns are Family ID, Individual ID, Paternal ID, Maternal ID, Sex (1=male; 2=female; other=unknown). The combination of family and individual ID should uniquely identify a person.
  • Phenotype is the 6th column. A PED file must have 1 and only 1 phenotype in the 6th column.
  • Other covariates start from the 7th column
  • Example: ./ExampleData/example_PED.ped

Usage

--PED

  • TWAS (with individual-level GWAS data)
  • VC-TWAS

7. PED Info

P PHENO
C COV1
C COV2
C SEX
  • Phenotype and Covariate Information File
  • Used to specify phenotype columns and covariate columns in the PED file to use for the TWAS.
  • Headerless, two-column, tab-delimited file. Specify one column per row. The first column of each row is the type (P for phenotype, C for covariate) and the second column in each row is the name of the column in the PED file.
  • Example: ./ExampleData/PED_Info_*.txt

Usage

--PED_info

  • TWAS (with individual-level GWAS data)
  • VC-TWAS

8. Genome Block Annotation File

CHROM Start End
1 100 20000
  • Genome block annotation file used for generating the reference LD covariance files (the LD file is required for TWAS with summary-level GWAS statistics)
  • A tab-delimited text file with 3 columns CHROM Start End, denoting the chromosome number, block start position, and block ending position
  • Block annotation files can be created from genome segmentation files generated by LDetect
  • Example: ./ExampleData/example_genome_block_CHR1.txt

Usage

--genom_block

  • Generate Reference LD Genotype Covariance Files

9. LD File

  • Contains LD covariance coefficients generated from reference data.
  • Used for TWAS with GWAS summary statistics
  • LD files for TWAS should be made with the same reference data used for model training.

Usage

--LD

  • TWAS (with summary-level GWAS data)

10. Zscore File

CHROM POS REF ALT Zscore
1 100 C T 0.01
  • Contains GWAS summary statistics (Zscore) for TWAS
  • First 4 columns have the same format as the first 4 columns of a VCF file.
  • The name of the column containing the Zscore statistic values to use must be Zscore.
  • The file must be sorted by chromosome and base pair position, bgzipped by bgzip, and tabixed by tabix. Example tabix command: tabix -f -p vcf *_Zscore.txt.gz.
  • Example: ./ExampleData/CHR1_GWAS_Zscore.txt.gz

Usage

--Zscore

  • TWAS (with summary-level GWAS data)

11. GWAS result File

| CHROM | POS | REF | ALT | BETA | SE | |:-----:|:---:|:---:|:---:|:------:| | 1 | 100 | C | T | 0.01 | 0.52 |

  • Contains GWAS summary statistics (BETA) and (SE) for TWAS
  • First 4 columns have the same format as the first 4 columns of a VCF file.
  • The name of the column containing the BETA and SE to use must be BETA and SE.
  • The file must be sorted by chromosome and base pair position, bgzipped by bgzip, and tabixed by tabix. Example tabix command: tabix -f -p vcf *_GWAS.txt.gz.
  • Example: ./ExampleData/sample_GWAS_Result.txt.gz

Usage

--GWAS_result

  • VC-TWAS (with summary-level GWAS data)

Example Usage

1. Train gene expression imputation models per Chromosome

Arguments for both imputation models

  • --model: Gene expression imputation model: elastic_net or DPR
  • --gene_exp: Path to Gene annotation and Expression file
  • --train_sampleID: Path to a file with sampleIDs that will be used for training
  • --chr: Chromosome number need to be specified with respect to the genotype input data
  • --genofile: Path to the training genotype file (bgzipped and tabixed)
  • --genofile_type: Genotype file type: vcf or dosage
  • --format: (Required if genofile_type is vcf) Genotype data format that should be used: GT or DS
    • GT: genotype data
    • DS: dosage data
  • --missing rate: Missing rate threshold. If the rate of missing values for a SNP exceeds this value the SNP will be excluded. Otherwise, missing values for non-excluded SNPs will be imputed as the mean. (default: 0.2)
  • --maf: Minor Allele Frequency threshold (ranges from 0 to 1) to exclude rare variants (default: 0.01
  • --hwe: Hardy Weinberg Equilibrium p-value threshold to exclude variants that violated HWE (default: 0.00001)
  • --window: Window size (in base pairs) around gene region from which to include SNPs (default: 1000000 [+- 1MB region around gene region])
  • --cvR2: Whether to perform 5-fold cross validation by average R2: 0 or 1 (default: 1)
    • 0: Skip 5-fold cross validation evaluation
    • 1: Do 5-fold cross validation and only do final training if the average R2 of the 5-fold validations is at least >= 0.005
  • --thread: Number of simultaneous processes to use for parallel computation (default: 1)
  • --out_dir: Output directory (will be created if it does not exist)
  • --TIGAR_dir: Specify the directory of TIGAR source code

Train nonparametric Bayesian DPR imputation model

Additional arguments
  • --dpr: Bayesian inference algorithm used by DPR: 1 or 2 (default: 1)
    • 1: Variational Bayesian, faster but may be less accurate
    • 2: MCMC, slower but more accurate
  • --ES: Output effect size type: fixed or additive (default: fixed)
    • fixed: use fixed effects only
    • additive: use the sum of fixed and random effects
Example Command
# Setup input file paths
Gene_Exp_train_file="${TIGAR_dir}/ExampleData/gene_exp.txt"
train_sample_ID_file="${TIGAR_dir}/ExampleData/sampleID.txt"
genofile="${TIGAR_dir}/ExampleData/example.vcf.gz"
out_dir="${TIGAR_dir}/ExampleData/output"

# Call TIGAR model training shell script
${TIGAR_dir}/TIGAR_Model_Train.sh \
--model DPR \
--gene_exp ${Gene_Exp_train_file} \
--train_sampleID ${train_sample_ID_file} \
--chr 1 \
--genofile ${genofile} \
--genofile_type vcf \
--format GT \
--maf 0.01 \
--hwe 0.0001 \
--cvR2 1 \
--dpr 1 \
--ES fixed \
--thread 2 \
--out_dir ${out_dir} \
--TIGAR_dir ${TIGAR_dir}

Train Elastic-Net imputation model

Additional arguments
  • --cv: Number of cross validation folds for tuning elastic-net penalty parameter (default: 5)
  • --alpha: Fixed L1 & L2 penalty ratio for elastic-net model (default: 0.5)
    • alpha=0: equivalent to lasso regression
    • alpha=1: equivalent to ridge regression
Example Command
Gene_Exp_train_file="${TIGAR_dir}/ExampleData/gene_exp.txt"
train_sample_ID_file="${TIGAR_dir}/ExampleData/sampleID.txt"
genofile="${TIGAR_dir}/ExampleData/example.vcf.gz"
out_dir="${TIGAR_dir}/ExampleData/output"

${TIGAR_dir}/TIGAR_Model_Train.sh \
--model elastic_net \
--gene_exp ${Gene_Exp_train_file} \
--train_sampleID ${train_sample_ID_file} \
--chr 1 \
--genofile ${genofile} \
--genofile_type vcf \
--format GT \
--maf 0.01 \
--hwe 0.0001 \
--cvR2 1 \
--alpha 0.5 \
--thread 2 \
--out_dir ${out_dir} \
--TIGAR_dir ${TIGAR_dir}

Output

  • CHR${chr}_${model}_eQTLweights.txt is the file storing all eQTL effect sizes (ES) estimated from the gene expression imputation model per gene (TargetID)
  • CHR${chr}_${model}_GeneInfo.txt is the file storing information about the fitted gene expression imputation model per gene (per row), including gene annotation (CHROM GeneStart GeneEnd TargetID GeneName), sample size, number of SNPs used in the model training (n_snp), number of SNPs with non-zero eQTL effect sizes (n_effect_snp), imputation R2 by 5-fold cross validation (CVR2), imputation R2 using all given training samples (TrainR2)
  • CHR${chr}_${model}_train_log.txt is the file storing all log messages for model training.

2. Predict GReX

Predict GReX value with given variant weights (eQTL effect sizes) from trained gene expression imputation models and individual-level genotype data of test samples

Arguments

  • --gene_anno: Gene annotation file to specify the list of genes, which is of the same format as the first five columsn of gene expression file
  • --test_sampleID: Path to a file with sampleIDs that should be contained in the genotype file which should be used for the prediction
  • --chr: Chromosome number need to be specified with respect to the genotype input data
  • --weight: Path to SNP weight (eQTL effect size) file
  • --genofile: Path to the training genotype file (bgzipped and tabixed)
  • --genofile_type: Genotype file type: vcf or dosage
  • --format: (Required if genofile_type is vcf) Genotype data format that should be used: GT or DS
    • GT: genotype data
    • DS: dosage data
  • --window: Window size (in base pairs) around gene region from which to include SNPs (default: 1000000 [+- 1MB region around gene region])
  • --missing rate: Missing rate threshold. If the rate of missing values for a SNP exceeds this value the SNP will be excluded. Otherwise, missing values for non-excluded SNPs will be imputed as the mean. (default: 0.2)
  • --maf_diff: MAF difference threshold. If the difference in MAF between a matching SNP in the eQTL weight file and test genotype file is greater than maf_diff, that SNP will be excluded. (default: 0.2)
  • --thread: Number of simultaneous processes to use for parallel computation (default: 1)
  • --out_dir: Output directory (will be created if it does not exist)
  • --TIGAR_dir: Specify the directory of TIGAR source code

Output

  • CHR${chr}_Pred_GReX.txt: File containing predicted GReX values in the same format as the input gene expression file
  • CHR${chr}_Pred_log.txt: File containing log messages

Example Command

gene_anno_file="${TIGAR_dir}/ExampleData/gene_anno.txt"
test_sample_ID_file="${TIGAR_dir}/ExampleData/test_sampleID.txt"
eQTL_ES_file="${TIGAR_dir}/ExampleData/eQTLweights.txt.gz"

${TIGAR_dir}/TIGAR_GReX_Pred.sh \
--gene_anno ${gene_anno_file} \
--test_sampleID ${test_sample_ID_file} \
--chr 1 \
--weight ${eQTL_ES_file} \
--genofile ${genofile} \
--genofile_type vcf \
--format GT \
--thread 2 \
--out_dir ${out_dir} \
--TIGAR_dir ${TIGAR_dir}

3. TWAS

Arguments for both TWAS types

  • --asso: Specify TWAS type: 1 or 2
    • 1: individual-level TWAS using individual-level phenotype data and predicted GReX (generated by TIGAR_GReX_Pred.sh)
    • 2: summary-level TWAS using GWAS summary Z-score statistics and reference LD (generated by TIGAR_LD.sh)
  • --thread: Number of simultaneous processes to use for parallel computation (default: 1)
  • --out_dir: Output directory (will be created if it does not exist)
  • --TIGAR_dir: Specify the directory of TIGAR source code

TWAS using individual-level GWAS data

Additional arguments
  • --gene_exp: Path to predicted GReX file
  • --PED: Path to PED file that contains phenotype and covariate data
  • --PED_info: A two-column, tab-delimited file specifying phenotype columns and covariate columns in the PED file to use for the TWAS. Specify one column per row. Each row should start with the type of column (P for phenotype, C for covariate) and the column name.
  • --method: Method to use based on type of phenotype(s): OLS or Logit (default: OLS)
    • OLS: quantitative phenotype or multivariate phenotypes
    • Logit: dichotomous univariate phenotype
Output
  • indv_${method}_assoc.txt
  • indv_OLS_TWAS_log.txt: File containing log messages
Example Command
GReX_pred_file="${TIGAR_dir}/ExampleData/CHR1_Pred_GReX.txt"
PED="${TIGAR_dir}/ExampleData/example_PED.ped"
PED_info="${TIGAR_dir}/ExampleData/PED_Info_SinglePheno.txt"

${TIGAR_dir}/TIGAR_TWAS.sh \
--asso 1 \
--gene_exp ${GReX_pred_file} \
--PED ${PED} \
--PED_info ${PED_info} \
--method OLS \
--thread 2 \
--out_dir ${out_dir} \
--TIGAR_dir ${TIGAR_dir}

TWAS using summary-level GWAS data

Additional arguments
  • --gene_anno: Path to gene annotation file to specify a list of gene for TWAS. The first six columns of the gene expression file can be used here.
  • --chr: Chromosome number need to be specified to conduct TWAS per chromosome
  • --weight: Path to SNP weight (eQTL effect size) file. Output file _eQTLweights.txt from model training can be used here.
  • --Zscore: Path to GWAS summary Zscore statistics.
  • --LD: Path to reference LD (SNP genotype covariance matrix) that should be bgzipped and tabixed. Can be generated by using our ${TIGAR_dir}/TWAS/Get_LD.py script with individual-level genotype data from a reference sample.
  • LD files for TWAS should be made with the same reference data used for model training.
  • The tool assumes that the reference/alternate allele orientation of SNPs in the weight file match the LD file. Zscore SNPs are "flipped" to match the weight file.
  • --window: Window size (in base pairs) around gene region from which to include SNPs (default: 1000000 [+- 1MB region around gene region])
  • --weight_threshold: include only SNPs with magnitude of weight greater than this value when conducting TWAS (default: 0)
  • --test_stat: burden Z test statistic to calculate: FUSION, SPrediXcan, or both (default both)
    • FUSION: calculate output Zscore and Pvalue results based on method used by FUSION
    • SPrediXcan: calculate output Zscore and Pvalue results based on method used by S-PrediXcan
    • both: calculate and output Zscore and Pvalue results for both methods
Output
  • CHR${chr}_sumstat_assoc.txt
  • CHR${chr}_TWAS_log.txt
Example Command
gene_anno_file="${TIGAR_dir}/ExampleData/gene_anno.txt"
eQTL_ES_file="${TIGAR_dir}/ExampleData/eQTLweights.txt.gz"
Zscore_file="${TIGAR_dir}/ExampleData/CHR1_GWAS_Zscore.txt.gz"
LD_file="${TIGAR_dir}/ExampleData/CHR1_reference_cov.txt.gz"

${TIGAR_dir}/TIGAR_TWAS.sh \
--asso 2 \
--gene_anno ${gene_anno_file} \
--chr 1 \
--weight ${eQTL_ES_file} \
--Zscore ${Zscore_file} \
--LD ${LD_file} \
--thread 2 \
--out_dir ${out_dir} \
--TIGAR_dir ${TIGAR_dir}

4. Generate reference LD genotype covariance files

Reference LD genotype covaraince file can be generated using reference genotype data, which should be generated per chromosome with corresponding genome segmentation block anntoation file.

LD files for TWAS should be made with the same reference data used for model training.

Arguments

  • --genom_block: Path to genome segmentation block annotation based on LD
  • --sampleID: Path to a file with sampleIDs to be used for generating reference LD files.
  • --chr: Specify chromosome number of the given reference genotype file
  • --genofile: Path to the reference genotype file (bgzipped and tabixed). It is recommended that genotype files with data for multiple chromosomes be split into per-chromosome files.
  • --genofile_type: Genotype file type: vcf or dosage
  • --format: (Required if genofile_type is vcf) Genotype data format that should be used: GT or DS
    • GT: genotype data
    • DS: dosage data
  • --maf: Minor Allele Frequency threshold (ranges from 0 to 1) to exclude rare variants (default: 0.01)
  • --thread: Number of simultaneous processes to use for parallel computation (default: 1)
  • --out_dir: Output directory (will be created if it does not exist)
  • --TIGAR_dir: Specify the directory of TIGAR source code

Output

  • CHR${chr}_reference_cov.txt.gz
  • CHR${chr}_reference_cov.txt.gz.tbi
  • CHR${chr}_LD_log.txt

Example Command

block_annotation="${TIGAR_dir}/ExampleData/example_genome_block_CHR1.txt"
sample_id="${TIGAR_dir}/sampleID.txt"

${TIGAR_dir}/TIGAR_LD.sh \
--genome_block ${block_annotation} \
--sampleID ${sample_id}
--chr 1 \
--genofile ${genofile} \
--genofile_type vcf \
--format GT \
--maf 0.01 \
--thread 2 \
--out_dir ${out_dir} \
--TIGAR_dir ${TIGAR_dir}

5. VC-TWAS

VC-TWAS with cis-eQTL effect sizes estimated from nonparametric Bayesian DPR method for Univariate phenotypes.

Arguments for both VC-TWAS types

  • --gene_anno: Gene annotation file to specify the list of genes, which is of the same format as the first five columns of gene expression file
  • --weight: Path to SNP weight (eQTL effect size) file. The weight file must be generated using the DPR model.
  • --weight_threshold: Threshold for estimated cis-eQTL effect sizes, filter SNPs with absolute cis-eQTL effect sizes smaller than threshold
  • --chr: Chromosome number need to be specified with respect to the genotype input data
  • --window: Window size around gene region from which to include SNPs (default 1000000 for +- 1MB region around gene region)
  • --thread: Number of simultaneous processes to use for parallel computation (default: 1)
  • --out_dir: Output directory (will be created if it does not exist)
  • --TIGAR_dir: Specify the directory of TIGAR source code

VC-TWAS using individual-level GWAS data

Additional arguments
  • --PED: Path to PED file that contains phenotype and covariate data
  • --PED_info: A two-column, tab-delimited file specifying phenotype columns and covariate columns in the PED file to use for the TWAS. Specify one column per row. Each row should start with the type of column (P for phenotype, C for covariate) and the column name.
  • --test_sampleID: Path to a file with sampleIDs that should be contained in the genotype file
  • --genofile: Path to the training genotype file (bgzipped and tabixed)
  • --genofile_type: Genotype file type: vcf or dosage
  • --format: (Required if genofile_type is vcf) Genotype data format that should be used: GT or DS
    • GT: genotype data
    • DS: dosage data
  • --maf: Minor Allele Frequency threshold (ranges from 0 to 1) to exclude rare variants (default: 0.01))
  • --phenotype_type: phenotype type: C or D
    • C: continous phenotype
    • D: binomial (dichotomous) phenotype

Output

  • CHR${chr}_indv_VC_TWAS.txt: File containing VC_TWAS results with TargetID gene information
  • CHR${chr}_VC_TWAS_log.txt: File containing log messages

Example Command

gene_anno_file="${TIGAR_dir}/ExampleData/gene_anno.txt"
PED_info="${TIGAR_dir}/ExampleData/PED_Info_SinglePheno.txt"
test_sample_ID_file="${TIGAR_dir}/ExampleData/test_sampleID.txt"
eQTL_ES_file="${TIGAR_dir}/ExampleData/eQTLweights.txt"
genofile="${TIGAR_dir}/ExampleData/example.vcf.gz"

# continuous phenotype
PED="${TIGAR_dir}/ExampleData/example_PED.ped"
${TIGAR_dir}/VC_TWAS.sh \
--gene_anno ${gene_anno_file} \
--PED ${PED} \
--PED_info ${PED_info} \
--test_sampleID ${test_sample_ID_file} \
--chr 1 \
--weight ${eQTL_ES_file} \
--genofile ${genofile} \
--genofile_type vcf \
--format GT \
--weight_threshold 0.0001 \
--phenotype_type C \
--thread 2 \
--out_dir ${out_dir} \
--TIGAR_dir ${TIGAR_dir}

# dichotomous phenotype
PED="${TIGAR_dir}/ExampleData/example_PED_binary.ped"
${TIGAR_dir}/VC_TWAS.sh \
--gene_anno ${gene_anno_file} \
--PED ${PED} \
--PED_info ${PED_info} \
--test_sampleID ${test_sample_ID_file} \
--chr 1 \
--weight ${eQTL_ES_file} \
--genofile ${genofile} \
--genofile_type vcf \
--format GT \
--weight_threshold 0.0001 \
--phenotype_type D \
--thread 2 \
--out_dir ${out_dir} \
--TIGAR_dir ${TIGAR_dir}

VC-TWAS using summary-level GWAS data

Additional arguments
  • --GWAS_result: Path to GWAS result file
  • --sample_size: Sample size of summary-level GWAS data
  • --LD: Path to Reference LD genotype covariance file

Output

  • CHR${chr}_sum_VC_TWAS.txt: File containing VC_TWAS results with TargetID gene information
  • CHR${chr}_sum_VC_TWAS_log.txt: File containing log messages

Example Command

gene_anno_file="${TIGAR_dir}/ExampleData/gene_anno.txt"
eQTL_ES_file="${TIGAR_dir}/ExampleData/eQTLweights.txt.gz"
ld_file="${TIGAR_dir}/ExampleData/eQTLweights.txt.gz"
gwas_file="${TIGAR_dir}/ExampleData/sample_GWAS_Result.txt.gz"
${TIGAR_dir}/VC_TWAS_summary.sh \
--gene_anno ${gene_anno_file} \
--GWAS_result ${gwas_file} \
--Weight ${eQTL_ES_file} \
--sample_size ${sample_size} \
--weight_threshold 0.0001 \
--LD ${ld_file} \
--chr 1 \
--window 1000000 \
--thread 1 \
--out_dir ${out_dir}

6. BGW-TWAS

Association study using cis- and trans-eQTL effect sizes estimate by BGW-TWAS.

Arguments for both VC-TWAS types

  • --gene_list: Gene list file to specify the names of genes
  • --weight_prefix: Start of path to SNP weight (eQTL effect size) file generated by BGTWAS. This string is the path (including directory) before the gene name.
  • --weight_suffix: End of the path to SNP weight (eQTL effect size) file generated by BGTWAS. This string is the path (including extension) after the gene name.
  • --LD_prefix: Start of path to LD file generated by TIGAR. This string is the path (including directory) before the chromosome number.
  • --LD_suffix: End of the path to LD file generated by TIGAR. This string is the path (including extension) after the chromosome number.
  • --Zscore_prefix: Start of path to Zscore summary statistics file. This string is the path (including directory) before the chromosome number.
  • --Zscore_suffix: End of the path to Zscore summary statistics file. This string is the path (including extension) after the chromosome number.
  • --weight_threshold: Threshold for estimated cis-eQTL effect sizes, filter SNPs with absolute cis-eQTL effect sizes smaller than threshold. (default: 0; Note that BGWTWAS already filters weights.)
  • --thread: Number of simultaneous processes to use for parallel computation (default: 1)
  • --out_dir: Output directory (will be created if it does not exist)
  • --out_prefix: Start of log and output file names. (default: BGWTWAS)
  • --out_twas_file: Output file name. (default: ${out_prefix}_asso.txt, BGWTWAS_asso.txt)
  • --out_log_file: Output log file. (default: ${out_prefix}_log.txt, BGWTWAS_log.txt)
  • --TIGAR_dir: Specify the directory of TIGAR source code

Example Command

out_dir=${TIGAR_dir}/ExampleData/

gene_list=${TIGAR_dir}/ExampleData/gene_list.txt

weight_prefix='${TIGAR_dir}/ExampleData/BGWTWASweights/'
weight_suffix='_BGW_eQTL_weights.txt'

LD_prefix='${TIGAR_dir}/ExampleData/CHR'
LD_suffix='_reference_cov.txt.gz'

Zscore_prefix='${TIGAR_dir}/ExampleData/CHR'
Zscore_suffix='_Zscore.txt.gz'

python ${TIGAR_dir}/TWAS/BGWTWAS_Asso_Study.py \
	--gene_list ${gene_list} \
	--weight_prefix ${weight_prefix} \
	--weight_suffix ${weight_suffix} \
	--Zscore_prefix ${Zscore_prefix} \
	--Zscore_suffix ${Zscore_suffix} \
	--LD_prefix ${LD_prefix} \
	--LD_suffix ${LD_suffix} \
	--thread 1 \
	--TIGAR_dir ${TIGAR_dir} \
	--out_dir ${out_dir}

Updates

  • added --missing_rate option (for excluding SNPs with many missing values) to model training and prediction scripts(default: 0.2)
  • weight files now automatically sorted and bgzipped/tabixed during training step
  • Removed 'dfply' Python package dependency
  • Added VC-TWAS method
  • Added SPrediXcan test statistic calculation to TWAS with summary-level GWAS data (in addition to existing FUSION test statistic); added option (--test_stat) to select which test statistic to use (default: both)
  • Added --weight-threshold option (for excluding SNPs with small effect-sizes) to summary-level TWAS (default: 0)
  • Added --sampleID argument to TIGAR_LD.sh
  • Reduced size of output LD files
  • Improved memory usage of Python scripts (especially for model-training and TWAS with summary-level GWAS data)
  • Improved speed (especially for model-training)
  • Improved error handling for parallelized functions
  • Added/improved log output in all Python scripts
  • Added time elapsed calculation for logging
  • Removed intermediate call_DPR.sh script
  • Added script to do summary-level association test with BGWTWAS weights.

Reference

About


Languages

Language:Python 79.4%Language:Shell 20.6%