A network-based approach for isolating the chronic inflammation gene signatures underlying complex diseases towards finding new treatment opportunities

This GitHub repository contains all code used for reproducing results from the manuscript "A network-based approach for isolating the chronic inflammation gene signatures underlying complex diseases towards finding new treatment opportunities", which can be found [here] (Get link)

This markdown document provides instructions on how to run the code for a sample disease and how to recreate the results.

General requirements

Unix or Unix-like OS
Anaconda3 distribution
R version >=4.0.0
Slurm workload manager (to recreate the full project results)

R libraries needed:

tidyverse 1.3.1
parallel
mccf1 1.1
grid
org.Hs.eg.db 3.12.0
igraph 1.2.9
topGO 2.42.0
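Before running anything, it can help to confirm that the command-line tools listed above are available. The following is a minimal shell sketch and is not part of the repository: `missing_tools` and `version_ge` are hypothetical helper names, and `version_ge` assumes a `sort` that supports `-V` (for example GNU coreutils).

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of this repository).

# Print each named tool that cannot be found on PATH, one per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Succeed when version $1 is greater than or equal to version $2.
# Assumes a sort that supports -V (version sort), e.g. GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example usage:
#   missing_tools Rscript conda python sbatch
#   r_version=$(Rscript -e 'cat(as.character(getRversion()))')
#   version_ge "$r_version" 4.0.0 || echo "R >= 4.0.0 is required" >&2
```

An empty result from `missing_tools` means every listed tool was found; `sbatch` only matters when recreating the full project results with Slurm.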

Zenodo Download

Required data, as well as copies of pertinent results, are included in our Zenodo record.

Note that this record contains ~45 GB.

The data can be downloaded with the script get_data.sh. Run this script inside the repository so that the Zenodo folder, data_Zenodo, is created at the default file path used by the commands below.
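Because the record is large, a quick free-space check before downloading can save a failed transfer. This is only a sketch: `has_free_gb` is a hypothetical helper, not something provided by the repository, and the 50 GB figure in the usage comment is an arbitrary margin over the ~45 GB stated above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository).

# Succeed when the filesystem holding directory $1 has at least $2 GB free.
has_free_gb() {
  local dir="$1" needed_gb="$2" avail_kb
  # df -P prints POSIX-format output; -k reports sizes in 1024-byte blocks.
  # The fourth field of the second line is the available space.
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $((needed_gb * 1024 * 1024)) ]
}

# Example usage:
#   has_free_gb . 50 && bash get_data.sh
```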

This folder, data_Zenodo, will include:

GenePlexus: A local instance of GenePlexus
GenePlexus_parameter_checks: Output for running GenePlexus with different arguments
GenePlexus_String_Adjacency: Record of output from pipeline and analysis
pascal_out: Pascal output files for the UK Biobank traits we used
clinical_trials: Clinical trial data for analysis
biogrid: Biogrid network
ConsensusPathDB: ConsensusPathDB network
string: String network
string-exp: String-exp network
prediction_clusters_same_graph: Compendium of results to create paper figures

data directory

The 'data' directory has some required data. This includes:

disease_gene_files: Seed gene lists generated in the pipeline.
drugcentral: Data from DrugCentral and scripts used to format it.
dgidb: Data from DGIdb and scripts used to download/format it.

src directory

Scripts used to run the pipeline are located in src.

It also contains chronic_inflammation_functions.R, which provides utility functions used by most scripts in the pipeline.

run directory

run contains the Slurm submission scripts that were used for our analysis. This README has instructions for running each script without Slurm for a sample disease.

figures directory

figures contains R Markdown notebooks used to analyze the final results. These results can be found in the Zenodo record.

Pipeline instructions

The instructions in this README show how to run the pipeline with one sample disease. Running all traits used in this project requires the Slurm workload manager and is impractical without it.
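When the steps are run by hand for one disease, a small wrapper that echoes each command and stops at the first failure makes it easier to see where a run broke. The sketch below is an illustration only: `run_step` and the `DRY_RUN` variable are hypothetical names and do not exist in the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository).

# Print a labelled command, then run it; return non-zero if it fails.
# With DRY_RUN=1 the command is printed but not executed.
run_step() {
  local name="$1"
  shift
  printf '[%s] %s\n' "$name" "$*"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    return 0
  fi
  "$@" || { printf 'step failed: %s\n' "$name" >&2; return 1; }
}

# Example usage, from the src directory:
#   run_step "format seed genes" Rscript prep_disease_gene_dfs.R \
#     ../data/chronic_inflammation_diseases_non-ovlp_cuid.txt ../data || exit 1
```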

Format seed genes

Script:
prep_disease_gene_dfs.R

Purpose: Creates the seed gene files in data/disease_gene_files for the specified diseases from DisGeNET.

Arguments:

Text file with one column (no header) containing the DisGeNET disease IDs of interest
Directory in which the "disease_gene_files" folder will be created

Run:

Rscript prep_disease_gene_dfs.R \
 ../data/chronic_inflammation_diseases_non-ovlp_cuid.txt \
 ../data
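The first argument must be a single-column file with no header, so a quick format check can catch problems before R is started. The sketch below assumes that the disease IDs are UMLS CUIs of the form `C` followed by digits; this is inferred from the `cuid` suffix of the example file name and is not stated in this README. `check_id_file` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository).
# Assumption: each ID is a UMLS CUI such as C0024117.

# Succeed when every line of file $1 is a single field that looks like a CUI.
# Offending lines (including a header row or blank lines) are printed.
check_id_file() {
  awk 'BEGIN { bad = 0 }
       NF != 1 || $1 !~ /^C[0-9]+$/ { bad = 1; printf "line %d: %s\n", NR, $0 }
       END { exit bad }' "$1"
}

# Example usage:
#   check_id_file ../data/chronic_inflammation_diseases_non-ovlp_cuid.txt
```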

Getting inflammation genes from human genome

Script:
getInflammationGenesFrom_org.HS.eg.db.R

Purpose: Retrieves the human genes annotated to inflammation-related GO terms.

Arguments: N/A

Run:

Rscript getInflammationGenesFrom_org.HS.eg.db.R

Creating network edgelists

Script:
prepEdgelist.R

Purpose: Formats and creates an edgelist and Rdata object for a given network

Arguments:

Path to tab-delimited edgelist
Path to output directory
Network name
TRUE/FALSE: whether to keep edge weights

Run:

Rscript prepEdgelist.R \
 ../data_Zenodo/biogrid/biogrid_entrez_edgelist.txt \
 ../data_Zenodo/biogrid/ \
 bioGRID \
 FALSE
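Whether the last argument should be TRUE or FALSE depends on whether the edgelist carries edge weights. One way to inspect a file first is to count its tab-separated columns. This is a sketch under an assumption not confirmed by this README, namely that an unweighted edgelist has two columns and a weighted one has a third column for the weight; `edgelist_column_counts` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository).

# Report how many lines of tab-delimited file $1 have each column count.
# A clean edgelist should produce exactly one line of output.
edgelist_column_counts() {
  awk -F '\t' '{ counts[NF]++ }
               END { for (n in counts) printf "%d columns: %d lines\n", n, counts[n] }' "$1"
}

# Example usage:
#   edgelist_column_counts ../data_Zenodo/biogrid/biogrid_entrez_edgelist.txt
```

More than one line of output indicates rows with inconsistent column counts, which is worth resolving before the edgelist is formatted.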

Getting UK Biobank seed genes

Script:
getNegativeControls.R

Purpose: Creates the seed gene files in data/disease_gene_files for the UK Biobank traits used in this project.

Arguments:

File from Zenodo listing the UK Biobank traits of interest
Location of the Pascal output in Zenodo for each trait
Location of disease_gene_files, where the files will be written

Run:

Rscript getNegativeControls.R \
 ../data_Zenodo/our_ukbb_traits_description.tsv \
 ../data_Zenodo/pascal_out \
 ../data/

Running GenePlexus

Script:
bin/GenePlexus/example_run.py

Purpose: Runs GenePlexus on a trait of interest and outputs the results. This project used ConsensusPathDB, Adjacency, and DisGeNet for its final results.

Arguments:

-i : Disease seed genes
-j : Job name
-n : Network, options are BioGRID, STRING-EXP, STRING, ConsensusPathDB
-f : Features, options are Embedding, Adjacency, Influence
-g : GSC type, options are GO or DisGeNet
-s : Output directory
-fl : Option for how to run; always use local for this project

Run:

python example_run.py \
 -i ../../data/disease_gene_files/Chronic_Obstructive_Airway_Disease.txt \
 -j Chronic_Obstructive_Airway_Disease--ConsensusPathDB--Adjacency--DisGeNet \
 -n ConsensusPathDB \
 -f Adjacency \
 -g DisGeNet \
 -s ../../results/GenePlexus_output/ \
 -fl local
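The job name in the example joins the disease, network, feature type, and GSC with `--`. To run several diseases without Slurm, the same command can be generated for every seed gene file. The sketch below only prints the commands so they can be reviewed before anything is executed; `job_name` and `print_geneplexus_commands` are hypothetical helpers, and the naming pattern is inferred from the single example above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of this repository).

# Build a job name of the form disease--network--features--gsc.
job_name() {
  printf '%s--%s--%s--%s\n' "$1" "$2" "$3" "$4"
}

# Print one example_run.py command per .txt file in the seed directory.
# Arguments: seed_dir network features gsc output_dir
print_geneplexus_commands() {
  local seed_dir="$1" network="$2" features="$3" gsc="$4" out_dir="$5"
  local seed disease
  for seed in "$seed_dir"/*.txt; do
    [ -e "$seed" ] || continue   # the glob matched nothing
    disease=$(basename "$seed" .txt)
    printf 'python example_run.py -i %s -j %s -n %s -f %s -g %s -s %s -fl local\n' \
      "$seed" "$(job_name "$disease" "$network" "$features" "$gsc")" \
      "$network" "$features" "$gsc" "$out_dir"
  done
}

# Example usage, from the GenePlexus directory:
#   print_geneplexus_commands ../../data/disease_gene_files \
#     ConsensusPathDB Adjacency DisGeNet ../../results/GenePlexus_output/
```

The loop picks up every `.txt` file in the directory, so the printed list should be checked for files that are not disease seed lists.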

Summarize GenePlexus output

Script:
summarizeGeneplexusPredictions.R

Purpose: Produces several figures showing results for the network combination, along with a summary Rdata file containing the disease results used in later parts of the pipeline.

Arguments:

Path to directory with GenePlexus predictions
Output directory
Average cv threshold

Run:

Rscript summarizeGeneplexusPredictions.R \
  ../results/GenePlexus_output/ \
  ../results/GenePlexus_parameters \
  1.0

Clustering GenePlexus results

Script:
filterAndClusterGeneplexusPredictions.R

Purpose: Takes the GenePlexus predictions and assigns genes to clusters for each disease.

Arguments:

GenePlexus prediction path
Prediction threshold, either mccf1 or a number < 1
Path to igraph object containing network for clustering
Leiden algorithm partition type
Resolution parameter
GenePlexus results path
True/False, Is the network weighted?

Run:

Rscript filterAndClusterGeneplexusPredictions.R \
 ../results/GenePlexus_output/Chronic_Obstructive_Airway_Disease--ConsensusPathDB--Adjacency--DisGeNet--predictions.tsv \
 0.8 \
 ../data_Zenodo/ConsensusPathDB/ConsensusPathDB_igraph.Rdata \
 ModularityVertexPartition \
 0.1 \
 ../results/prediction_clusters_same_graph \
 TRUE

Clustering inflammation genes

Script:
clusterInflammationGenes.R

Purpose: Assigns the inflammation genes to clusters on the given network.

Arguments:

Path to inflammation genes
Path to igraph object that has network for clustering
Partition type
Resolution parameter
Results path
True/False, Is the network weighted?

Run:

Rscript clusterInflammationGenes.R \
 ../data/disease_gene_files/chronic_inflammatory_response_GO2ALLEGS.txt \
 ../data_Zenodo/ConsensusPathDB/ConsensusPathDB_igraph.Rdata \
 ModularityVertexPartition \
 0.1 \
 ../results/prediction_clusters_same_graph/ \
 TRUE

Clustering random genes

Script:
clusterRandomGenes.R

Purpose: Takes the 5000 fake traits that were randomly generated from a disease and assigns the genes to clusters

Arguments:

Path to data containing all fake traits generated
Disease of interest
Path to igraph object containing network for clustering
Partition type
Resolution parameter
Results path
True/False, Is the network weighted?

Run:

Rscript clusterRandomGenes.R \
 ../data_Zenodo/5000Expandedfaketraits_ConsensusPathDB.tsv \
 Chronic_Obstructive_Airway_Disease \
 ../data_Zenodo/ConsensusPathDB/ConsensusPathDB_igraph.Rdata \
 ModularityVertexPartition \
 0.1 \
 ../results/prediction_clusters_same_graph/clusters/predicted_withConsensusPathDB--clustered_on_ConsensusPathDB \
 TRUE

Get GOBP enriched clusters from GenePlexus results

Script:
find_GOBP_enriched_clusters_GenePlexus.R

Purpose: Finds GOBPs that are enriched in each cluster of a disease

Arguments:

Path to cluster file
Background genes from network
Output directory

Run:

Rscript find_GOBP_enriched_clusters_GenePlexus.R \
 ../results/prediction_clusters_same_graph/clusters/predicted_withConsensusPathDB--clustered_on_ConsensusPathDB/Chronic_Obstructive_Airway_Disease--threshold--0.8--PredictionGraph--ConsensusPathDB--ClusterGraph--ConsensusPathDB_clusters.csv \
 ../data_Zenodo/ConsensusPathDB/ConsensusPathDB_genes.csv \
 ../results/prediction_clusters_same_graph/GOBP_enrichment
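The cluster file name in this command encodes the disease, the prediction threshold, and the two graphs. When switching diseases or thresholds, building the path from its parts avoids typing errors. The sketch below reproduces the pattern seen in the example path only; whether the clustering script follows this pattern for every input is an assumption, and `cluster_file` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository).
# The pattern is inferred from the single example path in this README.

# Arguments: results_dir disease threshold prediction_graph cluster_graph
cluster_file() {
  local results="$1" disease="$2" threshold="$3" pred="$4" clust="$5"
  printf '%s/clusters/predicted_with%s--clustered_on_%s/%s--threshold--%s--PredictionGraph--%s--ClusterGraph--%s_clusters.csv\n' \
    "$results" "$pred" "$clust" "$disease" "$threshold" "$pred" "$clust"
}

# Example usage:
#   f=$(cluster_file ../results/prediction_clusters_same_graph \
#       Chronic_Obstructive_Airway_Disease 0.8 ConsensusPathDB ConsensusPathDB)
#   [ -f "$f" ] || echo "missing cluster file: $f" >&2
```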

Get GOBP enriched clusters for inflammation clusters

Script:
find_GOBP_enriched_inflammation_clusters.R

Purpose: Finds GOBPs that are enriched in each inflammation gene cluster

Arguments:

Path to cluster file
Background genes from network
Output directory

Run:

Rscript find_GOBP_enriched_inflammation_clusters.R \
 ../results/prediction_clusters_same_graph/clusters/predicted_withConsensusPathDB--clustered_on_ConsensusPathDB/Chronic_Obstructive_Airway_Disease--threshold--0.8--PredictionGraph--ConsensusPathDB--ClusterGraph--ConsensusPathDB_clusters.csv \
 ../data_Zenodo/ConsensusPathDB/ConsensusPathDB_genes.csv \
 ../results/prediction_clusters_same_graph/GOBP_enrichment

Cluster overlap scores

Script:
scoreClusterOverlaps_GenePlexus.R

Purpose: For a disease, outputs a file with the overlap score between the chronic inflammation genes and every real and fake trait cluster that has >=5 genes.

Also outputs a file with the genes shared between the chronic inflammation genes and each real and fake trait cluster.

Arguments:

Path to folder containing Leiden cluster output files
Path to chronic inflammation prediction file
Path to output directory
Disease of interest
Chronic inflammation prediction threshold
List of genes in network the disease genes were clustered on

Run:

Rscript scoreClusterOverlaps_GenePlexus.R \
 ../results/prediction_clusters_same_graph/clusters/predicted_withConsensusPathDB--clustered_on_ConsensusPathDB \
 ../results/GenePlexus_output/inflammatory_response_GO2EG_expr--ConsensusPathDB--Adjacency--GO--predictions.tsv \
 ../results/prediction_clusters_same_graph \
 Chronic_Obstructive_Airway_Disease \
 0.8 \
 ../data_Zenodo/ConsensusPathDB/ConsensusPathDB_genes.csv

Retrieve significant overlaps

Script:
filterSignificantOverlaps_GenePlexus.R

Purpose: Keeps the overlaps whose FDR passes the cutoff and outputs the significant clusters together with the real and fake cluster assignments.

Arguments:

overlap_results.Rdata location
FDR cutoff
Output directory
Path to gene cluster assignment files

Run:

Rscript filterSignificantOverlaps_GenePlexus.R \
 ../results/prediction_clusters_same_graph/scores/chronic_inflammatory_response_GO2ALLEGS_thresh=0.8_predicted_with_ConsensusPathDB_clusteredOn_ConsensusPathDB_overlap_results.Rdata \
 .05 \
 ../results/prediction_clusters_same_graph/ \
 ../results/prediction_clusters_same_graph/clusters/predicted_withConsensusPathDB--clustered_on_ConsensusPathDB

Running SAveRUNNER to compare clusters with STRING-EXP as the interactome

Script:
prepForClusterSaverunner.R

Purpose: Sets up files for running SAveRUNNER with cluster genes. The SAveRUNNER instance used for this step is stored in data_Zenodo/prediction_clusters_same_graph/SAveRUNNER.

Arguments:

SAveRUNNER input directory
Path to interactome edgelist
Path to "final for alex" file with significant clusters
Path to gene cluster assignments

Run:

Rscript prepForClusterSaverunner.R \
 ../data_Zenodo/prediction_clusters_same_graph/SAveRUNNER/code/input_files \
 ../data_Zenodo/prediction_clusters_same_graph/SAveRUNNER/code/input_files/interactome.txt \
 ../data_Zenodo/prediction_clusters_same_graph/chronic_inflammation_gene_shot_pubs_greater10--predictedWith--ConsensusPathDB--clusteredOn--ConsensusPathDB_final_for_alex.csv \
 ../data_Zenodo/prediction_clusters_same_graph/chronic_inflammation_gene_shot_pubs_greater10--predictedWith--ConsensusPathDB--clusteredOn--ConsensusPathDB_relevant_gene_cluster_assigments.csv

Finding drugs

Drugs were obtained using the SAveRUNNER software, located at https://github.com/sportingCode/SAveRUNNER. The instance is stored in data_Zenodo/drugs/SAveRUNNER

Analyses and visualizations

The figures directory has R Markdown scripts that recreate the figures (including supplemental figures) in the paper.

Drug data downloads

The relevant files from DrugCentral are included in data/drugcentral. They came from a local PostgreSQL instance of the DrugCentral database, which can be obtained from DrugCentral.

Relevant files from DGIdb are included in data/dgidb.

DrugCentral Entrez

Script:
data/drugcentral/getDrugCentralEntrez.R

Purpose: This script takes tables from DrugCentral and returns a mapping between drugs and their human Entrez gene targets. This output is already provided.

Run:

Rscript getDrugCentralEntrez.R
