Colorectal Cancer Meta-Analysis

Association studies have linked microbiome alterations with many human diseases. However, they have not always reported consistent results, thereby necessitating cross-study comparisons. Here, a meta-analysis of eight geographically and technically diverse fecal shotgun metagenomic studies of colorectal cancer (CRC, n = 768), which was controlled for several confounders, identified a core set of 29 species significantly enriched in CRC metagenomes (false discovery rate (FDR) < 1 × 10−5). CRC signatures derived from single studies maintained their accuracy in other studies. By training on multiple studies, we improved detection accuracy and disease specificity for CRC. Functional analysis of CRC metagenomes revealed enriched protein and mucin catabolism genes and depleted carbohydrate degradation genes. Moreover, we inferred elevated production of secondary bile acids from CRC metagenomes, suggesting a metabolic link between cancer-associated gut microbes and a fat- and meat-rich diet. Through extensive validations, this meta-analysis firmly establishes globally generalizable, predictive taxonomic and functional microbiome CRC signatures as a basis for future diagnostics.

Please note:
The data are at the moment only available from within the EMBL intranet, but will be moved to Zenodo or another repository soon.

Requirements

This project requires R version 3.5.1 -- "Feather Spray"
Please make sure that you have all required packages installed. You can easily check whether that is the case by running
```
Rscript requirements.R
```
and confirm that it does not produce any errors.
This project relies heavily on the R package SIAMCAT.
For the case that you are using an older version of the package, it will undoubtedly throw some errors. Thus, please make sure that you have the version 1.1.0 installed. You can install the correct version of SIAMCAT via the gitlab repository. Download the tar-ball of the package and then install the package by typing in R:
```
install.packages("SIAMCAT-v1.1.0.tar",
    repos = NULL, type="source")
```
- you may have to change the path to the path of the downloaded file
- in Windows, you will need to have the package rtools installed > - i guess... I have not tested this on a Windows system yet
You will need the GMMs tools to calculate the abundance of gut metabolic modules. In this project, I used the version fcc7ac1 of the GMMs tool.

Workflow

To reproduce the main figures of the project, you can follow these instructions.

1. Setup

First, you will have to download the taxonomic and functional profiles and the metadata needed for this project. They are stored on the ftp server of the Zeller team and will be automatically downloaded and cleaned by running this script.

cd ./src
Rscript prepare_data.R

2. Association Testing

In order to run the association testing pipeline, please use the marker_analysis.R script. The script computes the generalised fold change (gFC), area under the Receiver-Operating-Characteristics curve (AUROC), and the significance of single markers using a blocked Wilcoxon test. Additionally, the script will run ANCOM for taxonomic profiles. The marker analysis can be run for different sets of features, e.g. KEGG functional profiles and species-level taxonomic profiles.

Rscript marker_analysis.R KEGG
Rscript marker_analysis.R species

The marker heatmap and the forest plot can be generated by calling this script:

Rscript figure_marker_heatmap.R species

Note; this script may break if you use it for other features than species, e.g. KEGG functional profiles
in this case, you may need to manually tweak the script to get satisfying results

As an additional analysis, we tested how well the meta-analysis associations (seen here as a true set of associations) can be found when assessing only a single study. This is done by calculating precision and recall for the detection of the true set given the single study associations.

Rscript figure_precision_recall.R

Since the concept of the generalized fold change may be new, there is a script that creates a figures which should aid the explanation of the concept:

Rscript figure_fc_explanation.R

3. Confounder Analysis

To reproduce the confounder analysis and the accompanying figure, please type:

Rscript confounder_analysis.R species

The script will compute the variance explained by the disease status (i.e. CRC) and by potential confounding variables (e.g. Age, Sex, BMI, etc.). The results from this analysis motivated the inclusion of Study and Colonoscopy into the differential abundance testing pipeline (by blocking for study and colonoscopy status).
The script will also produce a plot contrasting the variance explained by the potential confounder and by disease status.

If you want to recreate the alpha/beta diversity analysis (which also shows the strong influence of Study as a confounder), you can use this script:

Rscript figure_a_b_diversity.R

We assessed another potential confounder in our data, namely the compositional nature of microbiome data. The method ANCOM accounts for compositionality effects and was applied to the data alongside the blocked Wilcoxon test. The following script reproduces the supplementary figure comparing Wilcoxon and ANCOM:

Rscript figure_ancom.R

4. Clustering

We found that the most strongly associated species can be grouped into four different clusters with preferential enrichment in different patient subgroup. The code to reproduce this analysis can be executed by running:

Rscript cluster_species.R

Note: The ordering of the species in the heat map is not identical to the ordering in the clustered histogram

The script will also produce a set of plots showing the positivity for each species clusters in different patient subgroups.

5. Machine Learning Pipeline

The machine learning pipeline can be run using the train_models.R script. The script will train one LASSO model for each study and one model for each leave-one-study-out setting. The models will be save in the respective ./models/ folder. Please note that this step can take quite some time, depending on the input features. For example, training all models for eggNOG input features took approximately 8 hours on my machine.

After training the models, you can also reproduce the performance figure, using the figure_performance.R script.

Rscript train_models.R species
Rscript figure_performance.R species

Note: training all models for species input features took circa 45 min on my machine.

If you want to run the machine learning pipeline with another machine learning method, you can

either change the respective entry in the parameters.yaml file
or you supply a second command line argument to train_models.R.

For example, if you want to train a Random Forest model and check the performance, you can type:

Rscript train_models.R species randomForest
Rscript figure_performance.R species randomForest

There are a few additional analyses that can be performed:

Bias in the predictions:
To test the predictions for any bias from potential confounding variables, use the script:

Rscript figure_prediction_bias.R species

In this script, you can give again the feature type and additionally the machine learning method as command line arguments.

Classifier weights:
For the LASSO classifier, we can also investigate the feature weights in more detail, using the script:

Rscript figure_weights.R

Combined classifier:
We can also run a classifier on the combination of eggNOG and species features:

Rscript train_models_combination.R

Performance on the gene catalogue:
LASSO models had also been trained on the complete gene catalogue, albeit not with SIAMCAT, since there are too many features. Therefore, we need another script to plot the results:

Rscript figure_performance_genes.R

6. Functional Enrichment

Gut Metabolic Modules
The gut metabolic modules are based on the publication by Vireia-Silva et al. and the accompanying tool GMMs. The GMMs are computed based on KEGG abundances. Please follow the script compute_GMMs.R, in which you will also find instructions to run the GMMs tool.
The main plots for the GMMs can be replicated by calling:

Rscript figure_GMMs.R

Another analysis trying to link the GMMs with species level abundances can be re-run with:

Rscript link_GMM_species.R

Potential CRC-related functions
In order to test several different CRC-related functions of the microbiome, we screened the IGC with several HMMs. The HMMs were created with HMMER and can be found under ./files/genes/hmms/.
To recreate the panels of the main and the extended data figure, you can type:

Rscript figure_gene_abundance.R
Rscript figure_bai_expression.R
Rscript figure_bai_extended.R

7. External Validation

There are two different types of external validations included in this project.

Other CRC studies
To check the associations and the classification accuracy in other CRC studies, you can run these scripts:

Rscript external_validation_associations.R
Rscript external_validation_classification.R species

Here, you can use all the models that have been trained before (i.e. species/KEGG/eggNOG).
Again, you can indicate another machine learning method as second command line argument.
Also, the functional enrichment can be reproduced in the external CRC studies. In order to do so, you can type:

Rscript external_validation_genes.R

Non-CRC studies
We also looked at other studies with shotgun metagenomic analyses that included non-CRC patients to check how much cross-classification we get with other diseases. In order to run this analysis, you can use:

Rscript external_validation_cross_classification.R

Note: since this analysis only works on species-level taxonomic profiles as input features, there is no argument to specify the features. But again, there is an optional argument for the machine learning method to be used (if the method differs from the method specified in the parameters.yaml file)

Contact

If you have questions about the code or the analyses presented in this project, please feel free to contact me, Jakob Wirbel 😃

zellerlab / crc_meta