gregmedlock/ssl_ensembles

Project summary

This repository holds all code and data for "Guiding the refinement of biochemical knowledgebases with ensembles of metabolic networks and machine learning", Medlock & Papin, bioRxiv 2018, currently in press at Cell Systems. The manuscript and the accompanying code describe Automated Metabolic Model Ensemble-Driven Elimination of Uncertainty with Statistical learning (AMMEDEUS). Description of all analyses contained in this repository are expanded upon in the manuscript.

Our goal was to develop a system for guiding the curation of genome-scale metabolic network reconstructions (GENREs) by harnessing the variability across ensembles of GENREs. We generated ensembles of GENREs consistent with experimental data, performed gene essentiality simulations with those ensembles, and applied machine learning to identify differences between networks that are associated with differences in simulated gene essentiality. We applied this approach to 29 bacterial species with suitable genomic and phenotypic data available, generating prioritized curation targets for all species. We summarize these findings across all 29 species to identify reactions and subsystems in the biochemical database used in the study that, if curated, would substantially reduce uncertainty in gene essentiality simulations for many organisms.

Repository Description

Folder	File	Description
bin		Analyses
	biolog_processing.ipynb	Reads, thresholds, and saves biolog data from Plata et al.
	curation_targets.ipynb	Calculates curation target metrics and plots Figures 4a-g, Figure 3d, part of Figure 3E, and Supplemental Figure 2a-d
	deep_gapfill.ipynb	Constructs ensembles by gapfilling using algorithm described in Figure 2a-b
	essentiality_dataset_comparison.ipynb	Reads, thresholds, and saves biolog data from Plata et al.
	generate_recons.ipynb	Create and save draft reconstructions from ModelSEED via Mackinac
	iterative_gapfill.ipynb	Previous version of deep_gapfill.ipynb. Not used in manuscript
	plot_knockouts.ipynb	Processes the knockout simulations, performs k-means clustering, performs random forest classification, and plots Figures 3a and 3c
	reaction_and_gene_knockouts.ipynb	Performs gene and reaction knockout simulations for all ensembles
	subsample_ensembles.ipynb	Performs ensemble subsampling and visualization for Figures 2c-d and Supplemental Figure 1a-b.
	test_train_split.ipynb	Verifies that random forest performance on a held out test data set is equivalent to the out-of-bag performance for all species
data		Data sources and lightly-processed versions of the data
	ModelSEED_Subsystems_20190409.tsv	subsystem information for modelseed downloaded from the modelseed biochemistry repository on April 9th, 2019. See data link [1].
	biolog_base_composition.csv	composition for biolog base medium (i.e., media without single C/N source). Determined from biolog plate media compositions used within ModelSEED. See data link [2].
	biolog_names_to_seed.csv	table for converting metabolite names as they appear in Plata et al. to ModelSEED identifiers. Pectin is excluded due to poor representation of polymers in constraint-based metabolic models.
	patric_gapfill_solutions.json	metadata for the gapfill solutions returned by Mackinac when constructing draft reconstructions for this study.
	patric_genomes.csv	PATRIC genome IDs for each species from Plata et al. Note: Plata et al. did not report strains used. Criteria for choosing a representative genome are in the methods text of the manuscript.
	patric_recon_info.json	additional metadata returned from Mackinac during reconstruction generation.
	plata_biolog_raw.csv	Growth data from Plata et al. Data are explained within the manuscript.
	plata_thresholded.csv	thresholded growth data. Species excluded and threshold for binary growth/nogrowth (True/False) are in biolog_processing.ipynb.
	universal_mundy.json	Universal model generated via Mackinac in generate_recons.ipynb.
	reactions_seed_20180809.tsv	summary of ModelSEED reactions used throughout, mainly for the "STATUS" column which indicates whether there are potential mass balance/charge issues in a reaction.
	seed_universal.json	Unused, previous version of the universal model.
	/essentiality/	Experimental gene essentiality data from the Online database of gene essentiality. Also includes PATRIC annotation files for mapping gene IDs from experimental files to PATRIC IDs within the models.
	/modelseed_models/	Output dradft models from generate_recons.ipynb.
	seed_universal.json	Unused, previous version of the universal model.
results		Analysis/simulation output and figures.
		Analysis/simulation output and figures.
	/classification_results/	Cluster assignments and feature imprtances for each species.
	/collapsed_essentiality/	Essentiality simulations, with genes collapsed when another gene had perfectly correlated knockout results.
	/collapsed_features/	Presence/absence of each reaction across each ensemble, with perfectly correlated reactions collapsed into a single feature
	/curation_target_plots/	Curation target plots for each species (e.g. Fig3a)
	/deep_ensembles/	Ensemble with ~1000 models/members for each species. Models are serialized as a pickle object that can be loaded within Python and used in Medusa.
	/ensembles/	Unused, previous version of /deep_ensembles/.
	/gene_knockout_pcoa_plots/	PCoA plots of ensemble gene knockouts, colored by cluster membership.
	/gene_knockout_plots/	Unused, heatmaps of ensemble gene knockouts.
	/gene_knockouts/	Tables with ensemble gene knockout results for each species.
	/precision_recall/	Precision, recall, and ROC plots for 3 species.
	/subsystems/	Plots and tables from subsystem analyses. Includes some unpublished plots.
	ensemble_biomass_subsampling.svg/.png	Supplemental figure 1a-b.
	ensemble_rxn_gene_subsampling.svg/.png	Figure 2c-d.

Data links: [1] https://github.com/ModelSEED/ModelSEEDDatabase/blob/dev/Biochemistry/Pathways/ModelSEED_Subsystems.tsv [2] https://github.com/ModelSEED/ModelSEEDDatabase/blob/dev/Media/media_list_with_meta.txt

Reproducibility Notes

deep_gapfill.ipynb has a loop that iterates over each species to generate an ensemble. For each species, this process took 30-120 minutes in my experience. As a result, notebook failures would often require restarting the process, so the code block checks for existing ensembles for each species. If fast generation of all ensembles is necessary, use of a cluster and reworking the notebook to act on individual species, rather than iterating over all species, is recommended. We opted not to take this approach to simplify the codebase and because I had access to reasonably hefty computing power.
generate_recons.ipynb requires that a user creates a PATRIC account to access the ModelSEED resource via Mackinac. You can create an account on the PATRIC site (https://www.patricbrc.org) but we recommend following whatever the current instructions are for creating a PATRIC account from the ModelSEED website (http://modelseed.org). When executing the notebook, you must replace my account name in the mackinac.get_token() function with your own account name, and then you will be prompted to enter your password within the notebook. This cannot be automated, to my knowledge.
Draft models generated with generate_recons.ipynb contain gapfilled reactions from ModelSEED during their complete media gapfilling step. During this step, flux through biomass is constrained to be positive and exchange reactions for any metabolite for which a transporter is annotated is enabled; reactions required for a feasible solution are added to the network. This process is sensitive to the version of the ModelSEED biochemistry used, thus the solutions returned maay change over time and be substantially different from the models I obtained for this manuscript.
I attempted to generate figures as final as possible in these notebooks; multiple figures were assembled with Adobe Illustrator in the final publication. Chemical structures were drawn in chemdraw.

Dependencies

cobrapy >= 0.13 mackinac (devel branch at the time -- see commit 83eced233406c27ed0d749e6d9f76d41350e3b09 in https://github.com/gregmedlock/mackinac) medusa (devel branch -- see commit e7c7eb88aea0b4dd8d08b996c441ab5f1758cd07 in https://github.com/gregmedlock/Medusa)

gregmedlock / ssl_ensembles

Project summary

Repository Description

Reproducibility Notes

Dependencies

About

Languages