strain2bfunc

Motivations

Strain-resolved analysis is in wide demand because strains that co-exist in a microbial community can have distinct functional and metabolic capabilities.

Microbiome association studies: strain-level profiles can distinguish host phenotypes that cannot be distinguished at the species level or higher taxonomic ranks
Strain-specific infection: strain identification helps physicians make accurate clinical diagnoses, including in cases involving bacterial resistance
Transmission: explore the transmission/translocation patterns of specific strains

Key challenges

The conventional metagenome method requires high sequencing coverage and is thus cost-prohibitive and resource-intensive.
Low-biomass samples make strain-level microbial identification even harder.

How it works

Strain2bFunc is a streamlined pipeline built with C++ that automatically processes each 2bRAD or shotgun metagenomic (WMS) sample through species-level profiling, strain-level profiling, function prediction, and downstream statistical analysis. The pipeline has 5 steps, numbered 0 to 4:

Step 0 (can be skipped): Species-level profiling using 2bRAD-M

Input data:

Sequence files list, 2bRAD or WGS data

Data format:

The sequence file list contains two tab-separated columns: the first column is the sample name, and the second column is the path to the sequence file for that sample. Both relative and absolute file paths are acceptable. For example:
```
  Gut1	example/10_simulated_reduced_metagenomes/fastq_files/OGU_Gut1_0.05M.fastq.gz
  Gut2	example/10_simulated_reduced_metagenomes/fastq_files/OGU_Gut2_0.05M.fastq.gz
  ...	...
  Oral4	example/10_simulated_reduced_metagenomes/fastq_files/OGU_Oral4_0.05M.fastq.gz
  Oral5	example/10_simulated_reduced_metagenomes/fastq_files/OGU_Oral5_0.05M.fastq.gz
```
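As an illustration of the two-column format above, the following Python sketch parses a sample list and reports entries whose files are missing. It is not part of the pipeline; the function names `parse_sample_list` and `missing_files` are hypothetical, and rejecting duplicate sample names is an assumption rather than documented pipeline behavior.

```python
from pathlib import Path


def parse_sample_list(text):
    """Parse a two-column, tab-separated sample list into {sample_name: file_path}."""
    samples = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {lineno}: expected 2 tab-separated columns, got {len(fields)}"
            )
        name, path = fields[0].strip(), fields[1].strip()
        if name in samples:
            # Assumption: sample names should be unique.
            raise ValueError(f"line {lineno}: duplicate sample name {name!r}")
        samples[name] = path
    return samples


def missing_files(samples, base_dir="."):
    """Return sample names whose sequence file does not exist.

    Relative paths are resolved against base_dir; absolute paths are used as-is.
    """
    base = Path(base_dir)
    return [name for name, p in samples.items() if not (base / p).is_file()]
```

Running such a check before starting the pipeline can catch a space used in place of a tab, or a mistyped path, before any profiling time is spent.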

Step 1 (can be skipped): Automatically select species

Input data:

2bRAD sequence files list;

Species-level abundance table generated by 2bRAD-M

Data format:

The 2bRAD sequence file list contains two tab-separated columns: the first column is the sample name, and the second column is the path to the 2bRAD sequence file for that sample. Both relative and absolute file paths are acceptable. For example:

  Gut1	example/10_simulated_reduced_metagenomes_results/Species_results/enzyme_result/Gut1.BcgI.fa.gz
  Gut2	example/10_simulated_reduced_metagenomes_results/Species_results/enzyme_result/Gut2.BcgI.fa.gz
  ...	...
  Oral4	example/10_simulated_reduced_metagenomes_results/Species_results/enzyme_result/Oral4.BcgI.fa.gz
  Oral5	example/10_simulated_reduced_metagenomes_results/Species_results/enzyme_result/Oral5.BcgI.fa.gz

The species-level abundance table generated by 2bRAD-M contains N columns (N = the number of samples + 7). Columns 1 to 7 give the taxonomic ranks of each identified taxon: 1 - "Kingdom"; 2 - "Phylum"; 3 - "Class"; 4 - "Order"; 5 - "Family"; 6 - "Genus"; 7 - "Species". Columns 8 to N are named by sample ID, and each holds the relative abundances of the taxa in that sample. For example:

  #Kingdom	Phylum	Class	Order	Family	Genus	Species	Gut1	Gut2	…	Oral4	Oral5
  Archaea	Methanobacteriota	Methanobacteria	Methanobacteriales	Methanobacteriaceae	Methanobrevibacter_A	Methanobrevibacter_A_smithii	0.323004011	0	…	0	0
  Bacteria	Acidobacteriota	Blastocatellia	Chloracidobacteriales	Chloracidobacteriaceae	Chloracidobacterium	Chloracidobacterium_thermophilum	0	0	…	0.026284875	0.010474906
  Bacteria	Actinobacteriota	Actinomycetia	Actinomycetales	Actinomycetaceae	Pauljensenia	Pauljensenia_cardiffensis	0	0	…	0.013618373	0
  Bacteria	Actinobacteriota	Actinomycetia	Actinomycetales	Bifidobacteriaceae	Bifidobacterium	Bifidobacterium_adolescentis	0.010172796	0	…	0.008552263	0.005529158
  Bacteria	Actinobacteriota	Actinomycetia	Actinomycetales	Bifidobacteriaceae	Bifidobacterium	Bifidobacterium_angulatum	0	0.006732314	…	0	0
  Bacteria	Actinobacteriota	Actinomycetia	Actinomycetales	Bifidobacteriaceae	Bifidobacterium	Bifidobacterium_bifidum	0	0.007728297	…	0	0
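To make the column layout concrete, here is a minimal Python sketch that reads such a table and lists species passing an abundance threshold like the one set with `-a`. The selection rule used here (a species is kept if it reaches the threshold in at least one sample) is an assumption for illustration; the rule the pipeline actually applies in Step 1 is not described in this document. The function name `select_species` is hypothetical.

```python
import csv
import io


def select_species(table_text, threshold=0.01):
    """Return species whose relative abundance reaches threshold in any sample.

    Assumes the layout described above: 7 taxonomy columns, then one
    column per sample, tab-separated, with a header row.
    """
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    header = next(reader)
    n_samples = len(header) - 7  # columns 8..N are samples
    selected = []
    for row in reader:
        if not row:
            continue  # skip empty lines
        species = row[6]  # column 7 is "Species"
        abundances = [float(x) for x in row[7:7 + n_samples]]
        # Assumed rule: keep the species if any sample reaches the threshold.
        if abundances and max(abundances) >= threshold:
            selected.append(species)
    return selected
```

Writing the returned names one per line gives a file in the species-list format expected by Step 2.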

Step 2: Strain-level profiling

Input data:

2bRAD sequence files list;

Species list (each row is a species)

Data format:

The 2bRAD sequence file list contains two tab-separated columns: the first column is the sample name, and the second column is the path to the 2bRAD sequence file for that sample. Both relative and absolute file paths are acceptable. For example:

  Gut1	example/10_simulated_reduced_metagenomes_results/Species_results/enzyme_result/Gut1.BcgI.fa.gz
  Gut2	example/10_simulated_reduced_metagenomes_results/Species_results/enzyme_result/Gut2.BcgI.fa.gz
  ...	...
  Oral4	example/10_simulated_reduced_metagenomes_results/Species_results/enzyme_result/Oral4.BcgI.fa.gz
  Oral5	example/10_simulated_reduced_metagenomes_results/Species_results/enzyme_result/Oral5.BcgI.fa.gz

The species list contains a single column, with one species per row. For example:

  Isoptericola_variabilis_A
  Lactiplantibacillus_plantarum

Step 3 (can be skipped): Function prediction

Input data:

Integrated into the pipeline; requires no extra input files.

Step 4: Data analysis

Input data:

Metadata file

Data format:

The metadata file describes the samples: when and where each was collected, what kind of sample it is, the properties of the environment or experimental condition it was taken from, and so on. Each row represents a sample, and each column represents a feature of the samples. The sample names in the metadata must match those in the sample list.
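The consistency requirement between metadata and sample list can be checked before a run. The Python sketch below is illustrative only and rests on assumptions this document does not confirm: that the metadata file is tab-separated, has a header row, and carries the sample name in its first column. The function name `check_metadata` is hypothetical.

```python
def check_metadata(sample_names, metadata_text, delimiter="\t"):
    """Compare sample names against the first column of a metadata table.

    Returns (missing_from_metadata, not_in_sample_list), both sorted.
    Assumes a header row and sample IDs in the first column.
    """
    lines = [ln for ln in metadata_text.splitlines() if ln.strip()]
    # Skip the assumed header row, then take the first field of each row.
    meta_ids = {ln.split(delimiter)[0].strip() for ln in lines[1:]}
    sample_set = set(sample_names)
    missing_from_metadata = sorted(sample_set - meta_ids)
    not_in_sample_list = sorted(meta_ids - sample_set)
    return missing_from_metadata, not_in_sample_list
```

Two empty lists mean the names agree; anything else points to the rows that need fixing.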

Installation

System requirements

Dependencies

All scripts in strain2bfunc are written in Perl, R, C++ and Shell. The program should work on Linux (e.g., CentOS, Ubuntu, and Windows 10 WSL) and macOS, provided that all required packages can be downloaded and installed. The pipeline uses OpenMP, a C/C++ parallel computing library. Most Linux distributions ship with OpenMP already installed. On macOS, we recommend installing a compiler that supports OpenMP with the Homebrew package manager:

brew install gcc

Disk space

Construction of a strain2bfunc standard database (i.e., 2bGDB) requires approximately 28 GB of disk space.

Memory usage

Running the standard pipeline requires less than 30 GB of RAM, and this also holds for multithreaded runs. For example, the BcgI-derived (default) database is 9.32 GB in size, and building it requires more RAM than that. In an early test, peak memory reached up to 29 GB.

Speed

Loading the 2bGDB takes about 30 minutes. Strain-level profiling of a typical gut metagenome takes about 40 minutes.

Download the pipeline

Clone the latest version from GitHub:

git clone https://github.com/yfz-96/strain2bfunc/  

cd strain2bfunc

Install Strain2bFunc pipeline in a conda environment

Conda installation

Miniconda provides the conda environment and package manager, and is the recommended way to install Strain2bFunc.

Create a conda environment for Strain2bFunc pipeline:

After installing Miniconda and opening a new terminal, make sure you’re running the latest version of conda:

conda update conda

Then create a conda environment from the yml file strain2bfunc.yml:

conda env create -n strain2bfunc --file strain2bfunc.yml

Activate the "strain2bfunc" conda environment by running the following command:

conda activate strain2bfunc

Install the Strain2bFunc pipeline with a single command:

source install.sh

Strain2bFunc pipeline tutorial

Overview

Usage

The pipeline needs to be executed in the "strain2bfunc" conda environment

Activate the "strain2bfunc" conda environment by running the following command:

conda activate strain2bfunc

Perform the pipeline using one command

The tools can be run directly from the Linux/macOS command line with parameters. To see all available parameters, run the command with the ‘-h’ option, e.g.:

Strain2bFunc-pipeline -h

Then, you can see the detailed usage below.

Welcome to Strain2bFunc Pipeline
Version: 1.0
Usage:
Strain2bFunc-pipeline [Option] Value
Options:

	[Composition profiling input and parameters]
	Start from step0: Species-level profiling
	  -i Sequence files list, 2bRAD or WGS data [Conflicts with -l, -T and -L]
	  -f The acceptable formats of an input sequencing data file. The file path should be also listed in the sample list file [Optional for -i]
	    [1] generic genome data in a fasta format
	    [2] shotgun metagenomic data in a fastq format (either SE or PE platform is accepted)
	    [3] 2bRAD data from a SE sequencing platform in a fastq format
	    [4] 2bRAD data from a PE sequencing platform in a fastq format
	  -a the abundance threshold of species for Strain2bFunc analysis, default is 0.01 [Optional for -i and -T]
	or
	Start from step1: Automatically select species
	  -l 2bRAD sequence files list [Conflicts with -i]
	  -T (upper) Input Species-level abundance table generated by 2bRAD-M [Conflicts with -i and -L]
	  -a the abundance threshold of species for Strain2bFunc analysis, default is 0.01 [Optional for -i and -T]
	or
	Start from step2: Strain-level profiling
	  -l 2bRAD sequence files list [Conflicts with -i]
	  -L (upper) Species list (each row is a species) [Conflicts with -i and -T]
	  -M (upper) Input the Mode for strain-level analysis, 0 for multiple-species separated analysis, 1 for multiple-species merged analysis, default is 0 [Optional for -L]

	[Functional prediction parameter]
	  -F (upper) Functional analysis, T(rue) or F(alse), default is T

	[Statistic input and parameters]
	  -m Meta data file [Required]
	  -w Taxonomical distance type, 0: Bray-Curtis, 1: Euclidean, 2: Jaccard, default is 0
	  -C (upper) Cluster number, default is 2
	  -G (upper) Network analysis edge threshold, default is 0.5

	[Output options]
	  -o Output path, default is "default_out"

	[Other options]
	  -t Number of threads, default is auto
	  -h help
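The usage text above defines three mutually exclusive starting points (`-i`; `-l` with `-T`; `-l` with `-L`). As a rough illustration, the Python sketch below assembles an argument list for `Strain2bFunc-pipeline` and rejects the conflicting combinations named in the help text. The function `build_command` and its keyword names are hypothetical, and the checks only mirror the conflicts listed above; the pipeline may enforce further rules.

```python
def build_command(metadata, seq_list=None, rad_list=None, abundance_table=None,
                  species_list=None, threshold=None, out_dir=None):
    """Build an argv list for Strain2bFunc-pipeline from the documented options."""
    if not metadata:
        raise ValueError("-m (metadata file) is required")
    if seq_list and (rad_list or abundance_table or species_list):
        raise ValueError("-i conflicts with -l, -T and -L")
    if abundance_table and species_list:
        raise ValueError("-T conflicts with -L")

    cmd = ["Strain2bFunc-pipeline"]
    if seq_list:  # start from step 0
        cmd += ["-i", seq_list]
    elif rad_list and abundance_table:  # start from step 1
        cmd += ["-l", rad_list, "-T", abundance_table]
    elif rad_list and species_list:  # start from step 2
        cmd += ["-l", rad_list, "-L", species_list]
    else:
        raise ValueError("give -i, or -l with -T, or -l with -L")

    if threshold is not None:
        cmd += ["-a", str(threshold)]
    cmd += ["-m", metadata]
    if out_dir:
        cmd += ["-o", out_dir]
    return cmd
```

The returned list can be passed to `subprocess.run` inside the activated "strain2bfunc" conda environment; options not covered here, such as `-f`, `-M` or `-t`, would be appended in the same way.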

Examples

The example dataset, 10 simulated reduced metagenomes (5 from gut and 5 from oral), can be found in the “example” folder. Run the entire pipeline on it as follows (the example command passes -a 0.0001 instead of the default abundance threshold of 0.01):

sh example/10_simulated_reduced_metagenomes.sh

Strain2bFunc-pipeline -i example/10_simulated_reduced_metagenomes/sample_list.txt -f 2 -a 0.0001 -m example/10_simulated_reduced_metagenomes/meta.txt -o example/10_simulated_reduced_metagenomes_results

Results

The pipeline then creates an output directory named “10_simulated_reduced_metagenomes_results” in the “example” directory. This directory contains five subdirectories and four text files.

Subdirectories

Species_results: the species-level profiling results using 2bRAD-M

strain_results: the strain-level profiling results of each species, including N subdirectories (N = the number of species).

strain_data_analysis_results: the abundance distribution plot, alpha diversity analysis, beta diversity analysis, distance calculation, clustering based on the distance matrix, and marker selection based on Random Forest models.

ko_results: the predicted KO (KEGG Orthology) relative abundance table

function_data_analysis_results: the abundance distribution plot, alpha diversity analysis, beta diversity analysis, distance calculation, clustering based on the distance matrix, and marker selection based on Random Forest models.
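The distance calculation in these results is controlled by the `-w` option (Bray-Curtis, Euclidean, or Jaccard). The Python sketch below shows the textbook definitions of the three measures on two abundance vectors. It is an illustration of the standard formulas, not the pipeline's own implementation, and treating Jaccard as a presence/absence measure is an assumption here.

```python
import math


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    denom = sum(a) + sum(b)
    return sum(abs(x - y) for x, y in zip(a, b)) / denom if denom else 0.0


def euclidean(a, b):
    """Euclidean distance between two abundance vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard(a, b):
    """Jaccard distance on presence/absence: 1 - |A and B| / |A or B|."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, y in enumerate(b) if y > 0}
    union = present_a | present_b
    return 1 - len(present_a & present_b) / len(union) if union else 0.0
```

Bray-Curtis and Euclidean respond to abundance differences, whereas Jaccard as written only reflects which taxa are shared, so the choice of `-w` changes what the downstream clustering emphasizes.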

FAQ

Q1. When installing gcc on macOS with "brew install gcc", you may see the following warning:

Warning: gcc 10.2.0_4 is already installed and up-to-date.
To reinstall 10.2.0_4, run:
  brew reinstall gcc

A1. Run the command suggested in the warning:

brew reinstall gcc

Q2. When installing gcc on macOS with "brew install gcc" or "brew reinstall gcc", you may see the following error:

Error: Cannot install in Homebrew on ARM processor in Intel default prefix (/usr/local)!
Please create a new installation in /opt/homebrew using one of the
"Alternative Installs" from:
  https://docs.brew.sh/Installation
You can migrate your previously installed formula list with:
  brew bundle dump

A2. Follow the instructions at https://docs.brew.sh/Installation and run the following commands to (re)install Homebrew:

git clone https://github.com/Homebrew/brew homebrew
eval "$(homebrew/bin/brew shellenv)"
brew update --force --quiet
chmod -R go-w "$(brew --prefix)/share/zsh"

Then install gcc using:

brew install gcc

Citation

Sun, Z., Huang, S., Zhu, P. et al. Species-resolved sequencing of low-biomass or degraded microbiomes using 2bRAD-M. Genome Biol 23, 36 (2022). https://doi.org/10.1186/s13059-021-02576-9
Huang S, Zhang Y, Liu J, et al. IDDF2023-ABS-0267 Strain-resolved taxonomic profiling and functional prediction of human microbiota using Strain2bFunc. Gut 2023;72:A120-A123.

Acknowledgements

This work is supported by XXX.
