Strain-resolved analysis is in wide demand because strains of the same species can co-exist in a microbial community while differing in their functional and metabolic capabilities. It is useful for:
- Microbiome association studies: strain-level profiles can distinguish host phenotypes that cannot be separated at the species level or at higher taxonomic ranks
- Strain-specific infection: helps physicians make accurate clinical diagnoses and is relevant to bacterial resistance to infection
- Transmission: explores the transmission/translocation patterns of specific strains
Current approaches face two difficulties:
- Conventional metagenomic methods require high sequencing coverage, which makes them cost-prohibitive and resource-intensive.
- Low-biomass samples make strain-level microbial identification harder.
Strain2bFunc is a streamlined pipeline built with C++ that automatically processes each 2bRAD/WMS sample through species-level profiling, strain-level profiling, and downstream statistical analysis. The pipeline contains 5 steps:
- Step 0 (can be skipped): Species-level profiling using 2bRAD-M
Input data:
Sequence files list, 2bRAD or WGS data
Data format:
The sequence file list contains two columns: the first column represents the sample name, and the second column represents the corresponding sequence file path for each sample. Both relative and absolute file paths are acceptable. The columns are separated by the tab key. For example:
Gut1	example/10_simulated_reduced_metagenomes/fastq_files/OGU_Gut1_0.05M.fastq.gz
Gut2	example/10_simulated_reduced_metagenomes/fastq_files/OGU_Gut2_0.05M.fastq.gz
...	...
Oral4	example/10_simulated_reduced_metagenomes/fastq_files/OGU_Oral4_0.05M.fastq.gz
Oral5	example/10_simulated_reduced_metagenomes/fastq_files/OGU_Oral5_0.05M.fastq.gz
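Before launching a run, it can help to check that the list really has two tab-separated columns and that every listed file exists. The following is a minimal sketch of such a check, not part of Strain2bFunc; the function name `read_sample_list` is hypothetical.

```python
import csv
import io
from pathlib import Path


def read_sample_list(handle):
    """Parse a two-column, tab-separated sample list into {sample name: path}."""
    samples = {}
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not any(cell.strip() for cell in row):
            continue  # ignore blank lines
        if len(row) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 tab-separated columns, got {len(row)}"
            )
        name, path = (cell.strip() for cell in row)
        if name in samples:
            raise ValueError(f"line {line_no}: duplicate sample name {name!r}")
        samples[name] = Path(path)
    return samples


# Two rows copied from the example list above.
example = (
    "Gut1\texample/10_simulated_reduced_metagenomes/fastq_files/OGU_Gut1_0.05M.fastq.gz\n"
    "Gut2\texample/10_simulated_reduced_metagenomes/fastq_files/OGU_Gut2_0.05M.fastq.gz\n"
)
samples = read_sample_list(io.StringIO(example))
# Paths are resolved relative to the current working directory.
missing = [name for name, path in samples.items() if not path.exists()]
print("samples:", list(samples), "| files not found:", missing)
```

For a real list, pass an open file instead of the in-memory string, e.g. `read_sample_list(open("sample_list.txt"))`.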
- Step 1 (can be skipped): Automatically select species
Input data:
- 2bRAD sequence files list;
- Species-level abundance table generated by 2bRAD-M
Data format:
- 2bRAD sequence files list contains two columns: the first column represents the sample name, and the second column represents the corresponding 2bRAD sequence file path for each sample. Both relative and absolute file paths are acceptable. The columns are separated by the tab key. For example:
Gut1	example/10_simulated_reduced_metagenomes_results/Species_results/enzyme_result/Gut1.BcgI.fa.gz
Gut2	example/10_simulated_reduced_metagenomes_results/Species_results/enzyme_result/Gut2.BcgI.fa.gz
...	...
Oral4	example/10_simulated_reduced_metagenomes_results/Species_results/enzyme_result/Oral4.BcgI.fa.gz
Oral5	example/10_simulated_reduced_metagenomes_results/Species_results/enzyme_result/Oral5.BcgI.fa.gz
- The species-level abundance table generated by 2bRAD-M contains N columns (N = number of samples + 7). From left to right: columns 1 to 7 give the taxonomic ranks of each identified taxon (1 - "Kingdom"; 2 - "Phylum"; 3 - "Class"; 4 - "Order"; 5 - "Family"; 6 - "Genus"; 7 - "Species"); columns 8 to N are each named after a sample ID in the study and hold the relative abundance of each taxon in that sample. For example:
#Kingdom	Phylum	Class	Order	Family	Genus	Species	Gut1	Gut2	…	Oral4	Oral5
Archaea	Methanobacteriota	Methanobacteria	Methanobacteriales	Methanobacteriaceae	Methanobrevibacter_A	Methanobrevibacter_A_smithii	0.323004011	0	…	0	0
Bacteria	Acidobacteriota	Blastocatellia	Chloracidobacteriales	Chloracidobacteriaceae	Chloracidobacterium	Chloracidobacterium_thermophilum	0	0	…	0.026284875	0.010474906
Bacteria	Actinobacteriota	Actinomycetia	Actinomycetales	Actinomycetaceae	Pauljensenia	Pauljensenia_cardiffensis	0	0	…	0.013618373	0
Bacteria	Actinobacteriota	Actinomycetia	Actinomycetales	Bifidobacteriaceae	Bifidobacterium	Bifidobacterium_adolescentis	0.010172796	0	…	0.008552263	0.005529158
Bacteria	Actinobacteriota	Actinomycetia	Actinomycetales	Bifidobacteriaceae	Bifidobacterium	Bifidobacterium_angulatum	0	0.006732314	…	0	0
Bacteria	Actinobacteriota	Actinomycetia	Actinomycetales	Bifidobacteriaceae	Bifidobacterium	Bifidobacterium_bifidum	0	0.007728297	…	0	0
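To make the table layout concrete, here is a small sketch that reads such a table and lists species that pass an abundance threshold. It is only an illustration: the rule used here (a species is kept if it reaches the threshold in at least one sample) is an assumption, and the pipeline's actual selection logic may differ. The function name `select_species` is hypothetical; the rows are a subset of the example table, restricted to the Gut1 and Oral4 columns.

```python
import csv
import io

RANK_COLUMNS = 7  # Kingdom, Phylum, Class, Order, Family, Genus, Species


def select_species(handle, threshold=0.01):
    """Return (sample IDs, species reaching `threshold` in at least one sample).

    The "at least one sample" rule is an assumption made for this sketch.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if len(header) <= RANK_COLUMNS:
        raise ValueError("table has no sample columns")
    selected = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) != len(header):
            raise ValueError(f"expected {len(header)} columns, got {len(row)}")
        species = row[RANK_COLUMNS - 1]
        abundances = [float(value) for value in row[RANK_COLUMNS:]]
        if max(abundances) >= threshold:
            selected.append(species)
    return header[RANK_COLUMNS:], selected


ROWS = [
    ["#Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species",
     "Gut1", "Oral4"],
    ["Archaea", "Methanobacteriota", "Methanobacteria", "Methanobacteriales",
     "Methanobacteriaceae", "Methanobrevibacter_A",
     "Methanobrevibacter_A_smithii", "0.323004011", "0"],
    ["Bacteria", "Actinobacteriota", "Actinomycetia", "Actinomycetales",
     "Bifidobacteriaceae", "Bifidobacterium",
     "Bifidobacterium_adolescentis", "0.010172796", "0.008552263"],
    ["Bacteria", "Actinobacteriota", "Actinomycetia", "Actinomycetales",
     "Bifidobacteriaceae", "Bifidobacterium",
     "Bifidobacterium_angulatum", "0", "0"],
]
TABLE = "\n".join("\t".join(row) for row in ROWS) + "\n"

sample_ids, species = select_species(io.StringIO(TABLE), threshold=0.01)
print(sample_ids, species)
```

Writing the returned species names one per line would give a file in the species-list format used by Step 2.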
- Step 2: Strain-level profiling
Input data:
- 2bRAD sequence files list;
- Species list (each row is a species)
Data format:
- 2bRAD sequence files list contains two columns: the first column represents the sample name, and the second column represents the corresponding 2bRAD sequence file path for each sample. Both relative and absolute file paths are acceptable. The columns are separated by the tab key. For example:
Gut1	example/10_simulated_reduced_metagenomes_results/Species_results/enzyme_result/Gut1.BcgI.fa.gz
Gut2	example/10_simulated_reduced_metagenomes_results/Species_results/enzyme_result/Gut2.BcgI.fa.gz
...	...
Oral4	example/10_simulated_reduced_metagenomes_results/Species_results/enzyme_result/Oral4.BcgI.fa.gz
Oral5	example/10_simulated_reduced_metagenomes_results/Species_results/enzyme_result/Oral5.BcgI.fa.gz
- Species list contains a single column; each row is one species. For example:
Isoptericola_variabilis_A
Lactiplantibacillus_plantarum
- Step 3: Function prediction (can be skipped)
Input data:
Integrated into the pipeline, requiring NO extra input files
- Step 4: Data analysis
Input data:
Metadata file
Data format:
The metadata file describes the samples: when and where each one was collected, what kind of sample it is, the properties of the environment or experimental condition it was taken from, and so on. Each row represents a sample and each column represents a feature of the samples. The sample names in the metadata must match those in the sample list.
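Mismatched sample names are easy to miss by eye, so a quick comparison before running the pipeline can save time. The sketch below assumes a tab-separated metadata file with one header row and the sample name in the first column; the text above does not specify this layout, so adjust as needed. The column names `SampleID` and `BodySite` and the function name are hypothetical.

```python
import csv
import io


def compare_sample_names(sample_names, metadata_handle):
    """Return (names missing from the metadata, names only in the metadata).

    Assumes tab-separated metadata, a header row, and sample names in column 1.
    """
    reader = csv.reader(metadata_handle, delimiter="\t")
    next(reader, None)  # skip the header row
    meta_names = {row[0].strip() for row in reader if row and row[0].strip()}
    listed = set(sample_names)
    return sorted(listed - meta_names), sorted(meta_names - listed)


# Hypothetical metadata for three samples; Oral5 is absent from the sample list.
metadata = "SampleID\tBodySite\nGut1\tgut\nOral4\toral\nOral5\toral\n"
missing, extra = compare_sample_names(["Gut1", "Gut2", "Oral4"], io.StringIO(metadata))
print("not in metadata:", missing, "| not in sample list:", extra)
```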
All scripts in strain2bfunc are written in Perl, R, C++ and Shell. The program should run on Linux (e.g., CentOS, Ubuntu, or WSL on Windows 10) and on macOS, provided that all required packages can be downloaded and installed. OpenMP is the C/C++ parallel computing library; most Linux distributions ship with it already installed. On macOS, we recommend installing a compiler that supports OpenMP with the Homebrew package manager:
brew install gcc
Construction of a strain2bfunc standard database (i.e., 2bGDB) requires approximately 28 GB of disk space.
Running the standard pipeline requires less than 30 GB of RAM and supports multithreading. For example, the default (BcgI-derived) database is 9.32 GB, and building it needs more RAM than that; in an early test, peak memory reached 29 GB.
Loading the 2bGDB takes about 30 minutes. For a typical gut metagenome, strain-level profiling takes about 40 minutes.
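A quick way to confirm that a target location has room for the database is to query free disk space first. This is a small sketch using only the Python standard library; it treats 1 GB as 10^9 bytes, which is an assumption about how the 28 GB figure is measured, and the function name is hypothetical.

```python
import shutil

REQUIRED_DB_GB = 28  # approximate 2bGDB size quoted above


def has_free_space(path=".", required_gb=REQUIRED_DB_GB):
    """Return (enough space?, free space in GB) for the filesystem holding `path`."""
    free_gb = shutil.disk_usage(path).free / 1e9  # assumes 1 GB = 10**9 bytes
    return free_gb >= required_gb, free_gb


ok, free_gb = has_free_space(".")
print(f"free: {free_gb:.1f} GB, enough for 2bGDB: {ok}")
```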
Clone the latest version from GitHub:
git clone https://github.com/yfz-96/strain2bfunc/
cd strain2bfunc
Miniconda provides the conda environment and package manager, and is the recommended way to install Strain2bFunc.
After installing Miniconda and opening a new terminal, make sure you’re running the latest version of conda:
conda update conda
Once you have Miniconda installed, create a conda environment with the yml file strain2bfunc.yml.
conda env create -n strain2bfunc --file strain2bfunc.yml
Activate the "strain2bfunc" conda environment by running the following command:
conda activate strain2bfunc
Install the Strain2bFunc pipeline with a single command:
source install.sh
Activate the "strain2bfunc" conda environment by running the following command:
conda activate strain2bfunc
The tools can be run directly from the Linux/macOS command line with parameters. To see all available parameters, run the command with the '-h' option, e.g.
Strain2bFunc-pipeline -h
This prints the detailed usage:
Welcome to Strain2bFunc Pipeline
Version: 1.0
Usage:
Strain2bFunc-pipeline [Option] Value
Options:
[Composition profiling input and parameters]
Start from step0: Species-level profiling
-i Sequence files list, 2bRAD or WGS data [Conflicts with -l, -T and -L]
-f The acceptable formats of an input sequencing data file. The file path should be also listed in the sample list file [Optional for -i]
[1] generic genome data in a fasta format
[2] shotgun metagenomic data in a fastq format (either SE or PE platform is accepted)
[3] 2bRAD data from a SE sequencing platform in a fastq format
[4] 2bRAD data from a PE sequencing platform in a fastq format
-a the abundance threshold of species for Strain2bFunc analysis, default is 0.01 [Optional for -i and -T]
or
Start from step1: Automatically select species
-l 2bRAD sequence files list [Conflicts with -i]
-T (upper) Input Species-level abundance table generated by 2bRAD-M [Conflicts with -i and -L]
-a the abundance threshold of species for Strain2bFunc analysis, default is 0.01 [Optional for -i and -T]
or
Start from step2: Strain-level profiling
-l 2bRAD sequence files list [Conflicts with -i]
-L (upper) Species list (each row is a species) [Conflicts with -i and -T]
-M (upper) Input the Mode for strain-level analysis, 0 for multiple-species separated analysis, 1 for multiple-species merged analysis, default is 0 [Optional for -L]
[Functional prediction parameter]
-F (upper) Functional analysis, T(rue) or F(alse), default is T
[Statistic input and parameters]
-m Meta data file [Required]
-w Taxonomical distance type, 0: Bray-Curtis, 1: Euclidean, 2: Jaccard, default is 0
-C (upper) Cluster number, default is 2
-G (upper) Network analysis edge threshold, default is 0.5
[Output options]
-o Output path, default is "default_out"
[Other options]
-t Number of threads, default is auto
-h help
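The three entry points and their conflicting options can be summarised in a few lines of logic. The sketch below is derived only from the usage text above and is not the pipeline's own argument parser; in particular, treating `-l` as mandatory alongside `-T` or `-L` is inferred from how the options are grouped. The function name `entry_step` is hypothetical.

```python
def entry_step(flags):
    """Return the starting step (0, 1 or 2) implied by a collection of option flags."""
    flags = set(flags)
    if "-m" not in flags:
        raise ValueError("-m (metadata file) is required")
    if "-i" in flags:
        clash = sorted(flags & {"-l", "-T", "-L"})
        if clash:
            raise ValueError(f"-i conflicts with {', '.join(clash)}")
        return 0  # start from species-level profiling
    if {"-T", "-L"} <= flags:
        raise ValueError("-T conflicts with -L")
    if "-l" in flags and "-T" in flags:
        return 1  # start from automatic species selection
    if "-l" in flags and "-L" in flags:
        return 2  # start from strain-level profiling
    raise ValueError("expected -i, or -l with -T, or -l with -L")


print(entry_step(["-i", "-f", "-a", "-m", "-o"]))
print(entry_step(["-l", "-T", "-a", "-m"]))
print(entry_step(["-l", "-L", "-M", "-m"]))
```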
- The example dataset, 10 simulated reduced metagenomes (5 gut and 5 oral), can be found in the “example” folder. Run the entire pipeline on it with:
sh example/10_simulated_reduced_metagenomes.sh
or
Strain2bFunc-pipeline -i example/10_simulated_reduced_metagenomes/sample_list.txt -f 2 -a 0.0001 -m example/10_simulated_reduced_metagenomes/meta.txt -o example/10_simulated_reduced_metagenomes_results
- Results
The pipeline then creates an output directory named “10_simulated_reduced_metagenomes_results” in the “example” directory. It contains five subdirectories and four text files.
Subdirectories
- Species_results: the species-level profiling results from 2bRAD-M
- strain_results: the strain-level profiling results for each species, in N subdirectories (N = the number of species)
- strain_data_analysis_results: abundance distribution plots, alpha diversity analysis, beta diversity analysis, distance calculation, clustering based on the distance matrix, and marker selection based on Random Forest models
- ko_results: the predicted KO relative abundance table
- function_data_analysis_results: abundance distribution plots, alpha diversity analysis, beta diversity analysis, distance calculation, clustering based on the distance matrix, and marker selection based on Random Forest models
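After a run, a short check can confirm that the subdirectories listed above were created. This is a minimal sketch that only tests for their presence; the function name is hypothetical, and the function-related directories may be absent if function prediction was skipped.

```python
import tempfile
from pathlib import Path

EXPECTED_SUBDIRS = (
    "Species_results",
    "strain_results",
    "strain_data_analysis_results",
    "ko_results",
    "function_data_analysis_results",
)


def missing_subdirs(output_dir):
    """Return the expected subdirectories that are absent from `output_dir`."""
    root = Path(output_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


# Demonstration on a temporary directory that mimics a partial run.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Species_results", "strain_results"):
        (Path(tmp) / name).mkdir()
    absent = missing_subdirs(tmp)
print("missing:", absent)
```

On real output, call `missing_subdirs("example/10_simulated_reduced_metagenomes_results")`.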
Q1. When you install gcc on macOS using "brew install gcc", you may see this warning:
Warning: gcc 10.2.0_4 is already installed and up-to-date.
To reinstall 10.2.0_4, run:
brew reinstall gcc
A1. Run the command suggested in the warning:
brew reinstall gcc
Q2. When you install gcc on macOS using "brew install gcc" or "brew reinstall gcc", you may see this error:
Error: Cannot install in Homebrew on ARM processor in Intel default prefix (/usr/local)!
Please create a new installation in /opt/homebrew using one of the
"Alternative Installs" from:
https://docs.brew.sh/Installation
You can migrate your previously installed formula list with:
brew bundle dump
A2. Follow https://docs.brew.sh/Installation and run the following commands to (re)install Homebrew:
git clone https://github.com/Homebrew/brew homebrew
eval "$(homebrew/bin/brew shellenv)"
brew update --force --quiet
chmod -R go-w "$(brew --prefix)/share/zsh"
Then install gcc using:
brew install gcc
- Sun, Z., Huang, S., Zhu, P. et al. Species-resolved sequencing of low-biomass or degraded microbiomes using 2bRAD-M. Genome Biol 23, 36 (2022). https://doi.org/10.1186/s13059-021-02576-9
- Huang, S., Zhang, Y., Liu, J. et al. IDDF2023-ABS-0267 Strain-resolved taxonomic profiling and functional prediction of human microbiota using Strain2bFunc. Gut 72, A120-A123 (2023).
This work is supported by XXX.