iMAP: Integrated Microbiome Analysis Pipeline

Financial support for developing the iMAP repository ended in October 2018. The maintainer continues to contribute to this repo on a volunteer basis, in support of the microbiome research community. The primary focus is to make iMAP highly reproducible and more user-friendly. Thank you for your patience.

Version: iMAP v1.0 (Pre-Release)

iMAP v1.0 is at a preliminary phase. It currently lacks significant aspects of reproducibility compared to modern bioinformatics workflow management systems. Our plan is to integrate iMAP with rule-based workflow code so that it can be deployed across multiple platforms without major modifications.

Citation

Teresia M. Buza, Triza Tonui, Francesca Stomeo, Christian Tiambo, Robab Katani, Megan Schilling, Beatus Lyimo, Paul Gwakisa, Isabella M. Cattadori, Joram Buza and Vivek Kapur. iMAP: an integrated bioinformatics and visualization pipeline for microbiome data analysis. BMC Bioinformatics (2019) 20:374.

Supported Analyses

Profiling of sample metadata
Pre-processing and quality control of paired-end reads
Sequence processing and classification

mothur (default)
- Phylotype-based method (works for any dataset size).
- OTU-based method (works best for small datasets).
- Phylogeny-based method (works best for small datasets).
QIIME2

Transformation of OTU and taxa results into data structure.
Diversity and statistical analysis, and visualization.
Phylogenetic analysis and interactive tree annotation
Generating web-based progress reports
and more...

Primary iMAP file folders

Requirements

The first step is to gather all the materials needed for implementing the iMAP pipeline as described in Table 1. Most iMAP dependencies are executable and are already placed in the PATH using docker, so users should be able to launch them directly from the command line of the specified container.

Non-Docker Image Users

Read README2.md: README2 guides the implementation of iMAP directly on a specific platform, including Unix-Linux, Mac OS X, and Windows 10. Please note that this is work-in-progress.

Table 1: List of required materials for running iMAP pipeline

Requirement	Description	Location	Remarks
Raw data	Demultiplexed reads in FASTQ format (.gz) with primers and barcodes removed	data/raw	fastq.gz
Sample metadata	File name: samplemetadata.tsv. A tab-separated file linking sample identifiers to the variables	data/metadata	Format: mothur or QIIME2
Mapping files	For linking sample IDs to the data files	data/metadata	Mothur-formatted & QIIME2-formatted

Software (Mostly available via pre-built docker images)

Docker	For creating Docker containers that wrap up iMAP dependencies.	Docker Community Edition (CE)	Link
Seqkit	For inspecting the raw data format and simple statistics.	docker images: readqctools	Link
BBduk.sh via BBMap	For trimming poor quality reads and removing phiX contamination	Auto-loaded at preprocessing step	Link
MultiQC	For summarizing FastQC output	docker images: readqctools	Link
Mothur	For sequence processing, taxonomy assignment and preliminary analysis	docker images: mothur:v1.41.3	Link
QIIME2	For sequence processing, taxonomy assignment and preliminary analysis	docker images: qiime2core:v2019.1	Link
R	For statistical analysis and visualization	docker image:rpackages:v3.5.2	Link
iTOL	For displaying, annotating and managing phylogenetic trees	Online	Link

Reference databases: Any of the following databases can be used.

SILVA NR (mothur)	Mothur-formatted rRNA alignments	data/references	Link
SILVA NR (QIIME2)	QIIME2-formatted classifiers	data/qiime2	Link
SILVA (seed)	Mothur-formatted rRNA alignments	data/references	Link
SILVA(de-gapped)	mothur-formatted classifiers	data/references	Auto-Generated
RDP	Mothur-formatted classifiers	data/references	Link
Greengenes	Mothur-formatted classifiers	data/references	Link
Greengenes	QIIME2-formatted classifiers	data/qiime2	Link
EzBioCloud	Mothur-formatted classifiers	data/references	Link
Custom classifiers	Any manually built classifiers. Highly recommended when studying a specific group of known microbes.	data/references	Manually-built

Getting Started

Running a shell command as root or system administrator

Some systems, such as Ubuntu and other Linux distributions, may require users to have administrative rights. In such cases:

Put sudo in front of the command, and enter your password when prompted.
Note that the system is often configured to not ask again for a few minutes allowing you to run several commands in succession.

Download iMAP repository

git clone https://github.com/tmbuza/iMAP.git

# OR

curl -LOk https://github.com/tmbuza/iMAP/archive/master.zip
unzip master.zip
mv iMAP-master iMAP
rm -rf master.zip

# OR

wget --no-check-certificate https://github.com/tmbuza/iMAP/archive/master.zip 
unzip master.zip
mv iMAP-master iMAP
rm -rf master.zip
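
The three download routes above differ only in which tool is installed. A minimal sketch of picking whichever one is available is shown below; the function names `first_available` and `download_imap` are hypothetical helpers and not part of iMAP.

```shell
# Print the first command from the argument list that exists on this system.
first_available() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    return 1
}

# Hypothetical wrapper: fetch iMAP with whichever tool is present.
# The zip routes still need the unzip/mv/rm steps shown above.
download_imap() {
    case "$(first_available git curl wget)" in
        git)  git clone https://github.com/tmbuza/iMAP.git ;;
        curl) curl -LOk https://github.com/tmbuza/iMAP/archive/master.zip ;;
        wget) wget --no-check-certificate https://github.com/tmbuza/iMAP/archive/master.zip ;;
        *)    echo "install git, curl, or wget first" >&2; return 1 ;;
    esac
}
```

Calling `download_imap` from the directory where iMAP should live runs exactly one of the three routes.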

Add data to designated folders

File formats

Metadata:
- samplemetadata.tsv
Mapping files:
- Mothur-format: qced.files
- QIIME2-format: manifest.txt
Variable files (Mothur-based preliminary analysis).
- Variable 1: var1.design
- Variable 2: var2.design
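
Because the sample metadata is a tab-separated file, a malformed row (a missing or extra tab) is a common source of downstream failures. The sketch below checks that every row has the same number of fields as the header; `check_tsv` is a hypothetical helper, and the check is a general TSV sanity test rather than anything iMAP itself performs.

```shell
# Report rows whose field count differs from the header row.
# Exits non-zero if any row is inconsistent.
check_tsv() {
    awk -F '\t' '
        NR == 1 { n = NF; next }
        NF != n { printf "line %d: expected %d fields, found %d\n", NR, n, NF; bad = 1 }
        END     { exit bad + 0 }
    ' "$1"
}
```

For example, `check_tsv iMAP/data/metadata/samplemetadata.tsv` prints nothing and returns success when the file is consistent.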

Data for optional testing of iMAP

The following command copies the required data files from iMAP/resources/ and places them in their respective folders, as shown in Table 1 above.

bash iMAP/code/demo_data.bash
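
After copying data (demo or your own), it can help to confirm that the inputs listed in Table 1 are where the pipeline expects them. The following is a minimal sketch; `check_imap_inputs` is a hypothetical helper, and it only checks the raw reads and the sample metadata file named in Table 1.

```shell
# Check for FASTQ files in data/raw and a non-empty samplemetadata.tsv.
# ROOT defaults to ./iMAP; returns non-zero if anything is missing.
check_imap_inputs() {
    root="${1:-iMAP}"
    missing=0
    if ! ls "$root"/data/raw/*.fastq.gz >/dev/null 2>&1; then
        echo "missing: $root/data/raw/*.fastq.gz"
        missing=1
    fi
    if [ ! -s "$root/data/metadata/samplemetadata.tsv" ]; then
        echo "missing: $root/data/metadata/samplemetadata.tsv"
        missing=1
    fi
    return "$missing"
}
```

Run `check_imap_inputs` from the directory that contains the iMAP folder; no output means both inputs were found.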

User Options

Users who want to change the default settings may do so using any text editor. The table below shows the location of default parameters that may be altered.

Parameter to change	File Path	Filename	Default
Phred score	iMAP/code/preprocessing	04_get_highscore_reads.bash	trimq=25
Min Contig length	iMAP/code/seqprocessing	01_assemble_paired_reads.batch	minlength=100
Max Contig length	iMAP/code/seqprocessing	01_assemble_paired_reads.batch	maxlength=300
Min alignment length	iMAP/code/seqprocessing	02_align_for_16S_consensus.batch	minlength=100
Max alignment length	iMAP/code/seqprocessing	02_align_for_16S_consensus.batch	maxlength=300
Reference	iMAP/code/seqclassification	01_classify_seqs.batch	silva.seed.ng.fasta
Taxonomy	iMAP/code/seqclassification	01_classify_seqs.batch	silva.seed.tax
Classification cutoff	iMAP/code/seqclassification	01_classify_seqs.batch	cutoff=80
QIIME2 settings	iMAP/code/qiime2	qiime2.bash	DADA2 QC parameters are set at 0
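
As an alternative to a text editor, a numeric default can be changed from the command line. The sketch below assumes the parameter appears in the file literally as `key=number`, as the Default column suggests; `set_param` is a hypothetical helper, and it is worth inspecting the file afterwards to confirm that only the intended setting changed.

```shell
# Replace every "key=<digits>" in FILE with "key=VALUE".
# Writes through a temp file so the original keeps its permissions.
set_param() {
    file="$1"; key="$2"; value="$3"
    tmp=$(mktemp) || return 1
    sed "s/${key}=[0-9][0-9]*/${key}=${value}/g" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

For example, `set_param iMAP/code/preprocessing/04_get_highscore_reads.bash trimq 30` would raise the Phred score threshold from the default of 25 to 30.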

Set up Docker

Install Docker. Link: https://docs.docker.com/install/
Register for a Docker ID. Link: https://docs.docker.com/docker-id/

Download dependency images

Includes:

rpackages:v3.5.2 for R version 3.5.2 and several packages.
readqctools:v1.0.0 for quality control of the reads.
mothur:v1.41.3 for sequence classification and for generating mothur-based OTU tables.
qiime2core:v2019.1 for sequence classification and for generating qiime2-based OTU table.

Run the following to install the images. Alternatively, to install an individual image, use docker pull tmbuza/imagename.

# All images at once

bash iMAP/code/dockerImages.sh

# Individual image

docker pull tmbuza/imagename

Confirm the installation

docker images
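
Rather than reading the `docker images` listing by eye, the four images named above can be checked programmatically. This is a sketch only: `missing_images` is a hypothetical helper, and it relies on the `--format` option of `docker images` to print one `repository:tag` per line.

```shell
# Images used in this guide, as tmbuza/<name>:<tag>.
required_images="tmbuza/rpackages:v3.5.2 tmbuza/readqctools:v1.0.0 tmbuza/mothur:v1.41.3 tmbuza/qiime2core:v2019.1"

# Print each required image that is not installed locally.
missing_images() {
    installed=$(docker images --format '{{.Repository}}:{{.Tag}}')
    for image in $required_images; do
        if ! printf '%s\n' "$installed" | grep -qxF "$image"; then
            echo "$image"
        fi
    done
}
```

Any name printed by `missing_images` can be passed directly to `docker pull`.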

Start the analysis

Metadata profiling

containerName=report1
docker run --rm --name=$containerName -it -v $(pwd)/iMAP:/imap --workdir=/imap  tmbuza/rpackages:v3.5.2 /bin/bash

bash code/01_metadataProfiling_driver.bash
exit
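
Each stage in this guide follows the same pattern: start a container, run one driver script, and exit. A non-interactive variant of that pattern is sketched below; `imap_run` is a hypothetical helper, and it assumes the driver scripts do not need an interactive terminal.

```shell
# Run one driver script inside a container and remove the container afterwards.
# Usage: imap_run IMAGE SCRIPT   (run from the directory that contains iMAP/)
imap_run() {
    image="$1"; script="$2"
    docker run --rm -v "$(pwd)/iMAP:/imap" --workdir=/imap "$image" bash "$script"
}
```

For example, `imap_run tmbuza/rpackages:v3.5.2 code/01_metadataProfiling_driver.bash` would correspond to the metadata profiling step above.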

Read Preprocessing

containerName=readpreprocess
docker run --rm --name=$containerName -it -v $(pwd)/iMAP:/imap --workdir=/imap tmbuza/readqctools:v1.0.0 /bin/bash

bash code/02_readPreprocess_driver.bash

exit

The HTML files summarizing the FastQC reports are stored in the results/multiqc/ folder. Open them in a web browser, or from the command line (the open command below is available on macOS):

open results/multiqc/qced/R1/multiqc_report.html
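
The `open` command is specific to macOS; most Linux desktops provide `xdg-open` instead. A small sketch that picks between the two is shown below, with `report_opener` as a hypothetical helper name.

```shell
# Print the command that opens a file in the default application,
# or nothing if the platform is not recognized.
report_opener() {
    case "$(uname -s)" in
        Darwin) echo open ;;
        Linux)  echo xdg-open ;;
        *)      echo "" ;;
    esac
}
```

With this defined, `$(report_opener) results/multiqc/qced/R1/multiqc_report.html` opens the report on either platform, provided a desktop browser is available.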

Preprocessing progress report

containerName=report2
docker run --rm --name=$containerName -it -v $(pwd)/iMAP:/imap --workdir=/imap  tmbuza/rpackages:v3.5.2 /bin/bash

bash code/progressreport2.bash
exit

MOTHUR: Sequence Processing and classification

Create a mothur container for sequence processing and classification.

containerName=mothurseqprocessing
docker run --rm --name=$containerName -it -v $(pwd)/iMAP:/imap --workdir=/imap tmbuza/mothur:v1.41.3 /bin/bash

Run the sequence processing and classification command, which implements the following:
- Download reference alignments
  - Default: SILVA seed
- Assemble the forward and reverse reads, screen by length and create representative sequences
- Align representative sequences with reference alignments. Default: SILVA seed.
- Denoise to remove poor alignments
- Remove chimeric sequences.
- Classify the sequences and do post-classification QC.
- Estimate the sequencing error rate.

bash ./code/03_imapClassifySEQ_driver.bash

You may see a lot of warnings. It is safe to ignore them.

The program removes all temporary files after it finishes processing the sequences. If no such files are found, you may see an error message that reads: rm: cannot remove '.temp': No such file or directory

OTU clustering, Taxonomy assignment and preliminary analysis (Mothur)

Phylotype-based method (works for any dataset size).

bash ./code/04_1_phylotype_driver.bash

OTU-cluster method (works best for small datasets).

bash ./code/04_2_otucluster_driver.bash

Phylogeny-based method (works best for small datasets).

bash ./code/04_3_phylogeny_driver.bash
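
When scripting the analysis, it can be convenient to select one of the three methods by name. The mapping below is a sketch; `method_driver` and the short method names are hypothetical, while the script paths are the ones listed above.

```shell
# Map a short method name to its driver script.
method_driver() {
    case "$1" in
        phylotype)  echo ./code/04_1_phylotype_driver.bash ;;
        otucluster) echo ./code/04_2_otucluster_driver.bash ;;
        phylogeny)  echo ./code/04_3_phylogeny_driver.bash ;;
        *)          echo "unknown method: $1" >&2; return 1 ;;
    esac
}
```

Inside the mothur container, `bash "$(method_driver phylotype)"` would then run the phylotype-based method.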

Sequence processing progress report

containerName=report3
docker run --rm --name=$containerName -it -v $(pwd)/iMAP:/imap --workdir=/imap  tmbuza/rpackages:v3.5.2 /bin/bash

bash code/progressreport3.bash
exit

Data Transformation

containerName=datatransformation
docker run --rm --name=$containerName -it -v $(pwd)/iMAP:/imap --workdir=/imap  tmbuza/rpackages:v3.5.2 /bin/bash

bash code/datatransformation.bash
exit

OTU analysis progress report

containerName=report4
docker run --rm --name=$containerName -it -v $(pwd)/iMAP:/imap --workdir=/imap  tmbuza/rpackages:v3.5.2 /bin/bash

bash code/progressreport4.bash
exit

Statistical analysis

Statistical analysis compares variables, and the variables of interest are specific to each study. Below are links to the most important statistical analyses in microbiome studies:

QIIME2: Sequence Processing and Classification

Requires a QIIME2 trained classifier.
You can train your own classifier using the q2-feature-classifier.
Classifier: Naive Bayes classifiers trained on GreenGenes database with 99% OTUs.
Download pretrained classifiers for QIIME2 sequence classification:
- The 515-806 conservative fragments
  - iMAP default due to its small size.
  - Can be spanned by sequencing 200–300 nt from both ends using Illumina MiSeq.
- Alternative pretrained classifiers are available, including SILVA and full-length Greengenes (see the links in Table 1).

Download 515-806 conservative fragments

bash iMAP/code/qiime2/qiime2_gg_classifier_fragments.bash

Download full length greengenes classifier

If using the full-length Greengenes classifier or any other pretrained QIIME2-formatted classifier, you must replace the default setting in the executable file (see details below).

bash iMAP/code/qiime2/qiime2_gg_classifier_fulllength.bash

The table below gives the file to be altered. Find the string "gg-13-8-99-515-806-nb-classifier.qza" and replace it with the name of your preferred classifier.

Parameter to change	Filename	Default
Classifier	iMAP/code/qiime2/qiime2.bash	gg-13-8-99-515-806-nb-classifier.qza
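
The find-and-replace can also be done from the command line. The sketch below swaps every occurrence of the default classifier name in a file; `set_classifier` is a hypothetical helper, and it assumes the new name contains no `|` or `&` characters, which have special meaning in the `sed` expression used here.

```shell
# Replace the default classifier file name in FILE with NEW.
# Writes through a temp file so the original keeps its permissions.
set_classifier() {
    file="$1"; new="$2"
    old='gg-13-8-99-515-806-nb-classifier\.qza'
    tmp=$(mktemp) || return 1
    sed "s|${old}|${new}|g" "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

For example, `set_classifier iMAP/code/qiime2/qiime2.bash my-classifier.qza` would switch the pipeline to a classifier file named `my-classifier.qza` (a placeholder name).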

Create QIIME2 container

containerName=qiime2classification
docker run --rm --name=$containerName -it -v $(pwd)/iMAP:/imap --workdir=/imap  tmbuza/qiime2core:v2019.1 /bin/bash

bash code/qiime2/qiime2.bash
exit

View QIIME 2 results

Output path: iMAP/data/qiime2/results/

Use client-side interface: https://view.qiime2.org/ to view the results.

Simply drag and drop the QIIME 2 artifacts (.qza files) or the visualizations (.qzv files).

For more help visit https://view.qiime2.org/about.
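
Before dragging files into the viewer, it can help to list what the run produced. A minimal sketch follows, with `list_qiime2_outputs` as a hypothetical helper that defaults to the output path given above.

```shell
# List QIIME 2 artifacts (.qza) and visualizations (.qzv) under DIR, sorted.
list_qiime2_outputs() {
    dir="${1:-iMAP/data/qiime2/results}"
    find "$dir" -type f \( -name '*.qza' -o -name '*.qzv' \) | sort
}
```

Run `list_qiime2_outputs` from the directory that contains the iMAP folder, or pass another directory as the first argument.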

Useful commands

1. Convert mothur biom file within QIIME2

The output is a file containing OTUs and taxonomy

containerName=biomconvertmothur
docker run --rm --name=$containerName -it -v $(pwd)/iMAP:/imap --workdir=/imap  tmbuza/qiime2core:v2019.1 /bin/bash

bash code/qiime2/convertmothur_biom.bash
exit
