
mb-pipeline

Wet lab protocols

  1. DNA extraction
  2. Primary PCR. One of:
    2A. trnL (plant)
    2B. 12SV5 (vertebrate animals)
  3. Dilution
  4. Indexing PCR

Computational pipeline

These scripts produce, from Illumina MiniSeq raw sequencing data, a directory named DATE_results with the following structure:

DATE_results
 |
 +-- 0_raw
 |    
 +-- 1_trimadapter
 | 
 +-- 2_filterprimer
 |    
 +-- 3_trimprimer
 |    
 +-- 4_dada2

Requirements

Metabarcoding Pipeline Tutorial

  1. Demultiplexing
  2. Trim adapters
  3. Filter primers
  4. Trim primers
  5. Data analysis object generation and QC
    5A. Submission script
    5B. Rscript
    5C. Write Rout file

Setup

Clone this repo to the computing cluster:

#navigate to where you want to store the scripts
git clone https://github.com/bpetrone/mb-pipeline.git

After getting your data off of the sequencer, upload it to the computing cluster. For example, here is how you would upload to the DCC:

#"IlluminaRunFolder will look something like "211019_MN00462_0194_A000H3L2M7"
scp -r /path/to/your/<IlluminaRunFolder> <your-netid>@dcc-login.oit.duke.edu:/path/to/DCC/folder

Next, upload your samplesheet.csv file and make sure that you have the following file structure:

/seqdata #name this whatever you want
  -########_samplesheet.csv 
  -211019_MN00462_0194_A000H3L2M7 #note that this and the sample sheet should be in the SAME folder

See "Troubleshooting" for more information regarding sample sheet structure if you aren't sure what an Illumina sample sheet should look like.

You will also need to download the metabarcoding container file to the computing cluster you're using with the following command:

#navigate to whatever directory you want to store the container in
curl -O https://research-singularity-registry.oit.duke.edu/lad-lab/metabarcoding.sif

This is a Singularity container (a container format built for computing clusters) with all of the packages you need for the analysis pre-installed, so no other package installation is needed!
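Optionally, confirm the download worked by running a trivial command inside the container. singularity exec is the standard way to run a command in a Singularity image; note that on many clusters you first have to load Singularity (e.g. module load singularity), which is an assumption about your cluster's setup.

#optional sanity check: print R's version from inside the container
#on many clusters, first run something like: module load singularity
singularity exec metabarcoding.sif R --version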

See the instructions below for how to run each step of the pipeline. There is also a template file that will write each submission script for you if you just want to copy/paste the correct commands.

Step 1: bcl2fastq and demultiplexing

The first step of this pipeline is to convert the raw .bcl files from the sequencer into individual .fastq files for each sample. You will need:

  • The path to your Illumina run folder
  • The sample sheet name
  • The path to where you stored the metabarcoding.sif container file
  • The diet metabarcoding run type (trnL or 12SV5)

#navigate to the mb-pipeline folder on your computing cluster
sbatch --mail-user=youremail@duke.edu 1_demux-barcode.sh /path/to/metabarcoding.sif path/to/miniseq-dir XXXXXXXX_sample-sheet.csv <trnL>OR<12SV5>

This can take about an hour to run with 192 samples.
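While the job runs, you can keep an eye on it in the queue. This is a standard SLURM command and assumes your cluster uses SLURM (the DCC does):

#check on your job while it runs; replace <your-netid> with your username
squeue -u <your-netid>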

After this script has finished, you should see the following file structure:

/seqdata 
  -########_samplesheet.csv 
  -miniseq-dir
    -XXXXXXXX_results
      -0_reference
        -primers.txt
      -1_raw_demux
        #all of your demultiplexed .fastq files will be here
      -1_raw_all
        Undetermined_S0_L001_R1_001.fastq
        Undetermined_S0_L001_R2_001.fastq # these files contain everything, including PhiX and all reads that didn't match the barcodes you input

Check the .out and .err files to make sure that everything went smoothly.
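For example, something like the quick checks below; the slurm-* globs assume SLURM's default log naming, so adjust them to however your submission script names its logs.

#assumes SLURM's default slurm-<jobid>.out/.err log names; adjust as needed
tail slurm-*.out
grep -i error slurm-*.err

#run from inside XXXXXXXX_results: expect an R1 + R2 pair per sample
ls 1_raw_demux/*.fastq* | wc -l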

Troubleshooting

If demultiplexing failed, here are the most common issues:

  1. Incorrect file structure: is your sample sheet located in the same folder as your Illumina run folder?
  2. Incorrect sample sheet input: Does the input sample sheet name match what you have in the folder?
  3. Incorrect sample sheet format: bcl2fastq requires a very specific sample sheet format. If there is an empty row or a hidden character where there shouldn't be, it will give an error. Sometimes it's necessary to copy and paste the sample sheet from a run that has worked in the past (or use this sample) and then re-paste in the barcodes you used.
  4. Not using exactly "trnL" or "12SV5" as the run type argument. This input is case-sensitive; make sure you use exactly "trnL" or "12SV5" as the last argument.
  5. Other file path issues: double-check that all of the file paths are correct and that there aren't any missing slashes. Did you correctly type the miniseq-dir path?
  6. Trying to submit the command from outside the /mb-pipeline/code folder. If you aren't submitting the sbatch command from inside the mb-pipeline folder, you will need to include the full path to the script in your sbatch command, e.g.:
sbatch --mail-user=youremail@duke.edu /PATH/TO/1_demux-barcode.sh /path/to/metabarcoding.sif path/to/miniseq-dir XXXXXXXX_sample-sheet.csv <trnL>OR<12SV5>

Step 2: Trim adapters

  • Input files: Demultiplexed .fastq files
  • Output files: Trimmed .fastq files
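The repo's step-2 submission script handles this for you. Purely to illustrate what adapter trimming does, here is a hedged single-sample sketch using cutadapt; whether this pipeline actually uses cutadapt, and the adapter sequence shown, are assumptions, so check the step-2 script for the real command.

#illustrative only: trim Illumina adapters from one paired-end sample
#CTGTCTCTTATACACATCT is the standard Nextera transposase adapter;
#substitute whatever adapters your library prep actually used
cutadapt \
  -a CTGTCTCTTATACACATCT -A CTGTCTCTTATACACATCT \
  -o sample_R1.trimmed.fastq -p sample_R2.trimmed.fastq \
  sample_R1.fastq sample_R2.fastq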

Step 3: Filter primers

  • Input files: Demultiplexed and trimmed .fastq files
  • Output files:
    • Filtered .fastq files
    • .out file for each sample
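Conceptually, this step keeps only read pairs in which the expected primer is detected, without modifying the reads themselves (trimming happens in the next step). The repo's script does this for you; the cutadapt sketch below is illustrative only, with the same caveats as above, and the real primer sequences live in 0_reference/primers.txt.

#illustrative only: keep pairs with the forward primer on R1 and the
#reverse primer on R2; --action=none leaves the reads unmodified
#set FWD and REV from 0_reference/primers.txt
cutadapt \
  -g "$FWD" -G "$REV" --action=none --discard-untrimmed \
  -o sample_R1.filt.fastq -p sample_R2.filt.fastq \
  sample_R1.trimmed.fastq sample_R2.trimmed.fastq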

Step 4: dada2

Final quality-control filtering and QC plot generation are done at this step.

Input files:

3_trimprimer directory containing appropriately trimmed/filtered read files

Output files:

  • quality_F.pdf
  • quality_R.pdf
  • quality_F_summary.png
  • quality_R_summary.png
  • dada_errors_F.png
  • dada_errors_R.png
  • dadaFs.rds
  • dadaRs.rds
  • mergers.rds: merged forward and reverse reads
  • concats.rds: includes concatenated reads
  • seqtab.rds
  • seqtab_concats.rds
  • seqtab_nochim.rds: chimeras removed
  • track.rds: counts up how many reads were lost at each dada2 step due to filtering
  • seqtab_nochim_concats.rds
  • track_pipeline.csv: table of how many reads were present at each of the preceding pipeline steps
  • track_long.csv: long-format table with the number of reads for each sample at each pipeline step, including dada2; allows QC plotting
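If you want to inspect any of these outputs interactively, you can do it through the container, since singularity exec runs an arbitrary command inside the image. A minimal sketch (readRDS() is base R; the file path assumes you're in the 4_dada2 folder):

#inspect the chimera-free sequence table: rows = samples, columns = ASVs
singularity exec /path/to/metabarcoding.sif \
  Rscript -e 'seqtab <- readRDS("seqtab_nochim.rds"); dim(seqtab)'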
