Pinjontall94/asd-q2

asd-q2 (Real name TBD)

This is a simple, Snakemake-based pipeline that takes the NCBI accession list for a given SRA record and performs de novo OTU clustering via Qiime2's vsearch wrapper (q2-vsearch). End results are written as Qiime2 artifacts (.qza files), which can be extracted like any .zip file if you so choose.

Preparation, or "Before you run snakemake"

Clone this repository, then create and activate a new conda environment from the provided environment file:

git clone --depth 1 git@github.com:Pinjontall94/asd-q2.git /your/new/analysis/folder
mamba env create -f environment.yaml 
conda activate snakeqiimer

Note: the standard conda tool that ships with Anaconda will work, but, as Snakemake itself recommends, I strongly encourage you to use mamba (either on its own or via the mambaforge distribution).

Download the NCBI Accession List (e.g. "SRR_Acc_List.txt") and move it into the asd-q2 folder.
Run the following in the asd-q2 folder:

python scripts/srr_munch.py -i SRR_Acc_List.txt -o data
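
For orientation, here is a minimal sketch of how an accession list like the one above could be read and sanity-checked before anything is downloaded. It is not the code in scripts/srr_munch.py; the function name `read_accessions` and the accession pattern are assumptions made for illustration.

```python
import re
from typing import List

# Assumption: one run accession per line, e.g. SRR1234567 (SRA),
# ERR1234567 (ENA), or DRR1234567 (DDBJ).
ACCESSION_RE = re.compile(r"^[SED]RR\d+$")


def read_accessions(text: str) -> List[str]:
    """Return the unique run accessions in `text`, in first-seen order."""
    seen = set()
    accessions = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if not ACCESSION_RE.match(line):
            raise ValueError(f"not a run accession: {line!r}")
        if line not in seen:
            seen.add(line)
            accessions.append(line)
    return accessions


# Hypothetical list contents, with a blank line and a duplicate entry.
example = "SRR1234567\nSRR1234568\n\nSRR1234567\n"
print(read_accessions(example))
```

Rejecting malformed lines up front is a design choice here: a typo in the list fails immediately rather than partway through a long download.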

Modify the config file ("config.yaml") to fit your analysis. Update the following parameters, in plain text unless otherwise specified:

"AUTHOR": a string containing no spaces (e.g. "Franklin_53")
"primers", "FWD" and "REV": the forward and reverse primer sequences (e.g. FWD: GTGCCAGCMGCCGCGGTAA)
Optional: "offset", "FWD" and "REV": integer values only (e.g. FWD: 5), giving the number of nucleotides to trim from the reads' 5' and 3' ends, respectively
Optional: "THREADS", specify the number of CPU threads to allocate to the pipeline (e.g. THREADS: 8)

Example config:

AUTHOR: "Franklin_53"

primers:
  FWD: GTGCCAGCMGCCGCGGTAA
  REV: ATTAGASACCCBDGTAGTCC

# Number of nucleotides to trim from reads' 5' (FWD) and 3' (REV) ends
offset:
  FWD: 5
  REV: 4

THREADS: 8
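
A quick sanity check of these values before launching the pipeline can save a failed run. The sketch below works on the config after it has been loaded into a plain dict (for example with a YAML parser). The key names come from the example config; the function name `validate_config` and the specific rules (IUPAC nucleotide codes for primers, non-negative offsets, positive thread count) are assumptions, not checks the pipeline is known to perform.

```python
# IUPAC nucleotide codes, including the ambiguity codes used in the
# example primers (M, S, B, D, ...).
IUPAC_BASES = set("ACGTRYSWKMBDHVN")


def _is_int(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_config(cfg: dict) -> list:
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []

    author = cfg.get("AUTHOR")
    if not isinstance(author, str) or not author or any(c.isspace() for c in author):
        problems.append("AUTHOR must be a non-empty string with no spaces")

    primers = cfg.get("primers")
    if not isinstance(primers, dict):
        primers = {}
    for key in ("FWD", "REV"):
        seq = primers.get(key)
        if not isinstance(seq, str) or not seq or not set(seq.upper()) <= IUPAC_BASES:
            problems.append(f"primers.{key} must be an IUPAC nucleotide sequence")

    # "offset" and "THREADS" are optional, so only check what is present.
    offset = cfg.get("offset")
    if isinstance(offset, dict):
        for key, value in offset.items():
            if not _is_int(value) or value < 0:
                problems.append(f"offset.{key} must be a non-negative integer")

    threads = cfg.get("THREADS")
    if threads is not None and (not _is_int(threads) or threads < 1):
        problems.append("THREADS must be a positive integer")

    return problems


# The example config from this README, written as a dict.
example_cfg = {
    "AUTHOR": "Franklin_53",
    "primers": {"FWD": "GTGCCAGCMGCCGCGGTAA", "REV": "ATTAGASACCCBDGTAGTCC"},
    "offset": {"FWD": 5, "REV": 4},
    "THREADS": 8,
}
print(validate_config(example_cfg))
```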

Optional: Visualize the pipeline

Note: Requires Graphviz to be installed

(snakeqiimer) /your/new/analysis/folder/asd-q2 $ snakemake --dag | dot -Tsvg > dag.svg

Run the pipeline

Locally / On your device:

Run with:

(snakeqiimer) /your/new/analysis/folder/asd-q2 $ snakemake -cN  # where N = number of cores

Your output files will be stored in a newly created "OTUs" folder.

Pipeline Stages

Download and unzip all fastq.gz's listed in the accession list as SRR numbers, and place them in a "data" folder
Generate a Qiime2-compatible manifest file for the resulting fastqs (Note: only tested on PHRED33 fastqs)
Import the sequences into a Qiime2 artifact
Merge paired-end reads with q2-vsearch's join pairs
Dereplicate the SampleData[Sequences] artifact
De novo cluster FeatureTable[Frequency] and FeatureData[Sequence] artifacts
Generate FeatureTable and FeatureData summaries
Create a tree for phylogenetic diversity analyses
Determine alpha and beta diversity
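
To make the manifest stage above more concrete, here is a rough sketch of building a paired-end manifest in Qiime2's tab-separated V2 layout (the layout used by the PairedEndFastqManifestPhred33V2 import format). It is not the pipeline's own rule: the function name `build_manifest` and the `<run>_1.fastq` / `<run>_2.fastq` file naming are assumptions.

```python
from pathlib import Path
from typing import Iterable

# Column names of Qiime2's paired-end V2 manifest format.
HEADER = "sample-id\tforward-absolute-filepath\treverse-absolute-filepath"


def build_manifest(fastq_paths: Iterable[str]) -> str:
    """Pair up forward/reverse fastqs by run ID and return the manifest text."""
    pairs = {}
    for p in fastq_paths:
        path = Path(p)
        # Assumption: forward reads end in _1.fastq, reverse reads in _2.fastq.
        for suffix, slot in (("_1.fastq", 0), ("_2.fastq", 1)):
            if path.name.endswith(suffix):
                sample = path.name[: -len(suffix)]
                # The manifest format expects absolute file paths.
                pairs.setdefault(sample, [None, None])[slot] = str(path.resolve())

    rows = [HEADER]
    for sample in sorted(pairs):
        fwd, rev = pairs[sample]
        if fwd is None or rev is None:
            raise ValueError(f"{sample}: missing forward or reverse read file")
        rows.append(f"{sample}\t{fwd}\t{rev}")
    return "\n".join(rows) + "\n"


# Hypothetical file names in the "data" folder.
print(build_manifest(["data/SRR1234567_1.fastq", "data/SRR1234567_2.fastq"]))
```

Raising on an unpaired file is deliberate in this sketch, since a paired-end import with a missing mate would fail later with a less obvious message.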

TODO

Add conditional to handle all PHRED values compatible with Qiime2
Add rule for Qiime2 that uses the artifact api
Add examples folder showing sample workflows
Organize rules into a separate folder? (Maybe not necessary)
Add instructions for running remotely via slurm and/or GCP
