ZeweiSong / metaSeq

Python tools for metagenomic data

Preprint at bioRxiv

MetaSeq (Early access stage)

This is a sequencing data processing pipeline mainly designed for single-tube Long Fragment Read (stLFR) sequencing technology.

Benchmark repository: :octocat: benchmark4stcLFR

Installation

Note: biogit is an internal website and is only accessible from the intranet at present.

Prerequisites

  • python >= 3.6
  • perl >= 5
  • metabbq ( dev repo ) - "METAgenome Bead Barcode Quantification", a launcher that initiates the working directory and calls the sub-functions.
  • cOMG ( dev repo ) (replaced by fastp)
  • fastp (dev repo ) - the dev version is mandatory since I've added a new module to fastp to handle the barcode-splitting step
  • Mash (dev repo ) - the dev version is mandatory since I've modified it to fit stLFR data
  • Community ( dev repo ) - Louvain method: Finding communities in large networks
  • Snakemake - a pythonic workflow system.
  • blast - The classic alignment tool for finding regions of similarity between biological sequences.
  • Assembly methods
    • SPAdes - SPAdes Genome Assembler
    • MEGAHIT - An ultra-fast and memory-efficient NGS assembler

I recommend installing the above tools in a virtual environment via conda:

  1. Create the environment and install part of the tools:
conda create -n metaseq -c bioconda -c conda-forge snakemake pigz megahit blast
source activate metaseq
  2. Following their corresponding documents, install fastp, SPAdes, community, etc. under the metaseq env (a sketch follows below).
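
For step 2, here is a minimal sketch of what that might look like. The clone URL is a placeholder (use the dev repositories linked above), I'm assuming the fastp dev fork builds with make like upstream fastp, and Community/Louvain should be compiled following its own documentation:

source activate metaseq
# build the dev fork of fastp (placeholder URL; replace with the dev repo linked above)
git clone <dev-fork-of-fastp> fastp-dev
cd fastp-dev && make && cp fastp "$CONDA_PREFIX/bin/" && cd ..
# SPAdes is also packaged on bioconda if the release version is acceptable
conda install -n metaseq -c bioconda spades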

Make sure the above commands (executables) can be found in your PATH.

Get the launcher: metabbq
3. Install the metaSeq pipeline to get metabbq:

cd /path/to/your/dir
git clone https://github.com/ZeweiSong/metaSeq.git
export PATH="/path/to/your/dir/metaSeq":$PATH
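
If you want the launcher to stay on your PATH across sessions, you can append the export to your shell profile (a sketch assuming a bash shell; adjust the path to your clone):

echo 'export PATH="/path/to/your/dir/metaSeq:$PATH"' >> ~/.bashrc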

I haven't yet written a testing module to check the above prerequisites. At present you may need to verify them yourself.
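
As a rough manual check, you can verify that the main executables are visible on the PATH. The list below is only a suggestion; the exact binary names (e.g. for Community/Louvain) may differ from the project names, so adjust it as needed:

for tool in metabbq snakemake fastp mash blastn megahit spades.py; do
  command -v "$tool" >/dev/null || echo "missing: $tool"
done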

Usage

Prepare configs

cd instance
metabbq cfg  

This command will create a default.cfg in your current directory. You should modify it so the launcher knows the required files and parameters.

Initiating a project

Prepare an input.list file describing each sample name and the paths to its input sequence files.
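
The exact columns expected by metabbq are not spelled out here, so the layout below is only a hypothetical illustration (tab-separated: sample name, then the paired read files); check the launcher's documentation for the real format:

# input.list (hypothetical layout, tab-separated)
sampleA    /path/to/sampleA_1.fq.gz    /path/to/sampleA_2.fq.gz
sampleB    /path/to/sampleB_1.fq.gz    /path/to/sampleB_2.fq.gz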

metabbq -i input.list -c default.cfg -V

By default, metabbq will create a directory named after each {sample}, with a sub-directory named input under it.
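
After initiation, each sample's working directory should therefore look roughly like this; the note on input/ is my reading of its purpose, and later modules will add further sub-directories:

{sample}/
└── input/    # holds the input data declared in input.list for this sample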

Run the Quality-Control module

metabbq smk -j -np {sample}/clean/BB.stat
# -j runs the jobs in parallel on the available cores/threads
# -n means dry-run, showing a preview of "what needs to be run". Remove it to actually run the pipeline.
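
For example, once the dry-run preview looks right, the same target can be built for real by dropping -n (keeping -p, which prints the shell commands, assuming these flags are passed through to Snakemake):

metabbq smk -j -p {sample}/clean/BB.stat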

Run the precluster-assembly module

You need to select an assembly tool in the config file and request the corresponding output file from the following:

metabbq smk -j -np {sample}/summary.BC.megahit.contig.fasta
metabbq smk -j -np {sample}/summary.BC.idba.contig.fasta
metabbq smk -j -np {sample}/summary.BC.spades.contig.fasta

Run the Isolate-Bead-assembly module

You need to select an assembly tool in the config file and request the corresponding output file from the following:

metabbq smk -j -np {sample}/summary.BI.megahit.contig.fasta
metabbq smk -j -np {sample}/summary.BI.idba.contig.fasta
metabbq smk -j -np {sample}/summary.BI.spades.contig.fasta

Troubleshooting

Feedback is welcome; please submit it on the issue page.

License

GNU General Public License v3.0

