This repository contains the mg_clust pipeline, a command-line application written in Bash and AWK that computes Operational Protein Units (OPUs) in metagenomic data from Open Reading Frame (ORF) amino acid sequences. It takes as input a set of preprocessed (unassembled) metagenomic samples and outputs the OPU abundance table (see figure below).
Dependencies:
MEGAHIT
BWA
SAMTools
BEDTools
Picard
MMSeqs2
FragGeneScanRs
GNU Parallel
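Before a first run it can help to confirm that the dependencies are reachable. The following is a minimal sketch, not part of mg_clust: `missing_tools` is a hypothetical helper, and the executable names passed to it are assumptions about how each tool is typically installed (for example, Picard is often shipped as a jar rather than a `picard` wrapper), so adjust them to your setup.

```shell
# Print the name of every given command that cannot be found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumed executable names for the dependencies listed above.
missing_tools megahit bwa samtools bedtools picard mmseqs FragGeneScanRs parallel
```

An empty output means all listed executables were found; any name printed needs to be installed or added to PATH.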
To see the help, run ./mg-clust.bash --help
Usage: ./mg-clust.bash <options>
--help print this help
--assem_dir CHAR directory with previously computed assemblies (format dirname/SAMPLE_NAME/SAMPLE_NAME.contigs.fa)
--assem_preset CHAR MEGAHIT preset to generate assembly (default meta-sensitive)
--compress t|f compress all output data (default f)
--clean t|f clean up intermediate data (default f)
--input_dir CHAR directory of input metagenomes
--logs_file CHAR file name to save parallel logs
--nslots NUM number of threads used (default 12)
--njobs NUM number of jobs to run in parallel (each job with nslots) (default 3)
--min_contig_length NUM minimum length of contigs; contigs shorter than this will be discarded (default 250)
--min_opu_occup NUM minimum OPU occupancy; OPUs with lower occupancy will be discarded (default 2)
--min_orf_length NUM minimum length of ORFs (amino acids); ORFs shorter than this will be discarded (default 60)
--output_dir CHAR directory to output generated data (default metaclust_output)
--overwrite t|f overwrite previous folder if present (default f)
--reads1_suffix CHAR suffix of R1 reads
--reads2_suffix CHAR suffix of R2 reads
--run_module_1 t|f run the first processing module (assemble and map reads; this module will fail if folder output-1 exists; default t)
--run_module_2 t|f run the second processing module (predict ORFs and compute ORFs coverage; this module will fail if folder output-2 exists; default t)
--run_module_3 t|f run the third processing module (concatenate data and create ORFs db; this module will fail if folder output-3 exists; default t)
--run_module_4 t|f run the fourth processing module (cluster ORFs and compute clusters abundance; folder output-4 will be kept if present; default t)
--servers CHAR,CHAR comma-separated list of servers on which to run metaclust
--train_file_name CHAR training file name used to run FragGeneScan (default illumina_1)
--thres_range NUM,NUM minimum and maximum clustering thresholds separated by a comma (default 0.7,0.9)
--thres_step NUM threshold sequence step (default 0.1)
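As an illustration of how the options above combine, here is a sketch of a typical invocation wrapped in a small function. Only the flags come from the help text; `run_example`, the `MG_CLUST` variable, the directory names, and the read suffixes `_R1.fastq.gz` and `_R2.fastq.gz` are hypothetical placeholders, and whether a suffix should include the file extension is an assumption to check against your data.

```shell
# Path to the pipeline script; MG_CLUST is a hypothetical override variable.
MG_CLUST="${MG_CLUST:-./mg-clust.bash}"

# Run all four modules with the documented default resources and thresholds.
# $1: directory of input metagenomes, $2: output directory.
run_example() {
  "$MG_CLUST" \
    --input_dir "$1" \
    --output_dir "$2" \
    --reads1_suffix _R1.fastq.gz \
    --reads2_suffix _R2.fastq.gz \
    --nslots 12 \
    --njobs 3 \
    --thres_range 0.7,0.9 \
    --thres_step 0.1
}

# Example call (placeholder directory names):
# run_example metagenomes mg_clust_results
```

With these values, each of the 3 parallel jobs uses 12 threads, so up to 36 threads may be in use at once.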