The Cookiecutter template for all CAMP (Core Analysis Metagenomics Pipeline) modules.
- Standardized Snakemake workflow with preset working directory structures and input/output formats for multi-sample metagenomics data
- Click command line interface for streamlined parameter management
- Pre-packaged Conda environment YAMLs for easy environment setup and conflict-free dependency management
- Centralized parameter and computational resource management
- Integrated Slurm (HPC cluster job submission) and command-line modes
- Included test metagenomics sequencing dataset for checking the installation
- Pre-configured version bumping with a single command
These instructions are only for developers who want to create a module for a new analysis purpose, i.e., one that ingests new formats of input or output data. To use or extend existing analysis modules, see TBA.
- Have conda and Cookiecutter (version 1.4.0 or higher) installed in some environment.

```bash
conda install -c conda-forge cookiecutter # or...
pip install -U cookiecutter
```
- Use this template to generate a barebones CAMP analysis module and follow the prompts.

```bash
cookiecutter https://github.com/MetaSUB-CAMP/CAMP_Module_Template
```
- Set up the module environment.

```bash
conda env create -f configs/conda/module.yaml
conda activate module
```
- Develop Snakemake rules to wrap your analysis scripts and/or external programs. There is an example (`sample_rule`) and three rule templates in `Snakefile` as guidelines.
    - To write to log files, add `> {log} 2>&1` after shell commands, unless the program writes its results to standard output; in that case, use `2> {log}`. A minimal sketch of the two patterns follows this list. For commands in `run` (i.e., built-in Python scripting instead of shell), see the Python example in `workflow/Snakefile`.
    - If you're using external scripts and resource files that i) cannot easily be integrated into either `utils.py` or `parameters.yaml`, and ii) are not as large as databases that would justify an externally stored download, add them to `workflow/ext/` or `workflow/ext/scripts/`. An example of their application can be found in `rule external_rule`.
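For illustration, here is a minimal Snakefile-style sketch of the two logging patterns. The rule names, tools, and paths are hypothetical stand-ins, assuming `join` is `os.path.join` and `dirs` is the template's directory object:

```python
# Hypothetical rule: seqkit writes its results table to stdout,
# so only stderr should be redirected to the log.
rule count_reads:
    input:
        join(dirs.TMP, '{sample}_1.fastq.gz'),
    output:
        join(dirs.OUT, '0_counts', '{sample}.txt'),
    log:
        join(dirs.LOG, 'count_reads', '{sample}.log'),
    shell:
        "seqkit stats {input} > {output} 2> {log}"

# Hypothetical rule: bowtie2-build only writes progress messages,
# so both stdout and stderr can go to the log.
rule build_index:
    input:
        join(dirs.TMP, '{sample}.ctgs.fasta'),
    output:
        join(dirs.OUT, '1_index', '{sample}.1.bt2'),
    log:
        join(dirs.LOG, 'build_index', '{sample}.log'),
    params:
        prefix=join(dirs.OUT, '1_index', '{sample}'),
    shell:
        "bowtie2-build {input} {params.prefix} > {log} 2>&1"
```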
- Customize the `make_config` rule in `Snakefile` to make your final output `samples.csv`, as well as to collect any other analysis files you might want into the `final_reports` directory.
    - If you plan to integrate multiple tools into the module that serve the same purpose but with different input or output requirements (ex. for alignment, Minimap2 for Nanopore reads vs. Bowtie2 for Illumina reads), you can toggle between these different 'streams' by setting the final files expected by `make_config` using the example function `workflow_mode` (see the sketch after this list).
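A minimal pure-Python sketch of such an input function; the stream key, directories, and sample names are placeholders, and the template's actual `workflow_mode` may differ:

```python
from os.path import join

# Placeholder values; in the template these would come from utils.py and parameters.yaml
OUT_DIR = 'out'
SAMPLES = ['sampleA', 'sampleB']
config = {'stream': 'illumina'}

def workflow_mode(wildcards=None):
    """Return the final files that make_config should expect for the active stream."""
    if config['stream'] == 'nanopore':
        return [join(OUT_DIR, '2_alignment', s + '.minimap2.bam') for s in SAMPLES]
    # Default stream: Illumina short reads aligned with Bowtie2
    return [join(OUT_DIR, '2_alignment', s + '.bowtie2.bam') for s in SAMPLES]

# In Snakefile, make_config would then use it as an input function: `input: workflow_mode`
```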
- Set up a cleanup function in `workflow/utils.py` to get rid of large intermediate files (ex. SAMs, unzipped FastQs); see the sketch below.
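For instance, a minimal sketch (the function name and file patterns are hypothetical):

```python
from glob import glob
from os import remove
from os.path import join

def cleanup_files(work_dir):
    """Delete large intermediates (ex. SAMs, unzipped FastQs) once the
    final outputs have been generated."""
    for pattern in ('*.sam', '*.fastq'):
        for f in glob(join(work_dir, '**', pattern), recursive=True):
            remove(f)
```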
- Customize the structure of `configs/samples.csv` to match your input and output data, and then `ingest_samples()` in `utils.py` to properly load them. A sketch of such a loader follows this list.
    - The example here summarizes Illumina paired-end FastQs and a set of de novo assembled contigs in a FastA.
    - Update the description of the `samples.csv` input fields in the CLI.
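As a sketch, assuming hypothetical column names `sample_name`, `illumina_fwd`, `illumina_rev`, and `ctgs` matching the paired-end FastQ plus contig FastA example:

```python
from os.path import exists
import pandas as pd

def ingest_samples(samples_csv, tmp_dir):
    """Load per-sample inputs from samples.csv and return the sample names."""
    # tmp_dir is where inputs get symlinked (see the symlinking step below)
    df = pd.read_csv(samples_csv, header=0, index_col='sample_name')
    for s, row in df.iterrows():
        for col in ('illumina_fwd', 'illumina_rev', 'ctgs'):
            assert exists(row[col]), f'Missing input for sample {s}: {row[col]}'
    return list(df.index)
```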
- Fill out your module's work subdirectory structure in `utils.py`, specifically `dirs.OUT`, which is where all of the intermediate and final output files go, and `dirs.LOG`, which is where all of the logs go. Try to flesh out as much of your working directory's tree structure as possible up front (sketch below).
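A sketch of what pre-building the tree might look like (the subdirectory names are placeholders):

```python
from os import makedirs
from os.path import join

def make_work_dirs(out_dir, log_dir):
    """Pre-build the working directory tree so rules can write to it directly."""
    for d in ('0_counts', '1_index', '2_alignment', 'final_reports'):
        makedirs(join(out_dir, d), exist_ok=True)
    makedirs(log_dir, exist_ok=True)
```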
- Add any workflow-specific Python scripts to `utils.py` so that they can be called in workflow rules. This keeps the `Snakefile` workflow clean.
    - Note: Python functions imported from `utils.py` into `Snakefile` should be debugged on the command line first before being added to a rule, because Snakemake doesn't port standard output/error well when using `run:`.
- If applicable, use symlinks in `utils.py` between your (original) input data as described in `samples.csv` and the temporary directory (`dirs.TMP`) so that they're easy to find and won't be destroyed. A symlinking sketch follows this list.
    - To support relative paths for input files, the symlinking example uses `abspath()`. However, this will only work if the input files are in subdirectories of the current directory.
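A minimal sketch of the symlinking pattern (the column names and file suffixes are hypothetical, mirroring the loader sketch above):

```python
from os import symlink
from os.path import abspath, join

def link_inputs(df, tmp_dir):
    """Symlink each sample's inputs into the temporary directory under
    predictable names, without touching the originals."""
    for s, row in df.iterrows():
        # abspath() lets samples.csv hold relative paths (see the caveat above)
        symlink(abspath(row['illumina_fwd']), join(tmp_dir, s + '_1.fastq.gz'))
        symlink(abspath(row['illumina_rev']), join(tmp_dir, s + '_2.fastq.gz'))
        symlink(abspath(row['ctgs']), join(tmp_dir, s + '.ctgs.fasta'))
```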
- Some of the analysis scripts and/or external programs will probably consume a lot of threads and RAM. Customize the amount of memory and CPUs allocated to each rule in `configs/resources.yaml`; see the sketch below.
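For example, assuming the YAML has been loaded into a dict named `res` (a hypothetical name, ex. via `yaml.safe_load()`; the `sort_bam` keys are placeholders), a rule could read its allocation like this:

```python
# Hypothetical rule; the 'sort_bam' entries would be defined in configs/resources.yaml
rule sort_bam:
    input:
        join(dirs.OUT, '2_alignment', '{sample}.bam'),
    output:
        join(dirs.OUT, '2_alignment', '{sample}.sorted.bam'),
    log:
        join(dirs.LOG, 'sort_bam', '{sample}.log'),
    threads: res['sort_bam']['threads']
    resources:
        mem_mb=res['sort_bam']['mem_mb'],
    shell:
        "samtools sort -@ {threads} -o {output} {input} > {log} 2>&1"
```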
- Some rules in the module probably use constants for `shell` or `run` parameters. Add these to `configs/parameters.yaml` for easy toggling, as in the sketch below.
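A sketch of the pattern, assuming the parameters YAML is exposed through Snakemake's `config` object and that `min_read_len` is a placeholder key:

```python
rule trim_reads:
    input:
        join(dirs.TMP, '{sample}_1.fastq.gz'),
    output:
        join(dirs.OUT, '0_trimmed', '{sample}_1.fastq.gz'),
    log:
        join(dirs.LOG, 'trim_reads', '{sample}.log'),
    params:
        min_len=config['min_read_len'],  # toggled in configs/parameters.yaml
    shell:
        "fastp --length_required {params.min_len} -i {input} -o {output} > {log} 2>&1"
```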
- If applicable, update the default conda config using `conda env export > configs/conda/module.yaml` with your tools and their dependencies.
- Some of your analysis scripts and/or external programs (ex. R-based scripts) will probably have dependencies that conflict with the main environment. To handle this, create a new environment and make a new conda YAML under `configs/conda`. To use it, see the usage of the `conda` option in `first_rule` for an example; a rough sketch also follows below.
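The pattern looks roughly like this sketch (the rule, script, and YAML names are hypothetical; note that rules with a `conda:` directive require running Snakemake with `--use-conda`):

```python
rule plot_summary:
    input:
        join(dirs.OUT, 'final_reports', 'samples.csv'),
    output:
        join(dirs.OUT, 'final_reports', 'summary.pdf'),
    log:
        join(dirs.LOG, 'plot_summary.log'),
    conda:
        'configs/conda/r_env.yaml'  # path resolved relative to the Snakefile
    shell:
        "Rscript workflow/ext/scripts/plot_summary.R {input} {output} > {log} 2>&1"
```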
- Add your module's basic installation and running instructions to the `README.md`. Then, add complete documentation to the CAMP documentation repo.
- Make the default conda environment. Then, after setting the appropriate input from `test_data/` in `test_data/samples.csv`, and parameters/resources in `test_data/parameters.yaml` and `test_data/resources.yaml` respectively, run the module once through to make sure everything works.
    - If none of the available test data is appropriate, please contact us so we can coordinate the addition of new test data.
    - The default number of cores available to Snakemake is 1, which is enough for the test data, but it should probably be adjusted to 10+ for a real dataset.
    - Relative or absolute paths to the Snakefile and/or the working directory (if you're running elsewhere) are accepted!

```bash
python /path/to/camp_module/workflow/module.py test --cores 40
```
- Trim down the data in `test_data/` so that only the necessary and sufficient input data are present.
- Remove any test data files that are larger than 100 MB, because GitHub will not allow them to be pushed to the remote repo.
The `configs/conda/` directory also contains the YAML that sets up a dataviz environment, which (for now) supports Jupyter Notebooks and seaborn-based plotting. You can include a Jupyter notebook that generates preset visualizations for your module's output.
If you want your module integrated into the main CAMP module, please contact Lauren or Braden!
- Please make it clear what your module intends to do by including a summary (ex. "Module A Release X.Y.Z, which does B to input C and outputs D").
- This package was created with Cookiecutter as a simplified version of the project template.
- Free software: MIT
- Documentation: Coming soon!