Snakemake workflow for comparitive genomics

The snakemake workflow takes whole genome sequencing data, including illumina and Nanopore reads, to assemble to draft genomes. A compartive genomics analysis would be applied on all these draft genomes.

Currently the workflow is still in the initial stage. Please do not use the workflow on publication.

The figure below shows the steps for this workflow.

Install the workflow management system snakemake

Snakemake is a Python based language and execution environment for GNU Make-like workflows. The workflow is defined by specifying rules in a snakefile. Rules further specify how to create sets of output files from sets of input files as well as the parameters and the command.

Snakemake automatically determines the dependencies between the rules by matching file names. Please refer to snakemake tutorial page to see how to define the rules.

We have to make sure whether snakemake is installed in your system before we start.

UCR hpcc

UCR hpcc have snakemake installed in environment module. If you're on UCR hpcc you can simply load snakemake module by the following command.

module load snakemake

If you're not on UCR hpcc

If you're NOT on UCR hpcc and you don't have snakemake in the enviornment module, please follow the below steps to create a snakemake conda environment.

First, create an environment named snakemake.
```
conda create -n snakemake
```
After create the environment, activate it by:
```
conda activate snakemake
```
Install the package mamba, which is a faster version of conda.
```
conda install -c conda-forge mamba
```
After mamba being installed, you can later switch from conda install [package] to mamba install [package] to speed up the package installation.
Next, install the package snakemake through mamba.
```
mamba install snakemake
```
Then you can execute the snakemake --help to show the snakemake helping page.

Clone the workflow

Clone the repo to your computer.

Clone by the following command if you're using public key for github connection.

git clone --recurse-submodules git@github.com:chtsai0105/smk-compartive_genomics.git

Or clone by https link.

git clone --recurse-submodules https://github.com/chtsai0105/smk-compartive_genomics.git

Otherwise, clone the submodules as a second step by:

cd smk-compartive_genomics
git submodule update --init

Next, go to the directory by cd smk-compartive_genomics. The entire folder structure and its details are listed below:

.
├── config/
│   ├── config.yaml             # Define the path for data and metadata.
│   ├── sample.csv              # The metadata for samples. Define the names of the samples and the fastq files.
├── workflow/
│   ├── rules/                  # The folder that contains the rules/submodules of the workflow.   
│   ├── envs/                   # The folder that contains the yaml config for conda environments.
│   ├── wrappers/               # The folder that contains the self-defined wrappers.
│   ├── snakefile               # The workflow entrypoint. Define the targets for the workflow.
├── data/                       # The folder for the data and the workflow outputs.
│   ├── fastq/                  # The folder that the initial fastq files should be placed.
├── slurm/                      # The folder that contains the slurm profile for batch partition@UCR hpcc.
└── run_snakemake.bash          # The bash script for running the workflow.

Configure the workflow

You can edit the config.yaml to setup the behavior of the workflow.

The key metadata refers to the sample.csv, which have all the details of the sample.

The key partition is used to define the partition parameters. By defining it here we can avoid hardcoded the partion name directly in the script. Right now we have a highmem partition for the assembly process.

The other keys represent major steps in the workflow. You can switch on/off a particular step by changing the subkey run to True/False. You can also change the output path in some of the steps.

Define the samples

You should properly defined your metadata, which is recorded in the sample.csv, before running the workflow.

There are 4 columns in this csv table - sample, R1, R2 and interleaved.

The column sample defines the sample name. You can change it to names which are more distinguishable instead of accession numbers. These names will also being used as the wildcard to define the workflow outputs.

The column R1 and R2 defines the fastq file names you put in the folder data/fastq. If the sequencing data is in interleaved format. Please leave the R1 and R2 blanked and fill the interleaved file names to the interleaved column.

Please make sure they are identical to the fastq files you have otherwise the workflow may have trouble to input the files. Please also confirm that the names in each column are unique.

Run the workflow

After compiling the template and setup the paramters, the next step is to run the workflow.

Snakemake provides a dry-run feature to examine the workflow before truly running it. You should always test the workflow beforehand to make sure it execute as expected by the following command:

snakemake -np

After confirming all the steps. You can run the workflow by executing the script run_snakemake.bash or the following command:

snakemake -p --profile slurm

Note that there are additional parameter defined in the slurm profile, please refer to smk_profile-slurm for more details.

chtsai0105 / smk-compartive_genomics