snakemake-workflows / docs

Documentation of the Snakemake-Workflows project

Workflow auto-download (specified in config.yaml)

Avsecz opened this issue · comments

snakemake workflows is a great idea! Thanks for putting it together.

API Suggestion

What do you think about having a command-line tool for running any Snakemake workflow with the following user API?

  • sworkflow config.yaml [other snakemake flags]

config.yaml would be the standard config.yaml with an additional workflow entry specifying which workflow to use (as a string):

workflow: https://github.com/snakemake-workflows/single-cell-rna-seq/tree/a1be3b6b389b009d91bb1d7f75abc1b5a23cd19d

# The usual config.yaml ----------------------------------
# path to sheet describing each cell.
cells: cells.tsv

# specify count table (rows: genes/transcripts/spikes, cols: cells)
counts: counts.tsv
...
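As a rough illustration of how such a workflow entry could be interpreted, the following Python sketch splits the URL into a clone URL, a repository name, and a pinned revision. The names WorkflowRef and parse_workflow_url are hypothetical, and the sketch assumes the GitHub "<owner>/<repo>/tree/<revision>" layout used in the example above; nothing like this exists in Snakemake itself.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class WorkflowRef(NamedTuple):
    repo_url: str  # clone URL of the workflow repository
    name: str      # repository name, usable as a cache directory name
    revision: str  # commit hash, tag, or branch found after /tree/


def parse_workflow_url(url: str) -> WorkflowRef:
    """Split a '<owner>/<repo>/tree/<revision>' URL into its parts (hypothetical helper)."""
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    # Expected path layout: <owner>/<repo>/tree/<revision>
    if len(parts) < 4 or parts[2] != "tree":
        raise ValueError(f"not a '<repo>/tree/<revision>' URL: {url}")
    owner, repo = parts[0], parts[1]
    # Everything after /tree/ is treated as the revision. This is an assumption:
    # a URL that also carries a sub-path would need extra handling.
    revision = "/".join(parts[3:])
    repo_url = f"{parsed.scheme}://{parsed.netloc}/{owner}/{repo}"
    return WorkflowRef(repo_url, repo, revision)


ref = parse_workflow_url(
    "https://github.com/snakemake-workflows/single-cell-rna-seq"
    "/tree/a1be3b6b389b009d91bb1d7f75abc1b5a23cd19d"
)
print(ref.name, ref.revision)
```

Pinning to a full commit hash, as in the example entry, would make the analysis reproducible even if the workflow repository later changes.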

This functionality is conceptually similar to snakemake rule wrappers, where you refer to a command with a single string.

Implementation

The sworkflow command would do the following:

  1. Check out the workflow source code with Git into a common location (perhaps ~/.snakemake/workflows/<myworkflow>?)
  2. Run the snakemake command: snakemake --snakefile ~/.snakemake/workflows/<myworkflow>/Snakefile [other snakemake flags]
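The two steps above could be sketched in Python roughly as follows. This is only an illustration of the proposed (and so far hypothetical) sworkflow tool: plan_commands and execute are made-up names, the per-revision cache layout is an assumption added so that different pinned versions do not collide, and passing the config via --configfile is one plausible way to forward it.

```python
import subprocess
from pathlib import Path


def plan_commands(repo_url, name, revision, configfile, extra_flags=(), cache_root=None):
    """Return the git and snakemake commands the hypothetical sworkflow would run."""
    # Cache location floated in the proposal; not an existing Snakemake convention.
    root = Path(cache_root) if cache_root else Path.home() / ".snakemake" / "workflows"
    # Assumption: one checkout per pinned revision.
    checkout = root / name / revision
    commands = []
    if not checkout.exists():
        # Step 1: fetch the workflow source and pin it to the requested revision.
        commands.append(["git", "clone", repo_url, str(checkout)])
        commands.append(["git", "-C", str(checkout), "checkout", revision])
    # Step 2: run Snakemake against the cached Snakefile, forwarding extra flags.
    commands.append([
        "snakemake",
        "--snakefile", str(checkout / "Snakefile"),
        "--configfile", str(configfile),
        *extra_flags,
    ])
    return commands


def execute(commands):
    """Run each planned command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Separating the planning from the execution keeps the sketch easy to inspect: printing the result of plan_commands shows exactly what would be run before anything is cloned.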

Motivation

This command would come in handy if you want to apply a single workflow multiple times (say, when analyzing different but related datasets). Currently, you would need to check out the source code into each directory.

The thing is that the git checkout provides various benefits that you lose with such a solution:

  • tracking of code changes to the workflow
  • tracking of config changes
  • currently all of the workflows here not only rely on a config file but also on sample sheets
  • the --archive and --kubernetes functionality of Snakemake rely on the workflow code to be tracked in git
  • merging code changes back into the master branch so that others can benefit

Also note that you can already share a common version of a workflow between different working directories with snakemake --directory myworkdir.

I agree with all your arguments if you apply the workflow in a single project or tweak the workflow's source code specifically for that project. However, if you want to apply the workflow as is (without tweaking) to multiple projects tracked in different Git repositories, this would result in unnecessary code duplication. I think the second part of your answer potentially solves the described problem. For that scenario of multiple projects and a shared workflow, would you suggest doing the following?

cd my-project
snakemake --directory=. --snakefile ~/workflows/myworkflow1/Snakefile --configfile=config.yaml

This would separate the source code of the project from the workflow.

  • tracking of config changes
  • currently all of the workflows here not only rely on a config file but also on sample sheets

The config.yaml and the corresponding sample sheets would be tracked in the repository of the individual project.

  • merging code changes back into the master branch so that others can benefit

Contributing back would be easier since the code would be already stored in the original repository and wouldn't have to be copied from the project repo to the workflow repo.

Yes, indeed. In such a scenario, this sounds like the right way to do it. Note that --directory=. and the explicit config file option are superfluous if you are already in the desired working directory.
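For several projects sharing one workflow checkout, the same idea could be scripted along these lines. This is a minimal sketch, not an official recipe: snakemake_command and run_in_projects are hypothetical helper names, and it assumes the shared Snakefile locates its config file and sample sheets relative to the working directory, so that each project only needs its own config.yaml.

```python
import subprocess
from pathlib import Path


def snakemake_command(snakefile, extra_flags=()):
    """Build the snakemake call for a shared Snakefile (no --directory needed)."""
    return ["snakemake", "--snakefile", str(snakefile), *extra_flags]


def run_in_projects(snakefile, project_dirs, extra_flags=(), dry_run=True):
    """Apply one shared workflow to several projects, using each project as cwd."""
    cmd = snakemake_command(Path(snakefile).expanduser(), extra_flags)
    planned = []
    for project in project_dirs:
        planned.append((Path(project), cmd))
        if not dry_run:
            # Running from inside the project directory replaces --directory=.
            subprocess.run(cmd, cwd=project, check=True)
    return planned


# Hypothetical project directories; dry_run=True only reports what would run.
for project, cmd in run_in_projects(
    "/opt/workflows/myworkflow1/Snakefile", ["project-a", "project-b"]
):
    print(project, " ".join(cmd))
```

Each project repository then tracks only its config and sample sheets, while the workflow code lives in a single checkout.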

I now also think that, at this stage, the plain snakemake command is enough, since the number of different workflows a user would want to apply is not very high. I'll close the issue. Thanks!