zavolanlab / zarp

The Zavolab Automated RNA-seq Pipeline

Partial execution of the workflow

dominikburri opened this issue · comments

Is your feature request related to a problem? Please describe.
The workflow currently runs alignment with STAR and pseudoalignment with Salmon and Kallisto.
Sometimes not all of the results are needed, and one does not want to clutter the output with unused files.
Furthermore, executing tasks that are not needed puts avoidable strain on computational resources (CPU time, memory usage, disk space).

Describe the solution you'd like
One solution could be to specify in the config which aligners should be executed.
The default would be the current behaviour (STAR, Salmon and Kallisto).
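
As a rough illustration of what such a config option could look like on the Python side of a Snakefile: the sketch below assumes a hypothetical `aligners` config key (ZARP does not define one) and falls back to the current behaviour when the key is absent. The function and constant names are also invented for this example.

```python
# Sketch only: the "aligners" config key is hypothetical and not part of ZARP.
SUPPORTED_ALIGNERS = ("star", "salmon", "kallisto")


def selected_aligners(config):
    """Return the requested aligners, defaulting to all supported ones."""
    requested = config.get("aligners", SUPPORTED_ALIGNERS)
    if isinstance(requested, str):
        # Allow a single tool given as a plain string, e.g. aligners: "star".
        requested = [requested]
    requested = [name.lower() for name in requested]
    unknown = sorted(set(requested) - set(SUPPORTED_ALIGNERS))
    if unknown:
        raise ValueError("Unsupported aligner(s): " + ", ".join(unknown))
    if not requested:
        raise ValueError("At least one aligner must be selected")
    # Return in a fixed order, without duplicates.
    return [name for name in SUPPORTED_ALIGNERS if name in requested]
```

With an empty config this returns all three tools, so existing setups would behave as they do today.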

Describe alternatives you've considered

  • Alternatively, different versions of ZARP could be used, with pre-defined and documented behaviour. E.g. STAR only, pseudoalignment only, all.
  • The user could simply remove the rules and statements that they don't need.

Additional context
A potential problem with this more dynamic execution pattern is the MultiQC collection step, which might throw errors for missing files. I didn't check this, though.

Thanks for the comment! What you are suggesting is more flexibility in the tools used for certain steps, and optional execution of these. Although this could be done with the current setup, I am not sure that Snakemake conditionals (which would further complicate the configs and the accompanying CLI) are really the way to go for further development, and this is certainly outside the current scope. I suggest we flag this as a future item and revisit it at a later stage. Implementing unstranded processing and optimising time scaling is perhaps of higher priority and could be looked at if we have the manpower and resources.

Yes I agree, it is not of high priority. But it's good that we are aware of it and have it on the radar.

As for the complexity: I don't know how much work it actually is, because we could simply gather only some of the files without big side effects. The exception is MultiQC, since I don't know how it interacts with the files.
So the feature could simply be a couple of "if" statements that change what the collection rule "all" gathers.
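
A minimal sketch of that idea, written in plain Python so it can be read outside Snakemake. The output paths and the function name are invented for illustration and do not match ZARP's real file layout; in a Snakefile, the returned list would be what rule "all" takes as input.

```python
# Sketch only: paths below are placeholders, not ZARP's actual outputs.
def final_targets(samples, aligners, outdir="results"):
    """Collect only the final files of the tools that were requested."""
    targets = []
    for sample in samples:
        if "star" in aligners:
            targets.append(f"{outdir}/{sample}/star/aligned.bam")
        if "salmon" in aligners:
            targets.append(f"{outdir}/{sample}/salmon/quant.sf")
        if "kallisto" in aligners:
            targets.append(f"{outdir}/{sample}/kallisto/abundance.tsv")
    return targets
```

Because Snakemake only schedules the jobs needed to produce the requested targets, leaving a file out of this list means that rules feeding only that file are not run. Whether MultiQC copes with the missing inputs is the open question mentioned above.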

I gotta admit that I'm (also) a bit skeptical. Krini is opinionated, and I think that's just fine. Pretty much every workflow is. It's certainly not (and not supposed to be) a "God workflow".

We set out to include steps that are almost always needed, in our opinion, to gain quick insights into a library and produce the types of files that are required for typical downstream analyses. No more, no less. The only thing I could see is that we either decide for only one of Salmon and Kallisto, or give users a choice for this particular tool class, because indeed users will probably rarely need both.

Regarding the resource use argument: If users need to run Krini many times, like hundreds or thousands or perhaps even hundreds of thousands of times, optimizing compute and storage needs is indeed a priority. But in that case, pruning the workflow to just the absolutely required steps is just one (rather easy) aspect out of many, along with caching, compression between steps, optimized parallelization, streaming/piping, hardware optimization etc, obviously depending on the degree of scaling. Just putting a few conditionals won't help much with that.

Assuming you mean ZARP when you write Krini, I follow.
I agree that some decisions have been made and carried through, and I also lean towards keeping things as they are. But the question of executing only part of the workflow came up.
I was thinking of offering a choice of tool class (alignment or pseudoalignment). But this would require more work because, for example, PCA is only carried out on pseudoalignment output.
About the resources: some things you can't change, like the hardware optimisation, parallelisation behaviour etc., and those we don't need to deal with. But one thing you can change is which jobs get executed. If a simple trick would do it, why not consider it? Why waste those resources?
And beyond the resources themselves, one also gets billed for using them. I guess this is most visible when executing in the cloud.
Of course, as you write, this will only really matter once ZARP is executed many times.

My main concern is that one can quickly end up going down a rabbit hole trying to cater to individual use cases. Where do you want to stop? Do you want to make every step optional? And what about dependent steps? Or just the nodes of the DAG? The issue description is not really concrete on what exactly you want to do, what's in scope and what isn't.

You are asking, basically, why not add a feature if it's easy to add. Apart from the above concern of not knowing where to stop, every feature/option makes a tool harder to use; when running it, when consuming its outputs (files may or may not be there, may be created based on different assumptions etc) and when interpreting it. For example, if a paper says it ran ZARP, as it is, everyone knows it's running this set of tools. That changes if we introduce conditionals. Then the question will be: well, which parts of ZARP were run? (Of course, with the "rule config", we already have this situation in the sense that nobody necessarily knows how a given tool was run; but at least you know that it was run)

Anyway, you say/imply that you have a concrete use case. Do you maybe wanna describe it? It may help limit the scope and turn the discussion from philosophical to technical :)

To make it concrete, one suggestion (basically the solution I gave in the description):
Three modes:

  • ZARP with all aligners
  • ZARP with STAR aligner only
  • ZARP with pseudoalignment (Salmon and Kallisto) only
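
These three modes could be expressed as a small lookup table. The mode names below are made up for this sketch and are not existing ZARP options; the PCA flag reflects the point raised earlier in the thread that PCA is only carried out on pseudoalignment output.

```python
# Sketch only: mode names are hypothetical, not existing ZARP options.
MODES = {
    "all": {"star", "salmon", "kallisto"},
    "star_only": {"star"},
    "pseudoalignment_only": {"salmon", "kallisto"},
}
PSEUDOALIGNERS = {"salmon", "kallisto"}


def plan_for_mode(mode):
    """Return the tools to run and whether the PCA step still has input."""
    if mode not in MODES:
        raise ValueError(f"Unknown mode: {mode!r}")
    tools = MODES[mode]
    return {
        "tools": sorted(tools),
        # PCA depends on pseudoalignment output, so it drops out without it.
        "run_pca": bool(tools & PSEUDOALIGNERS),
    }
```

Keeping the modes in one table makes it explicit which dependent steps disappear in each mode, which is the part that needs a decision before any implementation.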

And about the paper saying it ran ZARP: as we know, the parameters can be as important as the tool itself. If mapping was performed with another genome version, what use is that statement? How can I find out? You need this information anyway, as you write yourself.
Also, the code and rules can (and will) be adjusted and modified as needed.

But as I have already written twice, I'm fine with not following up on this. It is still good to be aware of it and to have some idea of how to approach it, should we end up dealing with it anyway.

Closing this now as it's not going to be implemented soon; it might be reopened in the future.