Best practice for cross-workflow specification of files

Question

SilasK opened this issue a year ago · comments

Silas Kieser commented a year ago

I would like to discuss what is the best way to specify files in a way that they can be used across workflows.

Take the example of two workflows, e.g.:

Workflow 1: reads --> assembly

Workflow 2: assembly + reads --> assembly statistics ...

What is the best way to specify the reads and assembly so that they can be used by different workflows?
Take into account that:
Requirement A: The reads might be used in multiple places in Workflow 2.
Requirement B: The reads will probably be used to infer the total number of samples in the target rule.

With sub-workflows, it would be possible to define otherworkflow(file)

But I think the recommended way now is to use modules and to import the rules of Workflows 1 and 2 into a new workflow.
But then I would need to know which rules I have to modify to adapt the file specification. This would have to be documented in the README of each workflow.

I don't see how this can be done without massively modifying many rules of an imported workflow.

Any thoughts?

ningOTI · Answer 1 · Wed May 31 2023 04:34:33 GMT+0800 (China Standard Time)

Here's a first attempt:

Workflow 1's input reads are specified in a YAML configuration file, and the final assembly file is tagged, either in its contents (e.g. header lines) or in its filename, with a hash representing the input reads used to generate it, e.g. a hash of the read hashes.

Workflow 2 also takes its input reads and input assembly from a YAML configuration file. Either on each run or through a dummy output, it checks that the read information recorded with the input assembly matches the set of input reads it was given.
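To make the "hash of read hashes" idea concrete, here is a minimal Python sketch. It assumes SHA-256 and an order-independent combination of the per-file digests; the function names are hypothetical, and how the fingerprint is stored with the assembly (header line or filename) is left open.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reads_fingerprint(read_paths):
    """Combine per-file digests into one 'hash of read hashes'.

    Sorting the digests makes the result independent of file order.
    """
    per_file = sorted(file_sha256(path) for path in read_paths)
    return hashlib.sha256("\n".join(per_file).encode()).hexdigest()


def assembly_matches_reads(recorded_fingerprint, read_paths):
    """True if the fingerprint stored with the assembly matches the given reads."""
    return recorded_fingerprint == reads_fingerprint(read_paths)
```

Workflow 1 would write `reads_fingerprint(...)` next to (or into) the assembly, and Workflow 2 would call `assembly_matches_reads(...)` before using it. Hashing large read files on every run can be slow, which is one argument for the dummy-output variant.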

Silas Kieser · Answer 2 · Wed May 31 2023 17:14:43 GMT+0800 (China Standard Time)

So your idea would be to define the paths to the files?

Something like:

config.yaml

read_file_format: "QC/qc_reads/{sample}_{fraction}.fastq.gz"
assembly_file_format: "Assembly/assemblies/{sample}.fasta.gz"
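As a rough sketch of how such path patterns could serve both requirements, the plain-Python helpers below fill a pattern for a given sample and infer the sample list from existing read paths. The helper names and the example values `S1`, `S2`, `R1`, `R2` are hypothetical; inside Snakemake the built-in wildcard handling would normally do this job.

```python
import re

# Patterns copied from the config above.
config = {
    "read_file_format": "QC/qc_reads/{sample}_{fraction}.fastq.gz",
    "assembly_file_format": "Assembly/assemblies/{sample}.fasta.gz",
}


def pattern_to_regex(pattern):
    """Turn '{name}' placeholders into named regex groups."""
    parts = re.split(r"\{(\w+)\}", pattern)
    # Even indices hold literal text, odd indices hold placeholder names.
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>[^/]+?)"
        for i, part in enumerate(parts)
    )
    return re.compile(f"^{regex}$")


def infer_samples(pattern, paths):
    """Return the sorted, unique 'sample' values among paths matching the pattern."""
    rx = pattern_to_regex(pattern)
    matches = (rx.match(path) for path in paths)
    return sorted({m.group("sample") for m in matches if m})


def assembly_targets(samples):
    """Fill the assembly pattern once per sample."""
    return [config["assembly_file_format"].format(sample=s) for s in samples]


# Hypothetical list of read files found on disk.
read_paths = [
    "QC/qc_reads/S1_R1.fastq.gz",
    "QC/qc_reads/S1_R2.fastq.gz",
    "QC/qc_reads/S2_R1.fastq.gz",
]
samples = infer_samples(config["read_file_format"], read_paths)
targets = assembly_targets(samples)
```

Note that `{sample}_{fraction}` is ambiguous when sample names themselves contain underscores; this sketch resolves it by giving `sample` the shortest possible match.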

Silas Kieser · Answer 3 · Wed May 31 2023 17:18:10 GMT+0800 (China Standard Time)

One could also use a TSV file whose column headers are specified in a config file.

Ideally this would use PEPs (Portable Encapsulated Projects): https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#configuring-scientific-experiments-via-peps
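For the plain TSV variant (without PEP), a minimal sketch could look like the following. The config keys and column names (`sample_name`, `fq1`, `fq2`, `assembly`) are made up for illustration and are not part of any existing workflow.

```python
import csv
import io

# Hypothetical config: maps logical file roles to column headers in the TSV.
config = {
    "sample_column": "sample_name",
    "file_columns": {"reads_R1": "fq1", "reads_R2": "fq2", "assembly": "assembly"},
}


def load_sample_table(handle, config):
    """Read a TSV sample sheet into {sample: {role: path}} using configured headers."""
    reader = csv.DictReader(handle, delimiter="\t")
    wanted = [config["sample_column"], *config["file_columns"].values()]
    missing = [col for col in wanted if col not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"sample sheet lacks columns: {missing}")
    table = {}
    for row in reader:
        table[row[config["sample_column"]]] = {
            role: row[col] for role, col in config["file_columns"].items()
        }
    return table


# Usage with an in-memory sheet; a real workflow would open the TSV file instead.
sheet = io.StringIO(
    "sample_name\tfq1\tfq2\tassembly\n"
    "S1\treads/S1_R1.fastq.gz\treads/S1_R2.fastq.gz\tasm/S1.fasta.gz\n"
    "S2\treads/S2_R1.fastq.gz\treads/S2_R2.fastq.gz\tasm/S2.fasta.gz\n"
)
samples = load_sample_table(sheet, config)
n_samples = len(samples)
```

Each workflow would then look files up by role (`samples["S1"]["assembly"]`) rather than hard-coding paths, and the number of samples falls out of the table directly.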