Ideas for implementing a "staged" build process

Question

Ideas for implementing a "staged" build process

dfm opened this issue a year ago · comments

Something that came up a few different times in the hack day yesterday was that it would be nice to have a "staging" (working title) feature, where users could explicitly choose to save or restore from a granular cache system. This could be one approach to something like the "draft" mode discussed in #298, or a next step for #314.

So as not to forget, I wanted to sketch the idea (+ some implementation thoughts) here. The general idea is motivated by the fact that a lot of hack day participants said that they find the caching workflow somewhat counterintuitive and that a barrier to entry is the long (and sometimes unpredictable) run times when using SYW with large datasets or during fast writing time, where you don't care so much about the figure precision. Our proposed solution was to collect steps of the build process into "stages" that can be saved as snapshots (to Zenodo, or even an orphan Git branch, depending on the use case) and then explicitly restored at build time via a command line interface.

The stages that we proposed were:

Script dependencies: for example, external or intermediate data files
Generated artifacts: figures and other resources (TeX tables, etc.) required to build the document

But, we would like to implement a feature that could support different structure.

Then the interface could be something like:

showyourwork snapshot "artifacts"

to save a snapshot of the "artifacts" stage, and then

showyourwork build --restore="artifacts"

to run a build where the artifacts are restored from the most recent snapshot and the build proceeds using those figures or fails with an error, rather than trying to generate the missing artifacts. (The specifics of the interface are very much up for discussion!)

Inside Snakemake, a possible interface would be something like:

rule generate_a_figure:
    output:
        staged("path/to/figure.pdf", stage="artifacts")

where staged has the following implementation:

STAGES = {}
def staged(*files, stage="default"):
    STAGES[stage] = STAGES.get(stage, [])
    STAGES[stage].extend(files)
    if stage in RESTORE_STAGE:
        return []
    else:
        return list(files)

Where RESTORE_STAGE is a list or set of stage names to restore, constructed from the configuration settings. Then, after all rules that use staged are defined, we'd need to add some other rules to restore from the cache:

for name, files in STAGES.items():
    if name in RESTORE_STAGE:
        rule:
            name:
                f"restore_stage_{name}"
            output:
                files
            # ... + some implementation to restore from a snapshot
    else:
        rule:
            name:
                f"snapshot_stage_{name}"
            input:
                files
            # ... + some implementation to save a snapshot

This has the behavior that it explicitly short circuits any rules that have staged outputs, but it does require explicit buy in. I think it might be worth having a special stage that always tracks the current version of the document artifacts (similar to the Overleaf document), since it would be great to always have access to a "draft" build. In that case, perhaps we'd want to allow missing figures.

Dan Foreman-Mackey · Answer 1 · Tue May 09 2023 07:21:03 GMT+0800 (China Standard Time)

As an update: I've just released an initial draft of this as an external package that the next SYW can depend on: https://github.com/dfm/snakemake-staging

cc @katiebreivik, @adrn, @jiayindong