showyourwork / showyourwork

A workflow for reproducible and open scientific articles

Home Page: https://show-your.work

Ideas for implementing a "staged" build process

dfm opened this issue · comments

Something that came up a few times at yesterday's hack day was that it would be nice to have a "staging" (working title) feature, where users could explicitly choose to save to or restore from a granular cache system. This could be one approach to something like the "draft" mode discussed in #298, or a next step for #314.

So as not to forget, I wanted to sketch the idea (plus some implementation thoughts) here. The idea is motivated by the fact that many hack day participants said they find the caching workflow somewhat counterintuitive. They also said that a barrier to entry is the long (and sometimes unpredictable) run time when using SYW with large datasets, or during fast writing phases when you don't care so much about figure precision. Our proposed solution was to collect steps of the build process into "stages" that can be saved as snapshots (to Zenodo, or even an orphan Git branch, depending on the use case) and then explicitly restored at build time via a command line interface.

The stages that we proposed were:

  1. Script dependencies: for example, external or intermediate data files
  2. Generated artifacts: figures and other resources (TeX tables, etc.) required to build the document

However, we would like to implement the feature so that it can also support other stage structures.

Then the interface could be something like:

showyourwork snapshot "artifacts"

to save a snapshot of the "artifacts" stage, and then

showyourwork build --restore="artifacts"

to run a build in which the artifacts are restored from the most recent snapshot. The build then proceeds using those restored figures, or fails with an error if any are missing, rather than trying to generate the missing artifacts. (The specifics of the interface are very much up for discussion!)
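One way to picture that command line is the sketch below. It uses Python's standard argparse module purely for illustration; the `build_parser` helper, the parser layout, and the choice to let `--restore` be given more than once are assumptions made here, not the actual showyourwork CLI.

```python
import argparse


def build_parser():
    """Build a hypothetical parser mirroring the two proposed commands."""
    parser = argparse.ArgumentParser(prog="showyourwork")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # `showyourwork snapshot "artifacts"` saves a snapshot of one stage.
    snapshot = subparsers.add_parser("snapshot")
    snapshot.add_argument("stage")

    # `showyourwork build --restore="artifacts"` restores the named stage(s)
    # from a snapshot instead of regenerating them. Repeating the flag to
    # restore several stages is an assumption of this sketch.
    build = subparsers.add_parser("build")
    build.add_argument("--restore", action="append", default=[], metavar="STAGE")

    return parser


# Parse explicit argument lists so the example does not depend on sys.argv.
snapshot_args = build_parser().parse_args(["snapshot", "artifacts"])
build_args = build_parser().parse_args(["build", "--restore=artifacts"])
print(snapshot_args.command, snapshot_args.stage)
print(build_args.command, build_args.restore)
```

The list collected by `--restore` would then play the role of the set of stages to restore in the Snakemake-side logic.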

Inside Snakemake, a possible interface would be something like:

rule generate_a_figure:
    output:
        staged("path/to/figure.pdf", stage="artifacts")

where staged has the following implementation:

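# Maps each stage name to the list of files registered under it.
STAGES = {}

def staged(*files, stage="default"):
    # Record the files so the restore/snapshot rules can be generated later.
    STAGES.setdefault(stage, []).extend(files)
    # When this stage is being restored, declare no outputs here so that the
    # rule is short-circuited and the restore rule provides the files instead.
    if stage in RESTORE_STAGE:
        return []
    return list(files)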

Here RESTORE_STAGE is a list or set of stage names to restore, constructed from the configuration settings. Then, after all rules that use staged are defined, we'd need to add some other rules to restore from, or save to, the cache:

for name, files in STAGES.items():
    if name in RESTORE_STAGE:
        rule:
            name:
                f"restore_stage_{name}"
            output:
                files
            # ... + some implementation to restore from a snapshot
    else:
        rule:
            name:
                f"snapshot_stage_{name}"
            input:
                files
            # ... + some implementation to save a snapshot
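The elided save and restore steps could be filled in many ways. As a minimal sketch, assume a snapshot is just a local gzipped tar archive per stage (rather than Zenodo or an orphan Git branch) and that all staged paths are relative. The `save_snapshot` and `restore_snapshot` helpers, along with their `snapshot_dir` and `root` parameters, are hypothetical names introduced for illustration.

```python
import tarfile
from pathlib import Path


def save_snapshot(stage, files, snapshot_dir="snapshots", root="."):
    """Archive the files of one stage into <snapshot_dir>/<stage>.tar.gz."""
    archive = Path(snapshot_dir) / f"{stage}.tar.gz"
    archive.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "w:gz") as tar:
        for name in files:
            # Store each file under its relative name so it restores in place.
            tar.add(Path(root) / name, arcname=name)
    return archive


def restore_snapshot(stage, files, snapshot_dir="snapshots", root="."):
    """Restore the requested files, failing if the snapshot lacks any."""
    archive = Path(snapshot_dir) / f"{stage}.tar.gz"
    with tarfile.open(archive, "r:gz") as tar:
        available = set(tar.getnames())
        missing = [name for name in files if name not in available]
        if missing:
            # Mirror the proposal: fail with an error instead of regenerating.
            raise FileNotFoundError(f"not in snapshot {stage!r}: {missing}")
        for name in files:
            target = Path(root) / name
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(tar.extractfile(name).read())
```

In the generated rules, something like these helpers would be called from the body of each restore or snapshot rule, with the stage's file list as the argument. Raising an error on missing files matches the strict behavior described for `--restore`; a more forgiving "draft" stage would need a different policy.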

The staged wrapper explicitly short-circuits any rule that has staged outputs, but it does require explicit buy-in from the user. I think it might be worth having a special stage that always tracks the current version of the document artifacts (similar to the Overleaf document), since it would be great to always have access to a "draft" build. In that case, perhaps we'd want to allow missing figures.
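To make the short-circuit behavior concrete, here is a sketch that runs outside Snakemake. It re-implements the staged helper and assumes, purely for illustration, that the stages to restore arrive as a comma-separated "restore" entry in a plain config dictionary. That key, its format, and the "dependencies" stage name are hypothetical.

```python
# Hypothetical configuration; the "restore" key and its comma-separated
# format are assumptions made for this sketch.
config = {"restore": "artifacts"}

RESTORE_STAGE = {
    name.strip() for name in config.get("restore", "").split(",") if name.strip()
}
STAGES = {}


def staged(*files, stage="default"):
    """Register files under a stage; hide them as outputs when restoring."""
    STAGES.setdefault(stage, []).extend(files)
    return [] if stage in RESTORE_STAGE else list(files)


# A rule in a restored stage ends up declaring no outputs, so the workflow
# has no reason to run it to produce the figure; the restore rule owns it.
figure_outputs = staged("path/to/figure.pdf", stage="artifacts")

# A rule in a stage that is not being restored keeps its outputs as usual.
data_outputs = staged("path/to/data.csv", stage="dependencies")

print(figure_outputs)  # []
print(data_outputs)    # ['path/to/data.csv']
```

Either way, every file is still recorded in STAGES, which is what lets the restore and snapshot rules be generated after all other rules are defined.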

As an update: I've just released an initial draft of this as an external package that the next SYW can depend on: https://github.com/dfm/snakemake-staging

cc @katiebreivik, @adrn, @jiayindong