showyourwork / showyourwork

A workflow for reproducible and open scientific articles

Home Page: https://show-your.work

Zenodo sandbox entries named after jobs instead of files

afarah18 opened this issue

Hello, I'm hoping to get some advice on using dynamic datasets with Zenodo Sandbox.

Summary
I've set up Zenodo Sandbox caching for two of my intermediate results, one of which is very expensive to generate. My Zenodo Sandbox authentication appears to work and the sandbox entries are created, but SYW never finds the correct files in those entries. I suspect this is because the sandbox entries are named after the job names specified in the Snakefile, while SYW looks for the file names. I can't tell whether I'm using one where I should be using the other, or whether this is a bug.

Some more details
My Snakefile contains the following:

rule datagen:
    output:
        directory("src/data/gw_data")
    cache:
        True
    input:
        "src/data/optimal_snr_aplus_design_O5.h5"
    script:
        "src/scripts/data_generation.py"

rule nonparinference:
    output:
        "src/data/mcmc_nonparametric.nc4"
    cache:
        True
    input:
        "src/data/gw_data"
    script:
        "src/scripts/nonparametric_inference.py"

rule nonparplots:
    output:
        "src/tex/figures/O5_GP.pdf"
    input:
        "src/data/mcmc_nonparametric.nc4"
    script:
        "src/scripts/nonparametric_twopanel.py"

When I run showyourwork build after showyourwork clean, I get:

...
Running user rule datagen...
Searching remote file cache: src/data/gw_data...

which tells me that it is successfully reaching the Zenodo Sandbox entry. However, even after running this several times, I always get:

File not found on remote cache. See logs for details.
Running rule from scratch...
Caching output file on remote: src/data/gw_data...

This happens for both of my intermediate results. When I look at the sandbox entry that was created, it shows files named after the jobs I specified in the Snakefile, not after the outputs of those jobs (see the screenshot below).
[screenshot of the Zenodo Sandbox entry]

The showyourwork.log says this:

Attempting to access Zenodo Sandbox deposit with DOI 10.5072/zenodo.23379...
Searching for file `datagen` with hash `b324a29d7f46bbb604af71a59d0cee465693186cc3c2c674883ee525ad13750e`...
File not found on remote cache. See logs for details.
list indices must be integers or slices, not str
Running rule from scratch...
Caching output file on remote: src/data/gw_data...

I'm confused, since it says it's looking for datagen (the name of the job, and the name of the entry on the sandbox) but it does not find it, despite the sandbox entry having that name.
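One observation on the log: "list indices must be integers or slices, not str" is the standard Python TypeError message raised when a list is indexed with a string key. The sketch below only illustrates how that message can arise, assuming (and this is a guess, not something I have confirmed in the SYW source) that a lookup expects a dict-like response but receives a bare list. Every name and data shape here (find_file, find_file_tolerant, lookup_error, the "files" and "filename" keys) is hypothetical and not taken from SYW or the Zenodo API.

```python
# Hypothetical response shapes, for illustration only.
files_as_dict = {"files": [{"filename": "datagen"}]}
files_as_list = [{"filename": "datagen"}]


def find_file(response, name):
    """Naive lookup that assumes `response` is a dict with a "files" key."""
    for entry in response["files"]:
        if entry["filename"] == name:
            return entry
    return None


def find_file_tolerant(response, name):
    """Same lookup, but also accepts a bare list of file entries."""
    entries = response["files"] if isinstance(response, dict) else response
    return next((e for e in entries if e["filename"] == name), None)


def lookup_error(response, name):
    """Return the message of the TypeError raised by the naive lookup, if any."""
    try:
        find_file(response, name)
    except TypeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # The naive lookup fails on the list-shaped response.
    print(lookup_error(files_as_list, "datagen"))
    # The tolerant lookup finds the entry in either shape.
    print(find_file_tolerant(files_as_list, "datagen"))
```

If something like this is happening, the file could exist on the sandbox under the expected name and still be reported as not found, because the exception would be raised before any name comparison takes place.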

Potentially relevant info:

  • I am now using the GitHub (i.e. post-release) version of SYW to get the bugfix described in #406, so please let me know if there is a known issue with this version!
  • My repo is here: https://github.com/afarah18/spectral-sirens-with-GPs. The most expensive step (in nonparametric_inference.py) is heavily truncated so that it runs in a few minutes, since my caching is not working.
  • I also set up downloading from a static zenodo dataset, and that works just fine (after the fix in #406):
User is not authenticated to edit 10.5281/zenodo.8428643.
Downloading src/data/optimal_snr_aplus_design_O5.h5 from Zenodo...
########################################################## 100.0%

Thank you in advance for the help!