iterative / dvc

🦉 ML Experiments and Data Management with Git

Home Page: https://dvc.org

run-cache: cache stage runs with no dependencies?

skshetry opened this issue

stages:
  stage1:
    cmd: echo foo > foo
    outs:
    - foo

Say we have the above stage, with no dependencies and one output. When I run it and then rerun it, I get:

$ dvc repro
Running stage 'stage1':
> echo foo > foo
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

To track the changes with git, run:

        git add dvc.lock .gitignore

To enable auto staging, run:

        dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.
Stage 'stage1' didn't change, skipping

$ dvc repro
Stage 'stage1' didn't change, skipping
Data and pipelines are up to date.

But if the lock file is missing or the stage name has changed, it forces a rerun.
Ideally, the run-cache is supposed to prevent this scenario, but it does not work for a stage without any dependencies. Should it cache those kinds of stages?
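
For illustration, a minimal way to hit this with the stage1 example above (output omitted; behavior as described):

$ dvc repro          # creates dvc.lock and runs stage1
$ rm dvc.lock        # simulate a missing lock file (or rename the stage)
$ dvc repro          # stage1 is executed again instead of being restored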

cc @efiop

Related: https://iterativeai.slack.com/archives/C044738NACC/p1706207735608469

Related to #6718. Do we have a good reason not to add no-dep stages to the run-cache? With all the templating we have now, it seems more important to handle these cases since the templated values may take the place of actual dependencies.
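
To make the templating point concrete, here is a sketch of a no-deps stage where the "real" dependency only exists as an interpolated value (the params.yaml key data.url is made up for illustration):

stages:
  load_data:
    cmd: wget ${data.url} -O raw.csv
    outs:
    - raw.csv

Here data.url effectively plays the role of a dependency; if the run-cache covered no-deps stages, switching the value back to a previously-run one could restore raw.csv without re-downloading.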

> Related to #6718. Do we have a good reason not to add no-dep stages to the run-cache? With all the templating we have now, it seems more important to handle these cases since the templated values may take the place of actual dependencies.

These no-deps stages that get invalidated (due to a missing dvc.lock entry, or a name change caused by templating) are an edge case. It can be an annoyance for sure, but it is technically still valid to rerun them. In fact, we used to always run these kinds of stages until 2.0, where this was a breaking change (#5187).
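
As a sketch of how templating can change a stage name and invalidate the lock entry (the foreach values and download.sh script are made up):

stages:
  load_data:
    foreach:
    - 2023
    - 2024
    do:
      cmd: ./download.sh ${item}
      outs:
      - raw-${item}.csv

This generates stages named load_data@2023 and load_data@2024; adding or renaming an item produces a stage name with no dvc.lock entry, so it is rerun even if the same command has run before.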

> Do we have a good reason not to add no-dep stages to the run-cache?

There's a risk that run-cache will check out to a very old state. Since there are no dependencies to match, there can be many "valid" states from older runs.
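
To illustrate why: with no deps, the only thing left to match a run on is the command itself. A rough sketch of the dvc.lock entry for the stage1 example (field layout varies slightly between DVC versions; the md5 shown is just the hash of "foo\n"):

schema: '2.0'
stages:
  stage1:
    cmd: echo foo > foo
    outs:
    - path: foo
      md5: d3b07384d113edec49eaa6238ad5ff00
      size: 4

Any previous run with the same cmd would be an equally "valid" run-cache hit, including very old ones.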

No-deps stages are the first stages to run in the pipeline and are usually used to download from remotes. They act as a trigger for the downstream stages. Using a very old state might affect the whole pipeline.

> No-deps stages are the first stages to run in the pipeline and are usually used to download from remotes. They act as a trigger for the downstream stages. Using a very old state might affect the whole pipeline.

Sorry, I don't follow what type of stage you have in mind. Could you show an example?

I think we under-utilize the run-cache, and I have talked to a few people who intuitively expect the run-cache to always work since they think "I ran this before, so DVC should know not to run it again." We have an easy solution if users always want to run it, but no solution for people who want to use the run-cache here.
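
For reference, the opt-out for users who always want to rerun such a stage would presumably be always_changed (a sketch, using the stage1 example above):

stages:
  stage1:
    cmd: echo foo > foo
    outs:
    - foo
    always_changed: true

There is no equivalent for the opposite preference, i.e. "restore this from the run-cache instead of rerunning".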

> Sorry, I don't follow what type of stage you have in mind. Could you show an example?

stages:
  load_data:
    cmd: 
    - wget https://example.com/raw.csv
    outs:
    - raw.csv
 
  extract_data:
    cmd: python extract_data.py
    deps:
    - raw.csv
    outs:
    - train.csv
    - test.csv

  train:
    cmd: python train.py
    params:
    - train
    deps:
    - train.csv
    outs:
    - model.joblib

  evaluate:
    cmd: python evaluate.py
    params:
    - evaluate
    deps:
    - test.csv
    - train.csv
    - model.joblib
    metrics:
    - reference.json

flowchart TD
        node1["evaluate"]
        node2["extract_data"]
        node3["load_data"]
        node4["train"]
        node2-->node1
        node2-->node4
        node3-->node2
        node4-->node1

Why not include https://example.com/raw.csv in deps or use import-url? I don't see why this scenario should not have any deps.
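
For reference, the two alternatives would look roughly like this (a sketch; URL and filenames taken from the example above):

# Option 1: declare the URL as an external dependency of the existing stage
stages:
  load_data:
    cmd: wget https://example.com/raw.csv
    deps:
    - https://example.com/raw.csv
    outs:
    - raw.csv

# Option 2: skip the stage entirely and track the download with import-url
$ dvc import-url https://example.com/raw.csv raw.csv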

> Why not include https://example.com/raw.csv in deps or use import-url? I don't see why this scenario should not have any deps.

The pipeline is very simple and does the job. External deps and import-url are additional concepts to learn.

Besides, this example is "inspired" by a recent tutorial, but I have seen a lot of dvc.yaml files like this.

https://github.com/iterative/evidently-dvc/blob/f0ed5c0f526c9eaf2b5dde57d500abc08d063614/pipelines/train/dvc.yaml

You can find similar examples through GitHub search:

https://github.com/search?q=path%3A**%2Fdvc.yaml+cmd+wget+OR+curl&type=code

Thanks, looks like this is indeed common.

> There's a risk that run-cache will check out to a very old state. Since there are no dependencies to match, there can be many "valid" states from older runs.

Still not sure I agree with this concern, though. In the examples I see, the data is a static dataset that is expected to be downloaded only once, and enabling the run-cache makes it more likely that it actually is only run once.
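
For example, if no-deps stages were covered by the run-cache, a hypothetical flow like this would avoid re-downloading (--run-cache is an existing flag of dvc push/pull):

$ dvc push --run-cache        # share cached runs via the remote
# ...later, on another machine or after dvc.lock is lost or the stage renamed...
$ dvc pull --run-cache
$ dvc repro                   # load_data would be restored from the run-cache
                              # instead of re-running wget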