run-cache: cache stage runs with no dependencies?
skshetry opened this issue
```yaml
stages:
  stage1:
    cmd: echo foo > foo
    outs:
      - foo
```
Say we have the above stage, with no dependencies and one output. When I run it and then rerun it, it says:
```console
$ dvc repro
Running stage 'stage1':
> echo foo > foo
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

To track the changes with git, run:
        git add dvc.lock .gitignore

To enable auto staging, run:
        dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.

$ dvc repro
Stage 'stage1' didn't change, skipping
Data and pipelines are up to date.
```
But if the lock file is missing or the stage name has changed, it will force a rerun.
Ideally, run-cache is supposed to prevent this scenario, but it does not work for a stage without any dependencies. Should it cache those kinds of stages?
cc @efiop
Related: https://iterativeai.slack.com/archives/C044738NACC/p1706207735608469
Related to #6718. Do we have a good reason not to add no-dep stages to the run-cache? With all the templating we have now, it seems more important to handle these cases since the templated values may take the place of actual dependencies.
These no-deps stages that get invalidated (due to a missing dvc.lock entry, or a name change caused by templating) are an edge case. It can be an annoyance for sure, but it is technically still valid to run them. In fact, we used to always run these kinds of stages until 2.0, where this was a breaking change (#5187).
> Do we have a good reason not to add no-dep stages to the run-cache?
There's a risk that run-cache will check out to a very old state. Since there are no dependencies to match, there can be many "valid" states from older runs.
No-deps stages are the first stages to run in the pipeline and are usually used to download from remotes. They act as a trigger for the downstream stages. Using a very old state might affect the whole pipeline.
Sorry, I don't follow what type of stage you have in mind. Could you show an example?
I think we under-utilize the run-cache, and I have talked to a few people who intuitively expect the run-cache to always work since they think "I ran this before, so DVC should know not to run it again." We have an easy solution if users always want to run it, but no solution for people who want to use the run-cache here.
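(For reference, the "easy solution" for users who always want to run such a stage is presumably the `always_changed` flag in `dvc.yaml`; a minimal sketch, with an illustrative stage name:)

```yaml
stages:
  load_data:
    cmd: wget https://example.com/raw.csv
    # Opt in to rerunning this stage on every `dvc repro`,
    # even though it has no dependencies to invalidate it.
    always_changed: true
    outs:
      - raw.csv
```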
> Sorry, I don't follow what type of stage you have in mind. Could you show an example?
```yaml
stages:
  load_data:
    cmd:
      - wget https://example.com/raw.csv
    outs:
      - raw.csv
  extract_data:
    cmd: python extract_data.py
    deps:
      - raw.csv
    outs:
      - train.csv
      - test.csv
  train:
    cmd: python train.py
    params:
      - train
    deps:
      - train.csv
    outs:
      - model.joblib
  evaluate:
    cmd: python evaluate.py
    params:
      - evaluate
    deps:
      - test.csv
      - train.csv
      - model.joblib
    metrics:
      - reference.json
```
```mermaid
flowchart TD
    node1["evaluate"]
    node2["extract_data"]
    node3["load_data"]
    node4["train"]
    node2 --> node1
    node2 --> node4
    node3 --> node2
    node4 --> node1
```
Why not include https://example.com/raw.csv in `deps`, or use `import-url`? I don't see why this scenario should not have any deps.
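(For context, a sketch of what the deps-based alternative could look like, assuming DVC's support for HTTP URLs as external dependencies; the URL is the one from the example above:)

```yaml
stages:
  load_data:
    cmd: wget https://example.com/raw.csv
    deps:
      # External dependency: DVC tracks the remote file's state,
      # so the stage reruns only when the remote file changes.
      - https://example.com/raw.csv
    outs:
      - raw.csv
```

Alternatively, `dvc import-url https://example.com/raw.csv raw.csv` would replace the stage entirely and track the download as an import.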
> Why not include https://example.com/raw.csv in `deps`, or use `import-url`? I don't see why this scenario should not have any deps.
The pipeline is very simple and does the job. External deps and `import-url` are additional concepts to learn.
Besides, this example is "inspired" from a recent tutorial, but I have seen a lot of dvc.yaml like this.
You can find similar examples like this through GitHub Search:
https://github.com/search?q=path%3A**%2Fdvc.yaml+cmd+wget+OR+curl&type=code
Thanks, looks like this is indeed common.
> There's a risk that run-cache will check out to a very old state. Since there are no dependencies to match, there can be many "valid" states from older runs.
Still not sure I agree with this concern, though. In the examples I see, it looks like it's a static dataset and expected to only run once, and enabling the run-cache makes it more likely that it is only run once.