iterative / dvc

🦉 ML Experiments and Data Management with Git

Home Page: https://dvc.org

API for updating hash and size in dvc.lock file due to changes that'd have no effect on dvc DAG

adamliter opened this issue · comments

Something I occasionally find myself doing is manually updating my dvc.lock file if I've made a change to a file that is a dependency of some stage but the change I made wouldn't actually impact the outputs of any of the DVC stages. When this happens, I usually manually update the MD5 hash and size of the file in the dvc.lock file and then make a commit explaining what I did, and why it wouldn't actually result in a change to any of the stages.

Some example scenarios:

  • Adding a logger to a training script so that it periodically outputs the training loss (this means modifying train.py, which we'd definitely want in the deps section for the DVC stage called train, but just adding some logging shouldn't invalidate the current outputs).
  • Needing to migrate from one database schema to another because of database migrations/concerns outside of our control, where the new database has the exact same data. Updating a SQL query to use the new schema will make our DVC fetch_data stage go out of date, since this .sql file is in the deps section for that stage. In all other circumstances that would be a good thing, but this migration has no effect on the data and is outside of our control.
  • etc.

Those are just a few scenarios where I've found myself manually updating a dvc.lock file. Even though I'd generally want changes to the files I'm tracking as deps of certain stages to result in this behavior, there are some cases where I know the changes I've made to a deps-tracked file should effectively be no-op changes. The feature request is to expose some API for updating the dvc.lock in such cases instead of having to do it manually.

Do you think this is something you'd be open to? Maybe there are some pitfalls I'm not seeing or thinking through. If you're open to it, I'm not sure what a good name for the API would be. Maybe something like dvc update-lock that takes a filename as an argument and then replaces the hash and size for that file with its new hash and size anywhere it is found in the dvc.lock file ... ?
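To make the proposal concrete, here is a minimal Python sketch of what such a command might do internally. Everything in it is an assumption for illustration: update_lock_entries and file_md5_and_size are hypothetical names, not part of DVC, and the sketch assumes dvc.lock has already been parsed into a dict with a top-level stages mapping whose deps and outs entries carry path, md5, and size keys. DVC's own hashing may differ from a plain MD5 of the raw bytes in some versions, so treat this as a sketch of the manual edit described above rather than a drop-in replacement.

```python
import hashlib
import os


def file_md5_and_size(path):
    """Return (hex MD5, size in bytes) of the file's raw contents."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large files are not loaded at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)


def update_lock_entries(lock, path):
    """Refresh md5/size for every deps/outs entry whose path matches.

    `lock` is the already-parsed content of dvc.lock (assumed layout:
    {"stages": {name: {"deps": [...], "outs": [...]}}}).
    Returns the number of entries that were updated in place.
    """
    md5, size = file_md5_and_size(path)
    updated = 0
    for stage in lock.get("stages", {}).values():
        for section in ("deps", "outs"):
            for entry in stage.get(section, []):
                if entry.get("path") == path:
                    entry["md5"] = md5
                    entry["size"] = size
                    updated += 1
    return updated
```

Loading and re-serializing the YAML is deliberately left to the caller, so the sketch only covers the "replace the hash and size wherever the file appears" step.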

Have you tried dvc commit? It force synchronizes your workspace to dvc.lock without actually running the stages.

@skshetry Ah, thank you! I misunderstood the help text of dvc commit. It says "Record changes [...] by storing the current versions in the cache." So I assumed this was for only updating things in .dvc/cache and didn't even think to try it.

But you're right, this does exactly what I want. Thank you.

Edit: It looks like the documentation on the website is much clearer about the use cases, but I did not check there.

The help message needs to be updated. Looking at the git blame, it hasn't been updated for four years. 😄

COMMIT_HELP = (
    "Record changes to files or directories tracked by DVC"
    " by storing the current versions in the cache."
)

@skshetry Do you think I should reopen this as a request to update the help message for dvc commit?