API for updating hash and size in dvc.lock file due to changes that'd have no effect on dvc DAG
adamliter opened this issue · comments
Something I occasionally find myself doing is manually updating my `dvc.lock` file if I've made a change to a file that is a dependency of some stage but the change I made wouldn't actually impact the outputs of any of the DVC stages. When this happens, I usually manually update the MD5 hash and size of the file in the `dvc.lock` file and then make a commit explaining what I did, and why it wouldn't actually result in a change to any of the stages.
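For context, here is a minimal sketch of how the two values being edited by hand could be recomputed. It assumes the lock file stores the MD5 of the file's raw bytes plus its size in bytes; some DVC versions may normalize line endings of text files before hashing, in which case the digest would differ. The helper name `file_md5_and_size` is made up for illustration and is not part of DVC.

```python
import hashlib
import os
import tempfile


def file_md5_and_size(path, chunk_size=1024 * 1024):
    """Return (md5 hex digest, size in bytes) for the raw bytes of a file.

    Hypothetical helper, not a DVC API. Assumes no line-ending
    normalization is applied before hashing.
    """
    digest = hashlib.md5()
    size = 0
    with open(path, "rb") as fh:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


if __name__ == "__main__":
    # Demo on a throwaway file standing in for a tracked dependency.
    with tempfile.NamedTemporaryFile("wb", suffix=".py", delete=False) as tmp:
        tmp.write(b"print('training...')\n")
    try:
        md5, size = file_md5_and_size(tmp.name)
        print(f"md5: {md5}")
        print(f"size: {size}")
    finally:
        os.remove(tmp.name)
```

The printed values are the ones that would be pasted into the matching entry by hand, which is the tedious and error-prone step this request is about.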
Some example scenarios:
- Adding a logger to a training script that periodically outputs training loss (this means needing to modify `train.py`, which we'd definitely want to be in the `deps` section for the DVC stage called `train`, but just adding some logging shouldn't invalidate the current outputs).
- Needing to migrate from one database schema to another because of database migrations/concerns outside of our control, but the new database has the exact same data. Updating a SQL query to use the new schema will make our DVC `fetch_data` stage go out of date since this `.sql` file will be in the `deps` section for that stage, which would, in all other circumstances, be a good thing. But this migration causes no change to the data and is outside of our control.
- etc.
Those are just a few scenarios where I've found myself manually updating a `dvc.lock` file. Even though I'd generally want changes to the files I'm tracking as `deps` of certain stages to result in this behavior, there are some cases where I know the changes I've made to a `deps`-tracked file should effectively be no-op changes. The feature request is to expose some API for updating the `dvc.lock` in such cases instead of having to do it manually.
Do you think this is something you'd be open to? Maybe there are some pitfalls I'm not seeing or thinking through. If you're open to it, I'm not sure what a good name for the API would be. Maybe something like `dvc update-lock` that takes a filename as an argument and then replaces the hash and size for that file with its new hash and size anywhere it is found in the `dvc.lock` file ... ?
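To make the proposal concrete, here is a rough sketch of the replacement logic, operating on an already-parsed lock file. It assumes the parsed structure is a mapping with a top-level `stages` key whose stages carry `deps` and `outs` lists of entries with `path`, `md5`, and `size` fields; the real schema may differ between DVC versions. The function name `refresh_entry` and the sample values are hypothetical, and this is not how DVC itself does it.

```python
def refresh_entry(lock, path, md5, size):
    """Replace md5 and size for `path` wherever it appears in `lock`.

    `lock` is assumed to be the parsed content of dvc.lock (a dict).
    Returns the number of entries that were updated in place.
    """
    updated = 0
    for stage in lock.get("stages", {}).values():
        for section in ("deps", "outs"):
            for entry in stage.get(section, []):
                if entry.get("path") == path:
                    entry["md5"] = md5
                    entry["size"] = size
                    updated += 1
    return updated


if __name__ == "__main__":
    # Made-up lock content for illustration only.
    lock = {
        "stages": {
            "train": {
                "cmd": "python train.py",
                "deps": [{"path": "train.py", "md5": "old", "size": 10}],
                "outs": [{"path": "model.pkl", "md5": "abc", "size": 99}],
            }
        }
    }
    count = refresh_entry(lock, "train.py", "new", 42)
    print(count, lock["stages"]["train"]["deps"][0])
```

Only entries whose `path` matches are touched, so every other stage and output keeps its recorded hash.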
Have you tried `dvc commit`? It force-synchronizes your workspace to `dvc.lock` without actually running the stages.
@skshetry Ah, thank you! I misunderstood the help text of `dvc commit`. It says "Record changes [...] by storing the current versions in the cache." So I assumed this was for only updating things in `.dvc/cache` and didn't even think to try it.
But you're right, this does exactly what I want. Thank you.
Edit: It looks like the documentation on the website is much clearer about the use cases, but I did not check there.
The help message needs to be updated. Looking at the `git blame`, it hasn't been updated for four years. 😄
Lines 31 to 34 in cf9855a
@skshetry Do you think I should reopen this as a request to update the help message for `dvc commit`?