iterative / dvc

🦉 ML Experiments and Data Management with Git

Home Page: https://dvc.org

push/pull: Possible bug with individual file push and pull to Google Drive remote storage

d33bs opened this issue · comments

Bug Report

Thank you for the fantastic work with DVC! I noticed an odd behavior while working with individual files and trying to push + pull these in a reproducible way. I felt this was strange enough to warrant a bug-type issue, but I could also see how I might be "doing it wrong" and need guidance about this. If it's the case that I'm doing things entirely incorrectly, I feel that documentation updates may be warranted to help others who think about things in a similar way avoid this issue in the future.

Description

I usually use the data versioning guide as a reference when getting going with DVC projects. That guide covers adding individual files and pushing them to a remote.

When I add multiple individual files using DVC, push them to a remote, then attempt to pull those same pushed files, I notice that I don't get back what I think I should. I created the following example repo in an effort to reproduce the issue, thinking that it might also have something to do with DVC package or dependency versions (I don't recall this behavior in earlier versions, but I'm not certain when it might have begun or if that's truly related here). It seemed like .gitignore files sometimes changed the behavior here in terms of what was "seen" or not, but they didn't seem to affect what was pushed or pulled.

As a workaround to the individual file addition inconsistencies, I've found that adding the files as a directory seemed to work well. Adding a directory didn't appear in the data versioning guide as a preferred way of doing things, so it took some time to figure out what was happening. If the docs are updated, I'd suggest providing strongly worded suggestions about DVC preferences on single-file vs directory additions (which is the better pattern to follow, or, if there's no pattern / preference, stating that openly).

Please see the following link for the example code: https://github.com/d33bs/demo-dvc-possible-push-bug
Note: use any Google Drive folder ID you have access to within the config (I don't wish to share my own in this case to avoid security challenges which may be associated with this).

Reproduce

  1. Clone repo at: https://github.com/d33bs/demo-dvc-possible-push-bug
  2. Update the DVC config with the ID of a Google Drive folder that is accessible to a Google account you own
  3. Run poetry install
  4. Run poetry run poe dvc_possible_bug
  5. View results

Expected

I'd expect that individual files or directories act similarly when added, pushed, and pulled using DVC.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 3.43.1 (pip)
-------------------------
Platform: Python 3.11.2 on macOS
Subprojects:
	dvc_data = 3.9.0
	dvc_objects = 3.0.6
	dvc_render = 1.0.1
	dvc_task = 0.3.0
	scmrepo = 2.1.1
Supports:
	gdrive (pydrive2 = 1.19.0),
	http (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.9.3, aiohttp-retry = 2.8.3)
Config:
	Global: /Users/username/Library/Application Support/dvc
	System: /Library/Application Support/dvc
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: gdrive

Additional Information (if any):

Thanks a ton for any help you may be able to provide, including suggestions towards best practices or errors in my approach!

@d33bs I don't see a dvc push command in the script at all. Is that expected? Also, the dvc add data/data_sub_dir/zen.zip command is duplicated.

In the remove data script you are also removing the .dvc file, which means dvc pull can't bring the data back. The data dir itself is not controlled by DVC, so what is the expected behavior in this case for you?
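To illustrate the point about the .dvc file, here is a minimal sketch of a round trip that keeps the .dvc file in place while deleting only the data and the local cache. It is not the script from the demo repo: the file name data/example.txt, the remote name localremote, and the use of a local directory instead of Google Drive are all assumptions made so the sketch can run offline. It skips itself if dvc or git is not installed.

```shell
#!/bin/sh
# Sketch only: local-directory remote stands in for the Google Drive remote.
set -eu

if ! command -v dvc >/dev/null 2>&1 || ! command -v git >/dev/null 2>&1; then
    result="skipped"
    echo "dvc or git not found; skipping the round trip"
else
    work=$(mktemp -d)      # throwaway project directory
    remote=$(mktemp -d)    # throwaway directory used as the DVC remote
    cd "$work"

    git init -q .
    dvc init -q

    # Hypothetical data file, tracked individually.
    mkdir -p data
    echo "hello" > data/example.txt
    dvc add -q data/example.txt          # writes data/example.txt.dvc

    dvc remote add -d localremote "$remote"
    dvc push -q

    # Remove the data and the local cache, but keep data/example.txt.dvc.
    rm data/example.txt
    rm -rf .dvc/cache

    # dvc pull reads the remaining .dvc file to know what to fetch.
    dvc pull -q

    if [ -f data/example.txt ]; then
        result="restored"
    else
        result="missing"
    fi
    echo "data/example.txt: $result"
fi
```

If data/example.txt.dvc were deleted along with the data, DVC would have no record of the file left in the workspace, which is the situation described above.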

Thank you @shcheklein for the kind feedback and apologies for the earlier bugs. I've updated the repo just now based on your questions + comments. Despite this, I still seem unable to pull the files when they are added individually. Is there something else I'm possibly doing wrong? Thanks again for any guidance you can offer.

@d33bs good change.

I think, now you could also remove:

-/data
-!/data/*.dvc
-!/data/*/*.dvc

from the .gitignore. DVC takes care of that automatically, and it seems these tricky conditions are causing some trouble (not sure why tbh, but it becomes less important to solve).

Could you give it a try, please?
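One way to see what git itself makes of those three rules is to ask git check-ignore in a scratch repository. The sketch below is an illustration, not a diagnosis: the .dvc file names are hypothetical, and whether this is what actually confused DVC here is an assumption. The gitignore documentation does state that a file cannot be re-included if a parent directory of that file is excluded, so I would expect the /data rule to win over the two negated rules.

```shell
#!/bin/sh
# Sketch: check how git treats .dvc files under the rules quoted in this thread.
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q .

# The three rules that were suggested for removal.
printf '%s\n' '/data' '!/data/*.dvc' '!/data/*/*.dvc' > .gitignore

# Hypothetical .dvc files mirroring the layout mentioned in the thread.
mkdir -p data/data_sub_dir
touch data/example.txt.dvc data/data_sub_dir/zen.zip.dvc

# git check-ignore -q exits 0 when a path is ignored and 1 when it is not.
report=""
for path in data/example.txt.dvc data/data_sub_dir/zen.zip.dvc; do
    if git check-ignore -q "$path"; then
        status="ignored"
    else
        status="not ignored"
    fi
    report="${report}${path}: ${status}
"
done

printf '%s' "$report"
```

Running the same loop after deleting the three rules gives a quick before/after comparison for any project layout.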

Thanks @shcheklein ! This seems to have allowed DVC to perform the correct actions! I've updated the repo accordingly.

Some follow-up questions / thoughts:

  • Does DVC impose the requirement of using nested .gitignore files? This is generally a pattern I personally avoid to help reduce the number of files for a project and provide a single place to look for .gitignore rules (generally the root of the project).
  • If there is a requirement that DVC uses nested .gitignore files, could I suggest this be made more prominent in the guidance documentation (for example, in the data versioning guide I linked with the issue)?
  • If this isn't a requirement, is it possible that there's a bug in the way DVC reads the rules you mentioned I should remove?

Thank you again for your continued help with this!

hey, sure.

I would check this response by @pmrowla on how to use a single .gitignore and what it should look like in your case.