iterative / dvc

🦉 ML Experiments and Data Management with Git

Home Page: https://dvc.org

pull: "Fetching" step takes forever

zhf231298 opened this issue

Description

Since updating to version 3.45, dvc pull spends a very long time in the "Fetching" step.
I can't tell precisely what the cause is, but at the very least the md5 of a large file is recomputed on every dvc pull run, even though the output states that the computation is done only once.
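
To illustrate why the repeated hashing is so noticeable, here is a minimal sketch of what "computed only once" would usually imply: the digest is stored under a key derived from the file's path, modification time, and size, and the file is only rehashed when one of those changes. This is an assumption for illustration, not DVC's actual code; the names file_md5, cached_md5, and _md5_cache are hypothetical.

```python
import hashlib
import os
import tempfile

# Hypothetical in-memory cache: (abs path, mtime_ns, size) -> md5 hex digest.
# Not DVC's real state database; used only to illustrate the idea.
_md5_cache = {}


def file_md5(path, chunk_size=1024 * 1024):
    """Hash a file in chunks; the cost grows linearly with the file size."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_md5(path):
    """Return (digest, was_cached); rehash only if mtime or size changed."""
    st = os.stat(path)
    key = (os.path.abspath(path), st.st_mtime_ns, st.st_size)
    if key in _md5_cache:
        return _md5_cache[key], True
    digest = file_md5(path)
    _md5_cache[key] = digest
    return digest, False


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * (4 * 1024 * 1024))
    try:
        print(cached_md5(tmp.name))  # first call reads and hashes the file
        print(cached_md5(tmp.name))  # second call is served from the cache
    finally:
        os.unlink(tmp.name)
```

If a cache of this kind were being hit, a second dvc pull on an unchanged file should skip the hashing entirely, which is not what I observe.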

Reproduce

  1. dvc pull

Expected

The "Fetching" step should be very short, as it is on another device of mine that runs DVC 3.38.1.

Environment information

Problematic environment:

  • OS: macOS Sonoma 14.3
  • DVC: 3.45.0 (brew)
  • Remote storage: S3 bucket

Properly working environment:

  • OS: Ubuntu 22.04.3 LTS
  • DVC: 3.38.1 (pip)
  • Remote storage: S3 bucket (the same one as above)

Output of dvc doctor:

$ dvc doctor
DVC version: 3.45.0 (brew)
--------------------------
Platform: Python 3.12.2 on macOS-14.3-arm64-arm-64bit
Subprojects:
	dvc_data = 3.13.0
	dvc_objects = 5.0.0
	dvc_render = 1.0.1
	dvc_task = 0.3.0
	scmrepo = 3.1.0
Supports:
	azure (adlfs = 2024.2.0, knack = 0.11.0, azure-identity = 1.15.0),
	gdrive (pydrive2 = 1.19.0),
	gs (gcsfs = 2024.2.0),
	http (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.9.3, aiohttp-retry = 2.8.3),
	oss (ossfs = 2023.12.0),
	s3 (s3fs = 2024.2.0, boto3 = 1.34.34),
	ssh (sshfs = 2023.10.0),
	webdav (webdav4 = 0.9.8),
	webdavs (webdav4 = 0.9.8),
	webhdfs (fsspec = 2024.2.0)
Config:
	Global: /Users/zhf231298/Library/Application Support/dvc
	System: /opt/homebrew/share/dvc

Could you also share dvc config -l?

Sure, the output of dvc config -l is:

remote.s3-bucket.url=s3://bucket-name
remote.s3-bucket.version_aware=true
core.autostage=true
core.remote=s3-bucket

The bucket name here has been replaced with a dummy name.

Confirmed this is slow for version-aware remotes, although it seems like cache remotes are not impacted.