mlfoundations / datacomp

DataComp: In search of the next generation of multimodal datasets

Home Page: http://datacomp.ai/

Metadata download error - OSError: Consistency check failed

ch-shin opened this issue

Hi, team!

I am trying to download the medium-scale dataset of the filtering track, but I keep failing with the following error.

OSError: Consistency check failed: file should be of size 122218957 but has size 56690589 ((…)f11adbfc933c.parquet).
We are sorry for the inconvenience. Please retry download and pass `force_download=True, resume_download=False` as argument.
If the issue persists, please let us know by opening an issue on https://github.com/huggingface/huggingface_hub.

It seems related to this issue huggingface/huggingface_hub#1498
Is there any way to bypass huggingface_hub and download the metadata directly?
Thanks.

Hi @ch-shin, do you also get this error with `force_download=True, resume_download=False`? The issue you linked suggests this could also be caused by running out of storage; do you have enough free space? Alternatively, have you tried using snapshot_download?

  • It looks like snapshot_download is used by default in download_upstream.py, right?
  • I got the same error with `force_download=True, resume_download=False` passed to snapshot_download.
  • Yes, there is enough storage.

Oh, I see the force_download flag was fixed very recently (huggingface/huggingface_hub#1549 (comment)). I will check it out and let you know how it goes 😇.

Hi @ch-shin, sorry you're experiencing this issue. Maintainer of huggingface_hub here. Which version of huggingface_hub are you using? If the error is still happening, it would be good to update to the latest release (0.16.4) and retry. To be honest, we are actively tracking this issue down, but we don't have a reliable way to trigger it, which makes it very hard to debug (I have personally never experienced it, even after many attempts 😕).

@Wauplin Hi! Thank you for the follow-up on this. I updated to 0.17.0.dev0 and still got the same error. And if I pass `force_download=True, resume_download=False`, I get the following error.

raise ValueError(
    "We have no connection or you passed local_files_only, so force_download is not an accepted option."
)

from https://github.com/huggingface/huggingface_hub/blob/2940a65b22e9552b0dd40f0b61f502f66896d46d/src/huggingface_hub/file_download.py#L1253
I guess it happens when network bandwidth is insufficient while downloading big files and the etag gets lost (but the download somehow proceeds through some exception handling, and the consistency check then fails later? I don't know 😇).
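If that theory is right, the failure mode is easy to mimic locally: pretend the server advertises N bytes, deliver fewer, and run the same size comparison the library performs afterwards. A toy simulation with arbitrary small numbers (not the real shard sizes):

```python
import os
import tempfile

advertised = 1000            # size the server claims (Content-Length)
payload = b"x" * advertised  # the full file

# Simulate a connection dropping mid-stream: only part of the payload lands on disk.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(payload[:400])
    partial = f.name

actual = os.path.getsize(partial)
os.remove(partial)

# The post-download size comparison then fails in the same way as this issue.
if actual != advertised:
    print(f"Consistency check failed: file should be of size {advertised} but has size {actual}")
```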

@ch-shin Thanks for your feedback. Would you have time for another test? If possible, could you install huggingface_hub from this PR (huggingface/huggingface_hub#1561)? It will not solve the error, but the stacktrace will be more detailed.

To install it:

pip install git+https://github.com/jiamings/huggingface_hub@main

Then retry your failing script (by the way, which file from which repo are you downloading?) and copy-paste the full error stacktrace printed in your terminal, both with and without force_download. Thanks a lot in advance!

@Wauplin Sorry that I missed your comment 😓. Actually, I just upgraded my internet connection (25 Mbps → 500 Mbps) and the problem is gone.

We are also experiencing this bug at our company, with huggingface_hub 0.16.4 installed.