MD5 Checksum failure

Question

JonathanJao opened this issue 2 years ago

Hi, so after resolving the encoding issue, I'm still getting a few errors with the most recent code on the following datasets:

Please try the following tasks later by running individual files: ['multi_news.py', 'reddit_tifu.py', 'search_qa.py', 'amazon_polarity.py', 'spider.py', 'jeopardy.py', 'gigaword.py', 'wiki_auto.py', 'wiki_bio.py', 'yahoo_answers_topics.py', 'yelp_review_full.py', 'dbpedia_14.py', 'definite_pronoun_resolution.py', 'kilt_wow.py']

When I try to run a few of them, they output the following:

(crossfit) > python multi_news.py
Using custom data configuration default
Downloading and preparing dataset multi_news/default (download: 245.06 MiB, generated: 667.74 MiB, post-processed: Unknown size, total: 912.80 MiB) to /home/ABCD/.cache/huggingface/datasets/multi_news/default/1.0.0/465b14e19b4d6a55c9bb9131ca1de642175872143c9b231bee1dce789311b449...
Traceback (most recent call last):
  File "multi_news.py", line 32, in <module>
    main()
  File "multi_news.py", line 29, in main
    train, dev, test = dataset.generate_k_shot_data(k=32, seed=seed, path="../data/")
  File "/scratch/ABCD/CrossFit/tasks/fewshot_gym_dataset.py", line 79, in generate_k_shot_data
    dataset = self.load_dataset()
  File "multi_news.py", line 23, in load_dataset
    return datasets.load_dataset('multi_news')
  File "/ext3/miniconda3/envs/crossfit/lib/python3.6/site-packages/datasets/load.py", line 746, in load_dataset
    use_auth_token=use_auth_token,
  File "/ext3/miniconda3/envs/crossfit/lib/python3.6/site-packages/datasets/builder.py", line 579, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/ext3/miniconda3/envs/crossfit/lib/python3.6/site-packages/datasets/builder.py", line 639, in _download_and_prepare
    self.info.download_checksums, dl_manager.get_recorded_sizes_checksums(), "dataset source files"
  File "/ext3/miniconda3/envs/crossfit/lib/python3.6/site-packages/datasets/utils/info_utils.py", line 39, in verify_checksums
    raise NonMatchingChecksumError(error_msg + str(bad_urls))
datasets.utils.info_utils.NonMatchingChecksumError: Checksums didn't match for dataset source files:
['https://drive.google.com/uc?export=download&id=1vRY2wM6rlOZrf9exGTm5pXj5ExlVwJ0C']

Running a curl on the URL yields:

<html lang=en><meta charset=utf-8><meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width"><title>Error 400 (Bad Request)!!1</title><style nonce="SpqF3pAZ+9nngUOG9GU6Gg">*{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{color:#222;text-align:unset;margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px;}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}pre{white-space:pre-wrap;}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}</style><main id="af-error-container" role="main"><a href=//www.google.com><span id=logo aria-label=Google role=img></span></a><p><b>400.</b> <ins>That’s an error.</ins><p>The server cannot process the request because it is malformed. It should not be retried. <ins>That’s all we know.</ins></main>

The README says that Google Drive has a daily download quota, but this error message suggests that something else may be going on.
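One way to confirm what is happening is to check whether the bytes fetched from the URL are an HTML error page (like the curl output above) rather than the dataset archive, since hashing an error page can never match the recorded checksum. The following is only a minimal diagnostic sketch: the helper names `looks_like_html` and `sha256_hex` are hypothetical, and the use of SHA-256 is an assumption for illustration, not a statement about how the library computes its checksums.

```python
import hashlib


def looks_like_html(payload: bytes) -> bool:
    """Heuristic check: does the payload begin like an HTML page?

    Hypothetical helper for illustration; it only inspects the first bytes.
    """
    head = payload[:256].lstrip().lower()
    return head.startswith(b"<html") or head.startswith(b"<!doctype html")


def sha256_hex(payload: bytes) -> str:
    """Hex digest of the payload (SHA-256 is assumed here for illustration)."""
    return hashlib.sha256(payload).hexdigest()


# Example with the first bytes of the curl response shown above.
error_page = b'<html lang=en><meta charset=utf-8><title>Error 400 (Bad Request)!!1</title>'
zip_header = b"PK\x03\x04" + b"\x00" * 16  # typical start of a zip archive

print(looks_like_html(error_page))  # an error page, not dataset content
print(looks_like_html(zip_header))  # binary archive data
```

If the payload turns out to be HTML, the mismatch comes from the download itself and not from a changed dataset file.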

Qinyuan Ye · Answer 1 · Fri Apr 29 2022 09:24:30 GMT+0800 (China Standard Time)

Hi @JonathanJao

Thanks for raising this issue!

For these tasks, please try passing ignore_verifications=True to the load_dataset function, e.g., dataset = load_dataset("kilt_tasks", "wow", ignore_verifications=True). This skips the checksum verification step during dataset loading.
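As a rough sketch of how this could be applied in a task file: newer releases of the datasets library replace ignore_verifications with a verification_mode argument, so the helper below picks whichever keyword the installed load_dataset accepts. The helper name `skip_checksum_kwargs` is hypothetical and not part of CrossFit or datasets; treat this as an illustration under that assumption, not as the repository's method.

```python
import inspect


def skip_checksum_kwargs(loader):
    """Return keyword arguments asking `loader` to skip checksum verification.

    Hypothetical helper: it inspects the loader's signature and prefers the
    newer `verification_mode` keyword when the loader declares it.
    """
    params = inspect.signature(loader).parameters
    if "verification_mode" in params:
        return {"verification_mode": "no_checks"}
    return {"ignore_verifications": True}


# Intended usage inside a task file (requires the `datasets` package and
# network access, so it is left commented out here):
#
#   import datasets
#   kwargs = skip_checksum_kwargs(datasets.load_dataset)
#   dataset = datasets.load_dataset("multi_news", **kwargs)
```

Skipping verification means a corrupted or truncated download will not be detected at load time, so it is worth inspecting a few examples after loading.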

We suspect that some dataset owners have updated their files, which makes the checksums recorded in huggingface datasets outdated. (See #2) Unfortunately, we don't have control over this. You may get data samples that differ slightly from those used in our original paper, but we expect the impact to be small.

Let me know if any tasks are still not working or are missing.

-Qinyuan