Hashcode for test fold of Multi30k corrupt

Question

fmohr opened this issue a year ago

🐛 Bug

Bug Description
When loading the test data via

from torchtext.datasets import Multi30k
multi_datapipe = Multi30k(split="test")

I get the following error (it only occurs on the test split). It seems that the hash currently associated with the tar file does not match the hash of the actual tar file on the server.

RuntimeError: The computed hash 0681be16a532912288a91ddd573594fbdd57c0fbb81486eff7c55247e35326c2 of ~/.cache/torch/text/datasets/Multi30k/mmt16_task1_test.tar.gz does not match the expectedhash 6d1ca1dba99e2c5dd54cae1226ff11c2551e6ce63527ebb072a1f70f72a5cd36. Delete the file manually and retry.

Needless to say, I deleted the file manually (in fact, it was deleted automatically by a script).

Expected Behavior
I would expect this to work just as it does for split="train" or split="valid".

Environment
torchtext version is 0.14.1 (the environment collection script linked in the template returns a 404).

Nayef Ahmed · Answer 1 · Wed Apr 19 2023 04:38:39 GMT+0800 (China Standard Time)

Hey @fmohr. We updated the expected hash of the file, along with the location the file is downloaded from, in #2003. So the behavior you are seeing is actually correct, since you had an outdated copy of the file in your cache. The expected resolution would be to delete the cached file manually?
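To confirm whether a cached archive is stale, the hash comparison can be reproduced by hand. The following is a minimal sketch, assuming the hashes are SHA-256 digests (the 64-character hex strings in the error message are consistent with that). The helper names `sha256_of_file` and `remove_if_stale` are hypothetical and are not part of torchtext.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def remove_if_stale(path, expected_hash):
    """Delete the file if its hash differs from expected_hash.

    Returns True if the file was deleted, False if it was missing
    or already matched the expected hash.
    """
    path = Path(path).expanduser()
    if not path.exists():
        return False
    if sha256_of_file(path) == expected_hash:
        return False
    path.unlink()
    return True


# Example call, using the path and expected hash from the error message:
# remove_if_stale(
#     "~/.cache/torch/text/datasets/Multi30k/mmt16_task1_test.tar.gz",
#     "6d1ca1dba99e2c5dd54cae1226ff11c2551e6ce63527ebb072a1f70f72a5cd36",
# )
```

If `remove_if_stale` returns True, the next call to Multi30k(split="test") should download the archive again. If the error persists after a fresh download, the expected hash in the installed torchtext version probably does not match the file currently on the server, and upgrading to a release that includes #2003 may be needed.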