Label handling commit breaks the imdb finetuning script
prrao87 opened this issue
Thomas, thanks for sharing this code! Commit 8d9c237 appears to have broken the classification finetuning scripts in their default configuration. The previous version had a 'labels' key in the imdb and trec dictionaries, but this line in finetuning_train.py still references that now-deleted key.
I updated the line to use DATASETS_LABELS_URL['imdb']['test'], as intended, but the S3 bucket doesn't appear to have the IMDB test file.
See below:
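# Assumption: cached_path is the helper exported by pytorch_pretrained_bert.
from pytorch_pretrained_bert import cached_path

file_path = "https://s3.amazonaws.com/datasets.huggingface.co/imdb/test.labels.txt"
label_file = cached_path(file_path)  # download the file, or reuse the cached copy
with open(label_file, "r", encoding="utf-8") as f:
    all_lines = f.readlines()
print(all_lines[:5])  # peek at the first few lines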
Gives:
['<?xml version="1.0" encoding="UTF-8"?>\n', '<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>imdb/test.labels.txt</Key><RequestId>3D9E7C511167A0FB</RequestId><HostId>RiidOcrHfFaqxW9tmUXRppE/G3lsYoCZcq+uaYDi2yPPoe8mv/Og6PMuUncwk+B53tGsvcCZMWk=</HostId></Error>']
Does the test file for IMDB still exist with this name? This doesn't seem to be an issue with TREC.
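
In case it helps, here is a minimal sketch of a guard that would catch this earlier. It assumes the downloaded file may contain an S3 XML error body, like the output above, instead of labels. The helper name `s3_error_code` is hypothetical and not part of the repo.

```python
import xml.etree.ElementTree as ET


def s3_error_code(text):
    """Return the S3 error code if `text` is an S3 XML error body, else None.

    Hypothetical helper for illustration; not part of the tutorial repo.
    """
    try:
        # strip() matters: the XML declaration must be the very first characters.
        root = ET.fromstring(text.strip())
    except ET.ParseError:
        # Ordinary label files are not XML, so they end up here.
        return None
    if root.tag != "Error":
        return None
    return root.findtext("Code")
```

With the snippet above, `s3_error_code("".join(all_lines))` could be checked right after reading the file. Any non-None result means the download was an error page rather than a label file.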