huggingface / naacl_transfer_learning_tutorial

Repository of code for the tutorial on Transfer Learning in NLP held at NAACL 2019 in Minneapolis, MN, USA

Label handling commit breaks the imdb finetuning script

prrao87 opened this issue

Thomas, thanks for sharing this code! Commit 8d9c237 seems to have broken the default behavior of the classification finetuning scripts. In the previous version, the imdb and trec dictionaries each had a key called 'labels', but this line in finetuning_train.py still references the now-deleted key.

I updated the line to use DATASETS_LABELS_URL['imdb']['test'] as intended, but the S3 bucket then appears not to contain the IMDB test labels file.

See below:

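# Assumed import: cached_path as exposed by pytorch_pretrained_bert
from pytorch_pretrained_bert import cached_path

file_path = "https://s3.amazonaws.com/datasets.huggingface.co/imdb/test.labels.txt"
label_file = cached_path(file_path)  # downloads the URL and returns the local cache path
with open(label_file, "r", encoding="utf-8") as f:
    all_lines = f.readlines()
# Show the first few lines of whatever was actually downloaded
print(all_lines[:5])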

Gives:

['<?xml version="1.0" encoding="UTF-8"?>\n', '<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>imdb/test.labels.txt</Key><RequestId>3D9E7C511167A0FB</RequestId><HostId>RiidOcrHfFaqxW9tmUXRppE/G3lsYoCZcq+uaYDi2yPPoe8mv/Og6PMuUncwk+B53tGsvcCZMWk=</HostId></Error>']

Does the test file for IMDB still exist with this name? This doesn't seem to be an issue with TREC.
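
In case it helps with debugging, here is a minimal sketch of a check that could catch this earlier, by testing whether the downloaded text is an S3 XML error document rather than a label file. The helper name `s3_error_code` is hypothetical and not part of the repository; it relies only on the error format shown in the output above.

```python
import xml.etree.ElementTree as ET


def s3_error_code(text):
    """Return the S3 error code (e.g. 'NoSuchKey') if `text` is an S3 XML
    error document, otherwise None. Hypothetical helper, not from the repo."""
    stripped = text.lstrip()
    # A plain label file is not expected to start with an XML tag.
    if not stripped.startswith("<"):
        return None
    try:
        # Parse from bytes so the UTF-8 encoding declaration is honored.
        root = ET.fromstring(stripped.encode("utf-8"))
    except ET.ParseError:
        return None
    # S3 error responses use <Error> as the root element.
    if root.tag != "Error":
        return None
    return root.findtext("Code")
```

Calling `s3_error_code("".join(all_lines))` right after reading the cached file would let the script fail with a clear message instead of treating the XML as labels.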