Loading 'distilbert_6_768_12' is broken
craffel opened this issue · comments
Description
The example code at https://nlp.gluon.ai/model_zoo/bert/index.html for the DistilBERT model produces an exception at HEAD.
Error Message
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-1-fbb321631ad8> in <module>()
2 get_ipython().system('pip install mxnet')
3 import gluonnlp as nlp; import mxnet as mx;
----> 4 model, vocab = nlp.model.get_model('distilbert_6_768_12', dataset_name='distil_book_corpus_wiki_en_uncased')
/usr/local/lib/python3.7/dist-packages/gluonnlp/model/__init__.py in get_model(name, **kwargs)
154 'Model %s is not supported. Available options are\n\t%s'%(
155 name, '\n\t'.join(sorted(models.keys()))))
--> 156 return models[name](**kwargs)
/usr/local/lib/python3.7/dist-packages/gluonnlp/model/bert.py in distilbert_6_768_12(dataset_name, vocab, pretrained, ctx, output_attention, output_all_encodings, root, hparam_allow_override, **kwargs)
1311
1312 from ..vocab import Vocab # pylint: disable=import-outside-toplevel
-> 1313 bert_vocab = _load_vocab(dataset_name, vocab, root, cls=Vocab)
1314 # DistilBERT
1315 net = DistilBERTModel(encoder, len(bert_vocab),
/usr/local/lib/python3.7/dist-packages/gluonnlp/model/utils.py in _load_vocab(dataset_name, vocab, root, cls)
269 'Loading vocab based on dataset_name. '
270 'Input "vocab" argument will be ignored.')
--> 271 vocab = _load_pretrained_vocab(dataset_name, root, cls)
272 else:
273 assert vocab is not None, 'Must specify vocab if not loading from predefined datasets.'
/usr/local/lib/python3.7/dist-packages/gluonnlp/data/utils.py in _load_pretrained_vocab(name, root, cls)
387 Loaded vocabulary object and Tokenizer for the pre-trained model.
388 """
--> 389 file_name, file_ext, sha1_hash, special_tokens = _get_vocab_tokenizer_info(name, root)
390 file_path = os.path.join(root, file_name + file_ext)
391 if os.path.exists(file_path):
/usr/local/lib/python3.7/dist-packages/gluonnlp/data/utils.py in _get_vocab_tokenizer_info(name, root)
346 def _get_vocab_tokenizer_info(name, root):
347 file_name = '{name}-{short_hash}'.format(name=name,
--> 348 short_hash=short_hash(name))
349 root = os.path.expanduser(root)
350 sha1_hash, file_ext, special_tokens = _vocab_sha1[name]
/usr/local/lib/python3.7/dist-packages/gluonnlp/data/utils.py in short_hash(name)
340 raise ValueError('Vocabulary for {name} is not available. '
341 'Hosted vocabularies include: {vocabs}'.format(name=name,
--> 342 vocabs=vocabs))
343 return _vocab_sha1[name][0][:8]
344
ValueError: Vocabulary for distil_book_corpus_wiki_en_uncased is not available. Hosted vocabularies include: ['wikitext-2', 'gbw', 'WMT2014_src', 'WMT2014_tgt', 'book_corpus_wiki_en_cased', 'book_corpus_wiki_en_uncased', 'openwebtext_book_corpus_wiki_en_uncased', 'openwebtext_ccnews_stories_books_cased', 'wiki_multilingual_cased', 'distilbert_book_corpus_wiki_en_uncased', 'wiki_cn_cased', 'wiki_multilingual_uncased', 'scibert_scivocab_uncased', 'scibert_scivocab_cased', 'scibert_basevocab_uncased', 'scibert_basevocab_cased', 'biobert_v1.0_pmc_cased', 'biobert_v1.0_pubmed_cased', 'biobert_v1.0_pubmed_pmc_cased', 'biobert_v1.1_pubmed_cased', 'clinicalbert_uncased', 'baidu_ernie_uncased', 'openai_webtext', 'xlnet_126gb', 'kobert_news_wiki_ko_cased']
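The failure is simply a missing key in GluonNLP's hosted-vocabulary table. A minimal sketch of the check in gluonnlp/data/utils.py (the key names are copied from the error message above; the sha1 string is a placeholder, not the real hash) shows why the documented name raises while the distilbert_-prefixed name does not:

```python
# Sketch of gluonnlp.data.utils.short_hash. The sha1 values are
# placeholders; only the key names come from the error message above.
_vocab_sha1 = {
    'book_corpus_wiki_en_uncased': ('aabbccddeeff0011',),
    'distilbert_book_corpus_wiki_en_uncased': ('aabbccddeeff0011',),
}

def short_hash(name):
    # Reject any dataset name that has no hosted vocabulary.
    if name not in _vocab_sha1:
        raise ValueError(
            'Vocabulary for {name} is not available. '
            'Hosted vocabularies include: {vocabs}'.format(
                name=name, vocabs=sorted(_vocab_sha1)))
    return _vocab_sha1[name][0][:8]

# The name from the model-zoo docs is not a key, so it raises...
try:
    short_hash('distil_book_corpus_wiki_en_uncased')
except ValueError as err:
    print(err)

# ...while the hosted DistilBERT vocabulary name succeeds.
print(short_hash('distilbert_book_corpus_wiki_en_uncased'))
```

In the real library the short hash selects which vocabulary file to download, so the only fix on the caller's side is to pass a dataset name that is actually hosted.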
To Reproduce
Here is a colab: https://colab.research.google.com/drive/1PhShfNvXWQIzPbBiSZwo3uwfNzv2n0UJ?usp=sharing
It is as simple as:
!pip install gluonnlp
!pip install mxnet
import gluonnlp as nlp; import mxnet as mx;
model, vocab = nlp.model.get_model('distilbert_6_768_12', dataset_name='distil_book_corpus_wiki_en_uncased')
What have you tried to solve it?
- I tried other models; they worked.
- I tried replacing the dataset_name with book_corpus_wiki_en_uncased, which did not work either.
Environment
We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:
curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python
# paste outputs here
This script (https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py) does not exist; it returns a 404.
@craffel thanks for reporting. The above PRs should fix the problem. The correct dataset name is distilbert_book_corpus_wiki_en_uncased.
Thanks.