dmlc / gluon-nlp

NLP made easy

Home Page: https://nlp.gluon.ai/


Loading 'distilbert_6_768_12' is broken

craffel opened this issue

Description

The example code at https://nlp.gluon.ai/model_zoo/bert/index.html for the DistilBERT model produces an exception at HEAD.

Error Message

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-1-fbb321631ad8> in <module>()
      2 get_ipython().system('pip install mxnet')
      3 import gluonnlp as nlp; import mxnet as mx;
----> 4 model, vocab = nlp.model.get_model('distilbert_6_768_12', dataset_name='distil_book_corpus_wiki_en_uncased')

/usr/local/lib/python3.7/dist-packages/gluonnlp/model/__init__.py in get_model(name, **kwargs)
    154             'Model %s is not supported. Available options are\n\t%s'%(
    155                 name, '\n\t'.join(sorted(models.keys()))))
--> 156     return models[name](**kwargs)

/usr/local/lib/python3.7/dist-packages/gluonnlp/model/bert.py in distilbert_6_768_12(dataset_name, vocab, pretrained, ctx, output_attention, output_all_encodings, root, hparam_allow_override, **kwargs)
   1311 
   1312     from ..vocab import Vocab  # pylint: disable=import-outside-toplevel
-> 1313     bert_vocab = _load_vocab(dataset_name, vocab, root, cls=Vocab)
   1314     # DistilBERT
   1315     net = DistilBERTModel(encoder, len(bert_vocab),

/usr/local/lib/python3.7/dist-packages/gluonnlp/model/utils.py in _load_vocab(dataset_name, vocab, root, cls)
    269                           'Loading vocab based on dataset_name. '
    270                           'Input "vocab" argument will be ignored.')
--> 271         vocab = _load_pretrained_vocab(dataset_name, root, cls)
    272     else:
    273         assert vocab is not None, 'Must specify vocab if not loading from predefined datasets.'

/usr/local/lib/python3.7/dist-packages/gluonnlp/data/utils.py in _load_pretrained_vocab(name, root, cls)
    387         Loaded vocabulary object and Tokenizer for the pre-trained model.
    388     """
--> 389     file_name, file_ext, sha1_hash, special_tokens = _get_vocab_tokenizer_info(name, root)
    390     file_path = os.path.join(root, file_name + file_ext)
    391     if os.path.exists(file_path):

/usr/local/lib/python3.7/dist-packages/gluonnlp/data/utils.py in _get_vocab_tokenizer_info(name, root)
    346 def _get_vocab_tokenizer_info(name, root):
    347     file_name = '{name}-{short_hash}'.format(name=name,
--> 348                                              short_hash=short_hash(name))
    349     root = os.path.expanduser(root)
    350     sha1_hash, file_ext, special_tokens = _vocab_sha1[name]

/usr/local/lib/python3.7/dist-packages/gluonnlp/data/utils.py in short_hash(name)
    340         raise ValueError('Vocabulary for {name} is not available. '
    341                          'Hosted vocabularies include: {vocabs}'.format(name=name,
--> 342                                                                         vocabs=vocabs))
    343     return _vocab_sha1[name][0][:8]
    344 

ValueError: Vocabulary for distil_book_corpus_wiki_en_uncased is not available. Hosted vocabularies include: ['wikitext-2', 'gbw', 'WMT2014_src', 'WMT2014_tgt', 'book_corpus_wiki_en_cased', 'book_corpus_wiki_en_uncased', 'openwebtext_book_corpus_wiki_en_uncased', 'openwebtext_ccnews_stories_books_cased', 'wiki_multilingual_cased', 'distilbert_book_corpus_wiki_en_uncased', 'wiki_cn_cased', 'wiki_multilingual_uncased', 'scibert_scivocab_uncased', 'scibert_scivocab_cased', 'scibert_basevocab_uncased', 'scibert_basevocab_cased', 'biobert_v1.0_pmc_cased', 'biobert_v1.0_pubmed_cased', 'biobert_v1.0_pubmed_pmc_cased', 'biobert_v1.1_pubmed_cased', 'clinicalbert_uncased', 'baidu_ernie_uncased', 'openai_webtext', 'xlnet_126gb', 'kobert_news_wiki_ko_cased']

To Reproduce

Here is a colab: https://colab.research.google.com/drive/1PhShfNvXWQIzPbBiSZwo3uwfNzv2n0UJ?usp=sharing
It is as simple as:

!pip install gluonnlp
!pip install mxnet
import gluonnlp as nlp; import mxnet as mx;
model, vocab = nlp.model.get_model('distilbert_6_768_12', dataset_name='distil_book_corpus_wiki_en_uncased')


What have you tried to solve it?

  1. I tried other models; they worked.
  2. I tried replacing the dataset_name with book_corpus_wiki_en_uncased, which did not work either (see the sketch after this list for how the hosted vocabulary names can be checked).
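For reference, the valid names come from a private lookup table in gluonnlp/data/utils.py (the _vocab_sha1 dict that short_hash() reads in the traceback above). A minimal sketch for printing them, assuming that private attribute keeps its current name:

# _vocab_sha1 maps each hosted vocabulary name to its hash metadata;
# it is internal to gluonnlp, so treat this as a debugging aid, not a stable API.
from gluonnlp.data.utils import _vocab_sha1
print(sorted(_vocab_sha1.keys()))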

Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

# paste outputs here

This script (https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py) does not exist; it returns a 404.

@craffel thanks for reporting. The above PRs should fix the problem. The correct dataset name is distilbert_book_corpus_wiki_en_uncased.
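For anyone hitting this before the docs example is updated, the corrected call looks like this; the only change is the dataset name, which matches the hosted-vocabulary list in the error message above:

import gluonnlp as nlp
import mxnet as mx

# 'distilbert_book_corpus_wiki_en_uncased' (not 'distil_book_corpus_wiki_en_uncased')
# is the vocabulary actually hosted for this model
model, vocab = nlp.model.get_model('distilbert_6_768_12',
                                   dataset_name='distilbert_book_corpus_wiki_en_uncased')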

Thanks.