dmlc / gluon-nlp

NLP made easy

Home Page: https://nlp.gluon.ai/


[Error Message] Improve the error message in SentencepieceTokenizer when unexpected arguments are passed.

preeyank5 opened this issue · comments

Description

While using tokenizers.create with the model and vocab files for a custom corpus, the code throws the error below and fails to generate the BERT vocab file.

Error Message

ValueError: Mismatch vocabulary! All special tokens specified must be control tokens in the sentencepiece vocabulary.

To Reproduce

from gluonnlp.data import tokenizers
tokenizers.create('spm', model_path='lsw1/spm.model', vocab_path='lsw1/spm.vocab')

spm.zip

Actually I can load the model:

import gluonnlp
from gluonnlp.data.tokenizers import SentencepieceTokenizer
tokenizer = SentencepieceTokenizer(model_path='spm.model', vocab='spm.vocab')
print(tokenizer)

Output:

SentencepieceTokenizer(
   model_path = /home/ubuntu/spm.model
   lowercase = False, nbest = 0, alpha = 0.0
   vocab = Vocab(size=3500, unk_token="<unk>", bos_token="<s>", eos_token="</s>", pad_token="<pad>")
)

@preeyank5 Would you try again?

I find that the root cause is that we need better error handling of the **kwargs here. The argument should be vocab instead of vocab_path, so vocab_path ends up in **kwargs and is silently ignored.

The way to fix the issue is to revise:

for k, v in kwargs.items():
    if k in special_tokens_kv:
        if v != special_tokens_kv[k]:
            ...

Marked it as a "good first issue" because it's a good fit for early contributors. We just need to ensure that the correct error is raised when kwargs contains unexpected values.
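One possible shape for the fix, sketched below. The helper name and the expected-key list are illustrative, not GluonNLP's actual API; the idea is simply to reject unrecognized keys with a message that names them instead of silently ignoring them:

```python
def check_extra_kwargs(expected_keys, **kwargs):
    # Collect any keyword arguments that are not recognized.
    unexpected = set(kwargs) - set(expected_keys)
    if unexpected:
        raise ValueError(
            'Unexpected keyword arguments: {}. Expected one of: {}.'.format(
                sorted(unexpected), sorted(expected_keys)))

# e.g. a user passing vocab_path= where vocab= was expected:
try:
    check_extra_kwargs(['vocab', 'lowercase', 'nbest', 'alpha'],
                       vocab_path='spm.vocab')
except ValueError as err:
    print(err)  # the message names 'vocab_path' as the offending key
```

With such a check in the constructor, the original repro would fail fast with a message pointing at vocab_path rather than the confusing "Mismatch vocabulary!" error.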

Thanks Xingjian, I am now able to load the model.

Let's keep this issue to track the error message. We should raise the error if the user has specified some unexpected kwargs.

Hi, I am new to this project and would like to tackle this issue.


Have you solved it yet?