mesolitica / malaya

Natural Language Toolkit for Malaysian language, https://malaya.readthedocs.io/

UnicodeDecodeError in Transformer

anglilian opened this issue

I'm getting the following error:

`UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 117: character maps to <undefined>

UnicodeDecodeError Traceback (most recent call last)
in
----> 1 malaya.sentiment.transformer(model = 'tiny-bert')

~\AppData\Local\Programs\Python\Python39\lib\site-packages\herpetologist\__init__.py in check(*args, **kwargs)
98 nested_check(v, p)
99
--> 100 return func(*args, **kwargs)
101
102 return check

~\AppData\Local\Programs\Python\Python39\lib\site-packages\malaya\sentiment.py in transformer(model, quantized, **kwargs)
110 'model not supported, please check supported models from malaya.sentiment.available_transformer().'
111 )
--> 112 return classification.transformer(
113 module='sentiment',
114 label=label,

~\AppData\Local\Programs\Python\Python39\lib\site-packages\malaya\supervised\classification.py in transformer(module, label, model, sigmoid, quantized, **kwargs)
167
168 outputs = ['logits', 'logits_seq']
--> 169 tokenizer = TOKENIZER_MODEL[model](vocab_file=path['vocab'], spm_model_file=path['tokenizer'])
170 input_nodes, output_nodes = nodes_session(
171 g,

~\AppData\Local\Programs\Python\Python39\lib\site-packages\malaya\text\bpe.py in __init__(self, vocab_file, spm_model_file, **kwargs)
85
86 with open(vocab_file, encoding="utf-8") as fopen:
---> 87 v = fopen.read().split("\n")[:-1]
88 v = [i.split("\t") for i in v]
89 self.vocab = {i[0]: i[1] for i in v}

~\AppData\Local\Programs\Python\Python39\lib\encodings\cp1252.py in decode(self, input, final)
21 class IncrementalDecoder(codecs.IncrementalDecoder):
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24
25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 117: character maps to <undefined>`

This happens when I load the BERT and ALBERT transformers for both the sentiment and emotion models. The code I ran was:
import malaya
malaya.sentiment.transformer(model='tiny-bert')
malaya.sentiment.transformer(model='bert')

I tried to fix it by adding encoding="utf-8" to the open() call in malaya\text\bpe.py, but it still fails.
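
For reference, here is a minimal sketch of the failure mode, independent of malaya. It assumes the vocab file is UTF-8 text; the file name and its contents below are hypothetical stand-ins, not the real model vocabulary. On Windows, open() without an encoding argument typically falls back to the locale codec (cp1252 here), in which byte 0x81 has no mapping, so the sketch forces cp1252 to reproduce the error and then reads the same file as UTF-8 using the same parsing steps shown in the traceback.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for the vocab file (the real one ships with the model).
    # "\u0101" encodes to bytes C4 81 in UTF-8; 0x81 is undefined in cp1252.
    vocab_file = os.path.join(tmp, "vocab.txt")
    with open(vocab_file, "w", encoding="utf-8") as fout:
        fout.write("hello\t0\n\u0101\t1\n")

    # Forcing cp1252 mimics what open() does by default on many Windows setups.
    cp1252_error = None
    try:
        with open(vocab_file, encoding="cp1252") as fopen:
            fopen.read()
    except UnicodeDecodeError as exc:
        cp1252_error = exc

    # With an explicit UTF-8 encoding the same file decodes cleanly.
    with open(vocab_file, encoding="utf-8") as fopen:
        v = fopen.read().split("\n")[:-1]
    v = [i.split("\t") for i in v]
    vocab = {i[0]: i[1] for i in v}

print(cp1252_error)
print(vocab)
```

Note that the traceback above still passes through cp1252.py even though the displayed source line already contains encoding="utf-8". One possible explanation, not confirmed in this thread, is that the interpreter was still running the previously imported copy of the module; Python does not re-read an edited file until the kernel is restarted or the module is reloaded.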