lonePatient / Bert-Multi-Label-Text-Classification

This repo contains a PyTorch implementation of a pretrained BERT model for multi-label text classification.

Error in training

RGaonkar opened this issue

This is a great project! Finally, a great repo on multi-label classification with BERT. I am trying to train the BERT model, but I get the following error while reading the config file:
  File "/home/rgaonkar/context_home/rgaonkar/virtualenv/label_env_new/lib/python3.6/site-packages/pytorch_transformers/modeling_utils.py", line 177, in from_pretrained
    config = cls.from_json_file(resolved_config_file)
  File "/home/rgaonkar/context_home/rgaonkar/virtualenv/label_env_new/lib/python3.6/site-packages/pytorch_transformers/modeling_utils.py", line 206, in from_json_file
    text = reader.read()
  File "/home/rgaonkar/context_home/rgaonkar/virtualenv/label_env_new/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

I am struggling to solve this issue. Any help is highly appreciated!

commented

Your file contains bytes that are not valid UTF-8. Either remove them as part of your data cleansing, or write the file out again as UTF-8, like this:

# r is assumed to be a dict whose 'body' entry holds the text to save
with open('your file name', 'w', encoding='utf-8') as f:
    print(r['body'], file=f)
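
Before rewriting anything, it may help to check what the config file actually contains. Byte 0x80 at position 0 is how Python pickle files (protocol 2 and later) begin, so the path passed as the config may point at a binary file, such as a weights file, rather than a JSON text file. This is only a guess from the traceback. The sketch below is a minimal check and not part of this repo; `inspect_config` is a hypothetical helper name.

```python
import json


def inspect_config(path):
    """Report whether the file at `path` looks like UTF-8 JSON or binary data.

    Hypothetical helper for diagnosis only; it is not part of the repo or of
    pytorch_transformers.
    """
    # Read raw bytes so that no decoding error can occur while opening.
    with open(path, 'rb') as f:
        raw = f.read()

    # 0x80 can never start a UTF-8 sequence; pickle protocol 2+ starts with it.
    if raw[:1] == b'\x80':
        return 'binary (possibly a pickle or model weights file)'

    try:
        json.loads(raw.decode('utf-8'))
    except UnicodeDecodeError:
        return 'not valid UTF-8'
    except json.JSONDecodeError:
        return 'UTF-8 text but not valid JSON'
    return 'valid UTF-8 JSON'
```

Calling `inspect_config` on the path you pass as the BERT config should return 'valid UTF-8 JSON'. If it reports a binary file, the config path probably points at the wrong file, and re-encoding it will not help.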

HTH