UnicodeDecodeError: 'gbk' codec can't decode byte 0xa4 in position 64: illegal multibyte sequence
lhtpluto opened this issue
2023-11-03 20:10:25,978 - WARNING - Loading data...
Traceback (most recent call last):
  File "D:\test\chatglm3-base-tuning-master\train.py", line 52, in <module>
    trainer.train()
  File "D:\test\chatglm3-base-tuning-master\trainer.py", line 19, in train
    self.data_module = ChatDataModule(
                       ^^^^^^^^^^^^^^^
  File "D:\test\chatglm3-base-tuning-master\chat_data_module.py", line 75, in __init__
    self.train_dataset = ChatDataset(tokenizer=tokenizer, data_path=data_path_train, max_tokens=max_tokens)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\test\chatglm3-base-tuning-master\chat_data_module.py", line 37, in __init__
    conversations = jload(data_path)
                    ^^^^^^^^^^^^^^^^
  File "D:\test\chatglm3-base-tuning-master\chat_data_module.py", line 28, in jload
    jdict = json.load(f)
            ^^^^^^^^^^^^
  File "D:\test\chatglm3-base-tuning-master\env\Lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
                 ^^^^^^^^^
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa4 in position 64: illegal multibyte sequence
I am using formatted_samples.json.
The dataset's encoding is wrong.
What is the correct encoding? formatted_samples.json is UTF-8.
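If the file really is UTF-8, the traceback suggests the problem is on the reading side. The error names the 'gbk' codec, which is what Python's open() typically falls back to on Simplified Chinese Windows when no encoding is given. A minimal sketch of a loader that forces UTF-8 follows. The name jload_utf8 is hypothetical, and the repository's actual jload may take different arguments, so treat this as an assumption about how it opens the file rather than a confirmed patch:

```python
import json


def jload_utf8(path):
    """Load a JSON file, decoding it as UTF-8 regardless of the OS locale.

    Hypothetical stand-in for the repository's jload(); its real signature
    is not shown in this thread.
    """
    # Without encoding=..., open() uses the locale's preferred encoding,
    # which is usually GBK (cp936) on Simplified Chinese Windows.
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

Another option that avoids touching the code is to run Python in UTF-8 mode, either by setting the environment variable PYTHONUTF8=1 or by starting the script with python -X utf8 train.py. If the file was saved with a byte order mark, encoding="utf-8-sig" may be the safer choice.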