xxw1995 / chatglm3-finetune

最容易上手的0门槛 chatglm3 & agent & langchain 项目

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa4 in position 64: illegal multibyte sequence

lhtpluto opened this issue · comments

2023-11-03 20:10:25,978 - WARNING - Loading data...
Traceback (most recent call last):
File "D:\test\chatglm3-base-tuning-master\train.py", line 52, in
trainer.train()
File "D:\test\chatglm3-base-tuning-master\trainer.py", line 19, in train
self.data_module = ChatDataModule(
^^^^^^^^^^^^^^^
File "D:\test\chatglm3-base-tuning-master\chat_data_module.py", line 75, in init
self.train_dataset = ChatDataset(tokenizer=tokenizer, data_path=data_path_train, max_tokens=max_tokens)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\test\chatglm3-base-tuning-master\chat_data_module.py", line 37, in init
conversations = jload(data_path)
^^^^^^^^^^^^^^^^
File "D:\test\chatglm3-base-tuning-master\chat_data_module.py", line 28, in jload
jdict = json.load(f)
^^^^^^^^^^^^
File "D:\test\chatglm3-base-tuning-master\env\Lib\json_init_.py", line 293, in load
return loads(fp.read(),
^^^^^^^^^
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa4 in position 64: illegal multibyte sequence

使用的formatted_samples.json

commented

数据集编码格式不对

数据集编码格式不对

正确的编码格式是什么? formatted_samples.json是UTF-8的