WangRongsheng / XrayGLM

🩺 The first Chinese multimodal medical large model that can interpret chest X-rays (chest radiograph summarization).

Fine-tuning problem: error reading images

AriesChen-UPC opened this issue

Fine-tuning on the data with the example script (bash finetune_XrayGLM.sh) fails with the following error:

Traceback (most recent call last):
  File "/content/drive/MyDrive/XrayGLM/XrayGLM-6B/finetune_XrayGLM.py", line 194, in <module>
    training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=create_dataset_function, collate_fn=data_collator)
  File "/usr/local/lib/python3.10/dist-packages/sat/training/deepspeed_training.py", line 67, in training_main
    train_data, val_data, test_data = make_loaders(args, hooks['create_dataset_function'], collate_fn=collate_fn)
  File "/usr/local/lib/python3.10/dist-packages/sat/data_utils/configure_data.py", line 197, in make_loaders
    train = make_dataset(**data_set_args, args=args, dataset_weights=args.train_data_weights, is_train_data=True)
  File "/usr/local/lib/python3.10/dist-packages/sat/data_utils/configure_data.py", line 124, in make_dataset_full
    d = create_dataset_function(p, args)
  File "/content/drive/MyDrive/XrayGLM/XrayGLM-6B/finetune_XrayGLM.py", line 160, in create_dataset_function
    dataset = FewShotDataset(path, image_processor, tokenizer, args)
  File "/content/drive/MyDrive/XrayGLM/XrayGLM-6B/finetune_XrayGLM.py", line 117, in __init__
    image = processor(Image.open(item['img']).convert('RGB'))
TypeError: string indices must be integers

Test environment: Google Colab A100
Data storage: Google Drive

PS: This is similar to Issue 5. The problem occurs when reading images and other data stored on Google Drive.
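
For readers hitting the same message: `TypeError: string indices must be integers` is what Python raises when a `str` is indexed with a string key. One plausible cause here (an assumption, inferred from the `data['annotations']` access in the reply further down) is that the JSON file's top level is a dict rather than a list, so `for item in data` yields key names instead of records. A minimal sketch with made-up data, not the real openi-zh.json:

```python
import json

# Hypothetical miniature of a file whose top level is a dict (assumed layout).
raw = '{"annotations": [{"image_id": "1", "caption": "example"}]}'
data = json.loads(raw)


def first_image_key(records):
    """Mimic the failing loop: take the first item and index it with 'img'."""
    for item in records:
        return item["img"]


# Iterating a dict yields its keys (strings), so item["img"] indexes a str.
try:
    first_image_key(data)
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

# Iterating the list stored under "annotations" yields dicts instead.
items = data["annotations"]
item_types = [type(item).__name__ for item in items]
```

If this is indeed the cause, the failure is unrelated to where the images are stored and comes from the JSON layout alone.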

commented

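After testing, I found that finetune_XrayGLM.py expects the dataset.json format found under data/demo/.
To read openi-zh.json, I modified part of the code as follows.
Note: I used the chardet package to detect the JSON file's encoding, because I found that openi-zh.json is encoded as ASCII.
1. Add a get_encoding function before FewShotDataset to get the file's encoding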

import chardet
def get_encoding(file_path):
    # Open the file in binary mode, read part of it, then detect its encoding
    with open(file_path, 'rb') as f:
        data = f.read(100)  # Read only the first 100 bytes for speed
    encod = chardet.detect(data)['encoding']
    return encod

2. Change part of the FewShotDataset code

class FewShotDataset(Dataset):
    def __init__(self, path, processor, tokenizer, args):
        max_seq_length = args.max_source_length + args.max_target_length
        self.images = []
        self.input_ids = []
        self.labels = []
        # Detect the file encoding before opening it as text
        encod = get_encoding(path)
        with open(path, 'r', encoding=encod) as f:
            data = json.load(f)
        # openi-zh.json keeps its records under the 'annotations' key
        data = data['annotations']
        for item in data:
            # Build the image path from image_id instead of reading item['img']
            image = processor(Image.open('data/Xray/' + item['image_id']+'.png').convert('RGB'))
            input0 = tokenizer.encode("<img>", add_special_tokens=False)
            input1 = [tokenizer.pad_token_id] * args.image_length
            # Fixed question used as the prompt for every image
            input2 = tokenizer.encode("</img>问:通过这张胸部x光影像可以诊断出什么?\n答:", add_special_tokens=False)
            a_ids = sum([input0, input1, input2], [])
            # The caption is used as the target text
            b_ids = tokenizer.encode(text=item['caption'], add_special_tokens=False)

3. The rest of the code is unchanged
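
On the encoding step above: ASCII is a strict subset of UTF-8, so a file that chardet reports as ASCII can also be opened with `encoding='utf-8'`. The sketch below illustrates this with a temporary file. It assumes (this is not confirmed in the thread) that the file was produced by something like `json.dump` with its default `ensure_ascii=True`, which writes Chinese text as `\uXXXX` escapes; the sample record is made up.

```python
import json
import os
import tempfile

# Hypothetical record; the real openi-zh.json contents are not shown here.
record = {"annotations": [{"image_id": "1", "caption": "双肺清晰"}]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")

    # json.dump escapes non-ASCII characters by default (ensure_ascii=True),
    # so the bytes written to disk are pure ASCII.
    with open(path, "w", encoding="ascii") as f:
        json.dump(record, f)

    with open(path, "rb") as f:
        is_pure_ascii = f.read().isascii()

    # Pure-ASCII bytes are valid UTF-8, so no encoding detection is needed.
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

caption = loaded["annotations"][0]["caption"]
```

Under that assumption, `open(path, 'r', encoding='utf-8')` would be a simpler alternative to the chardet call, and it avoids guessing from only the first 100 bytes.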

OK, thank you very much.
I will debug based on the information you provided.

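Run ./data/build_ch_prompt.py, and pay attention to the path where the images are stored. Then change the JSON path in finetune_XrayGLM.sh to the file you just generated. The openi-zh.json provided by the author is not yet the final trainable JSON; comparing it with visual_GLM's dataset.json makes the difference clear.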
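
To make the format difference concrete, here is a rough sketch of that kind of conversion. It is not the actual build_ch_prompt.py, whose contents are not shown in this thread. The function name `annotations_to_records` is hypothetical, the `img` key comes from the traceback, and the `prompt` and `label` keys are assumptions about the dataset.json layout that should be checked against data/demo/dataset.json.

```python
import json
import os


def annotations_to_records(data, image_dir, prompt):
    """Turn {"annotations": [{"image_id", "caption"}, ...]} into a flat list.

    Output keys "prompt" and "label" are assumed, not confirmed by the thread.
    """
    records = []
    for ann in data["annotations"]:
        records.append({
            "img": os.path.join(image_dir, ann["image_id"] + ".png"),
            "prompt": prompt,
            "label": ann["caption"],
        })
    return records


# Made-up sample in the layout implied by the modified FewShotDataset code.
sample = {"annotations": [{"image_id": "1", "caption": "example caption"}]}
records = annotations_to_records(
    sample, "data/Xray", "通过这张胸部x光影像可以诊断出什么?"
)

# ensure_ascii=False keeps Chinese text readable in the generated file.
output_text = json.dumps(records, ensure_ascii=False, indent=2)
```

Writing `output_text` to a file and pointing finetune_XrayGLM.sh at it would then let the unmodified loader read `item['img']` from each record, provided the assumed keys match what the training script expects.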