shibing624 / textgen

TextGen: implementation of text generation models, including LLaMA, ChatGLM, BLOOM, GPT2, Seq2Seq, BART, T5, SongNet, UDA and more, with out-of-the-box training and inference.

Data loading issue

MonkeyTB opened this issue · comments

self.examples = dataset["input_ids"]

Hi, a quick question: after loading the data this way, chatglm_model.py lines 243-245 show that the loaded data is empty. How should I interpret this?

To add: training fails with an error saying the data is empty.

Did you download the ADGEN dataset?

I pulled today's updated code and the data problem is gone. I'm a bit baffled; from the code it looks like a filter was simply missing.
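
For reference, a minimal sketch of the kind of filter that resolves this; the toy dataset and values below are invented, only `datasets.Dataset.filter` is the relevant call:

```python
# Minimal sketch (not the repo's exact code): drop rows whose tokenization
# produced an empty input_ids list, so self.examples is never empty.
from datasets import Dataset

dataset = Dataset.from_dict({"input_ids": [[5, 12, 3], [], [7, 8]]})
dataset = dataset.filter(lambda example: len(example["input_ids"]) > 0)
print(len(dataset))  # 2 -- the empty row has been removed
```
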
There is another strange issue:
2023-04-13 11:51:18.354 | INFO | chatglm.chatglm_model:train_model:283 - Training/evaluation parameters TrainingArguments( _n_gpu=3, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=True, bf16=False, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, do_eval=False, do_predict=False, do_train=False, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=True, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=1, gradient_checkpointing=False, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=<HUB_TOKEN>, ignore_data_skip=False, include_inputs_for_metrics=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=0.0002, length_column_name=length, load_best_model_at_end=False, local_rank=-1, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=./result//logs, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=50, logging_strategy=steps, lr_scheduler_type=linear, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, no_cuda=False, num_train_epochs=1, optim=adamw_torch, optim_args=None, output_dir=./result/, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=1, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=<PUSH_TO_HUB_TOKEN>, ray_scope=last, remove_unused_columns=False, report_to=['tensorboard', 'wandb'], resume_from_checkpoint=None, run_name=./result/, save_on_each_node=False, save_steps=400, save_strategy=steps, save_total_limit=3, seed=42, sharded_ddp=[], skip_memory_metrics=True, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0, xpu_backend=None, )
2023-04-13 11:51:18.501 | INFO | chatglm.chatglm_model:train_model:297 - *** Train ***
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 0
wandb: WARNING Invalid choice
wandb: Enter your choice:

  1. Why does it say _n_gpu=3? I searched every config and assignment and could not find where 3 is set; the config file says 1.
  2. What is wandb? It asks me to enter a choice, and after typing several (random) inputs the following link appears:
    wandb: You chose 'Create a W&B account'
    wandb: Create an account here: https://wandb.ai/authorize?signup=true
    wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

Then use the latest code. wandb is just training-run logging; you can ignore it.
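
On the _n_gpu=3 question: Hugging Face TrainingArguments reports the number of GPUs visible to the process (torch.cuda.device_count()), not a value read from a config file. A common workaround, assumed here rather than taken from this repo, is to restrict the visible devices before anything touches CUDA:

```python
# Assumed workaround, not repo code: expose only one GPU to this process,
# so TrainingArguments reports _n_gpu=1. Set this before torch initializes CUDA.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```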

2023-04-13 12:23:01.014 | INFO | chatglm.chatglm_model:train_model:297 - *** Train ***
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: e091e352ec72db11655f6fa7dcfd6d4a7b83xxxx
wandb: WARNING Invalid choice
wandb: Enter your choice: glm
wandb: WARNING Invalid choice
wandb: Enter your choice: 111
wandb: WARNING Invalid choice
wandb: Enter your choice: 0
wandb: WARNING Invalid choice
wandb: Enter your choice: 1
wandb: You chose 'Create a W&B account'
wandb: Create an account here: https://wandb.ai/authorize?signup=true
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: ERROR API key must be 40 characters long, yours was 1
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice:

If I ignore it, training never starts. Even after commenting out `import wandb` the prompt still pops up and forces me to enter something.

I registered an account and entered the 40-character key, but it still does not work 😓

export WANDB_MODE=offline
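
The same can also be done from Python; a hedged sketch, assuming the stock wandb / transformers integration:

```python
# Keep wandb from prompting for an account or API key during training.
import os

os.environ["WANDB_MODE"] = "offline"     # log locally, never ask for a key
# os.environ["WANDB_DISABLED"] = "true"  # or turn off the HF Trainer's wandb reporting
```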

```python
    input_text, target_text = example["content"], example["summary"]
    instruction = "改写为电商广告文案:"
    prompt = f"问:{instruction}\n{input_text}\n答:"
    prompt_ids = tokenizer.encode(prompt, max_length=args.max_seq_length)
    target_ids = tokenizer.encode(target_text, max_length=args.max_length,
                                  add_special_tokens=False)
    input_ids = prompt_ids + target_ids
    input_ids = input_ids[:(args.max_seq_length + args.max_length)] + [tokenizer.eos_token_id]

    example['input_ids'] = input_ids
    return example
```
I think there may be a problem here:
`input_ids = prompt_ids + target_ids`
should arguably be changed to
`input_ids = prompt_ids + [tokenizer.bos_token_id] + target_ids`
because the collator in chatglm_model.py locates the prompt's bos_token_id in order to ignore the prompt portion:
```python
    def data_collator(self, batch):
        len_ids = [len(example) for example in batch]
        longest = max(len_ids)
        input_ids = []
        labels_list = []
        for ids_l, example in sorted(zip(len_ids, batch), key=lambda x: -x[0]):
            ids = list(example)
            logger.info(ids)
            seq_len = ids.index(self.tokenizer.bos_token_id) + 1  # is equal to prompt length
            ignore_idx = -100
            labels = ([ignore_idx] * (seq_len - 1) + ids[(seq_len - 1):] + [ignore_idx] * (longest - ids_l))
            ids = ids + [self.tokenizer.pad_token_id] * (longest - ids_l)
            _ids = torch.LongTensor(ids)
            labels_list.append(torch.LongTensor(labels))
            input_ids.append(_ids)
        input_ids = torch.stack(input_ids)
        labels = torch.stack(labels_list)
        return {"input_ids": input_ids, "labels": labels}
```


Not sure whether my understanding here is correct.
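
For intuition, a toy walk-through of that collator's label masking; every token id below is invented purely for illustration:

```python
# Toy example of the prompt-masking logic quoted above (ids are made up).
ignore_idx = -100
bos_token_id = 904                        # stands in for ChatGLM's real bos id
ids = [5, 21, 22, 901, 904, 11, 12, 905]  # prompt ..., gmask, bos, answer ..., eos

seq_len = ids.index(bos_token_id) + 1     # prompt length including bos -> 5
labels = [ignore_idx] * (seq_len - 1) + ids[seq_len - 1:]
print(labels)  # [-100, -100, -100, -100, 904, 11, 12, 905]
# Loss is computed only from bos onward, so the prompt tokens are ignored.
```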

export WANDB_MODE=offline

Thanks. I was out of options, so I uninstalled wandb and training works now. I will reinstall it and try this 😓

Just choose option 3.

That's right; prompt_ids is encoded with add_special_tokens=True by default, so it already carries the bos and gmask tokens.

Let me keep digging. When I set add_special_tokens=True it appends two 0s, i.e. two gmask tokens, and does not append the bos token id. Thanks for open-sourcing this.

train_dataset len: 10000, train_dataset[0]: [ 5 64286 12 65601 115448 68816 94113 75564 66104 63823
63976 70705 6 64157 64091 66889 64447 63823 4 95059
78289 63825 72663 12 28 64265 69028 63907 65667 6
70283 63854 64091 69466 97891 73134 6 63847 65283 64472
66876 78 4 4 67342 12 130001 130004 65831 72663
65247 75564 66104 63823 130005]

The two special tokens here:
130001 130004

130001 is the gmask token and 130004 is the bos (<sop>) token.
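
A quick way to double-check those ids yourself, assuming the stock THUDM/chatglm-6b tokenizer (trust_remote_code is needed for its custom tokenizer class):

```python
# Sanity check of the special token ids discussed above; the commented values
# are what the original THUDM/chatglm-6b tokenizer is expected to report.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
print(tokenizer.bos_token_id)    # expected 130004 (<sop>)
print(tokenizer.eos_token_id)    # expected 130005 (<eop>)
print(tokenizer.gmask_token_id)  # expected 130001 ([gMASK])
```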

add special tokens True: [5, 66219, 1389, 64812, 69171, 0, 0]
add special tokens False [5, 66219, 1389, 64812, 69171]

After replacing ice_text.model it works correctly:
add special tokens True: [5, 66219, 1389, 64812, 69171, 130001, 130004]
add special tokens False [5, 66219, 1389, 64812, 69171]
It seems the updated files and the old ones were not fully swapped out, which caused the confusion. Before the update, tokenizer.gmask_token_id printed as 0; after updating everything runs normally with no problems.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. (Closed automatically by the bot due to inactivity; feel free to ask again if needed.)