shibing624 / textgen

TextGen: implementation of text generation models, including LLaMA, ChatGLM, BLOOM, GPT2, Seq2Seq, BART, T5, SongNet, UDA and more, with out-of-the-box training and inference.

Data loading issue

MonkeyTB opened this issue · comments

self.examples = dataset["input_ids"]

Hi, a quick question: after loading the data this way, chatglm_model.py lines 243-245 show that the loaded data is empty. How should I interpret this?

To add: training fails with an error saying the data is empty.

Did you download the ADGEN dataset?

I pulled today's updated code and the data problem is gone. I'm a bit baffled; from the code it looks like a filter was simply missing.
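
For reference, a minimal sketch of the kind of filter that resolves this; the toy dataset and values below are invented, only `datasets.Dataset.filter` is the relevant call:

```python
# Minimal sketch (not the repo's exact code): drop rows whose tokenization
# produced an empty input_ids list, so self.examples is never empty.
from datasets import Dataset

dataset = Dataset.from_dict({"input_ids": [[5, 12, 3], [], [7, 8]]})
dataset = dataset.filter(lambda example: len(example["input_ids"]) > 0)
print(len(dataset))  # 2 -- the empty row has been removed
```
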
There is another strange issue:
2023-04-13 11:51:18.354 | INFO | chatglm.chatglm_model:train_model:283 - Training/evaluation parameters TrainingArguments( _n_gpu=3, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=True, bf16=False, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, do_eval=False, do_predict=False, do_train=False, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=True, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=1, gradient_checkpointing=False, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=<HUB_TOKEN>, ignore_data_skip=False, include_inputs_for_metrics=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=0.0002, length_column_name=length, load_best_model_at_end=False, local_rank=-1, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=./result//logs, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=50, logging_strategy=steps, lr_scheduler_type=linear, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, no_cuda=False, num_train_epochs=1, optim=adamw_torch, optim_args=None, output_dir=./result/, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=1, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=<PUSH_TO_HUB_TOKEN>, ray_scope=last, remove_unused_columns=False, report_to=['tensorboard', 'wandb'], resume_from_checkpoint=None, run_name=./result/, save_on_each_node=False, save_steps=400, save_strategy=steps, save_total_limit=3, seed=42, sharded_ddp=[], skip_memory_metrics=True, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.0, warmup_steps=0, weight_decay=0.0, xpu_backend=None, )
2023-04-13 11:51:18.501 | INFO | chatglm.chatglm_model:train_model:297 - *** Train ***
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 0
wandb: WARNING Invalid choice
wandb: Enter your choice:

  1. Why does it say _n_gpu=3? I searched every config and assignment and could not find where 3 is set; the config file says 1.
  2. What is wandb? It asks me to enter a choice, and after typing several (random) inputs the following link appears:
    wandb: You chose 'Create a W&B account'
    wandb: Create an account here: https://wandb.ai/authorize?signup=true
    wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

Then use the latest code. wandb is just training-run logging; you can ignore it.
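
On the _n_gpu=3 question: Hugging Face TrainingArguments reports the number of GPUs visible to the process (torch.cuda.device_count()), not a value read from a config file. A common workaround, assumed here rather than taken from this repo, is to restrict the visible devices before anything touches CUDA:

```python
# Assumed workaround, not repo code: expose only one GPU to this process,
# so TrainingArguments reports _n_gpu=1. Set this before torch initializes CUDA.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```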

2023-04-13 12:23:01.014 | INFO | chatglm.chatglm_model:train_model:297 - *** Train ***
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: e091e352ec72db11655f6fa7dcfd6d4a7b83xxxx
wandb: WARNING Invalid choice
wandb: Enter your choice: glm
wandb: WARNING Invalid choice
wandb: Enter your choice: 111
wandb: WARNING Invalid choice
wandb: Enter your choice: 0
wandb: WARNING Invalid choice
wandb: Enter your choice: 1
wandb: You chose 'Create a W&B account'
wandb: Create an account here: https://wandb.ai/authorize?signup=true
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:
wandb: ERROR API key must be 40 characters long, yours was 1
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice:

If I ignore it, training never starts. Even after commenting out `import wandb` the prompt still pops up and forces me to enter something.

I registered an account and entered the 40-character key, but it still does not work 😓

export WANDB_MODE=offline
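
The same can also be done from Python; a hedged sketch, assuming the stock wandb / transformers integration:

```python
# Keep wandb from prompting for an account or API key during training.
import os

os.environ["WANDB_MODE"] = "offline"     # log locally, never ask for a key
# os.environ["WANDB_DISABLED"] = "true"  # or turn off the HF Trainer's wandb reporting
```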

```python
    input_text, target_text = example["content"], example["summary"]
    instruction = "改写为电商广告文案:"
    prompt = f"问:{instruction}\n{input_text}\n答:"
    prompt_ids = tokenizer.encode(prompt, max_length=args.max_seq_length)
    target_ids = tokenizer.encode(target_text, max_length=args.max_length,
                                  add_special_tokens=False)
    input_ids = prompt_ids + target_ids
    input_ids = input_ids[:(args.max_seq_length + args.max_length)] + [tokenizer.eos_token_id]

    example['input_ids'] = input_ids
    return example
```
I think there may be a problem here:
`input_ids = prompt_ids + target_ids`
should arguably be changed to
`input_ids = prompt_ids + [tokenizer.bos_token_id] + target_ids`
because the collator in chatglm_model.py locates the prompt's bos_token_id in order to ignore the prompt portion:
```python
    def data_collator(self, batch):
        len_ids = [len(example) for example in batch]
        longest = max(len_ids)
        input_ids = []
        labels_list = []
        for ids_l, example in sorted(zip(len_ids, batch), key=lambda x: -x[0]):
            ids = list(example)
            logger.info(ids)
            seq_len = ids.index(self.tokenizer.bos_token_id) + 1  # is equal to prompt length
            ignore_idx = -100
            labels = ([ignore_idx] * (seq_len - 1) + ids[(seq_len - 1):] + [ignore_idx] * (longest - ids_l))
            ids = ids + [self.tokenizer.pad_token_id] * (longest - ids_l)
            _ids = torch.LongTensor(ids)
            labels_list.append(torch.LongTensor(labels))
            input_ids.append(_ids)
        input_ids = torch.stack(input_ids)
        labels = torch.stack(labels_list)
        return {"input_ids": input_ids, "labels": labels}
```


Not sure whether my understanding here is correct.
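
For intuition, a toy walk-through of that collator's label masking; every token id below is invented purely for illustration:

```python
# Toy example of the prompt-masking logic quoted above (ids are made up).
ignore_idx = -100
bos_token_id = 904                        # stands in for ChatGLM's real bos id
ids = [5, 21, 22, 901, 904, 11, 12, 905]  # prompt ..., gmask, bos, answer ..., eos

seq_len = ids.index(bos_token_id) + 1     # prompt length including bos -> 5
labels = [ignore_idx] * (seq_len - 1) + ids[seq_len - 1:]
print(labels)  # [-100, -100, -100, -100, 904, 11, 12, 905]
# Loss is computed only from bos onward, so the prompt tokens are ignored.
```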

export WANDB_MODE=offline

Thanks. I was out of options, so I uninstalled wandb and training works now. I will reinstall it and try this 😓

Just choose option 3.

That's right; prompt_ids is encoded with add_special_tokens=True by default, so it already carries the bos and gmask tokens.

Let me keep digging. When I set add_special_tokens=True it appends two 0s, i.e. two gmask tokens, and does not append the bos token id. Thanks for open-sourcing this.

train_dataset len: 10000, train_dataset[0]: [ 5 64286 12 65601 115448 68816 94113 75564 66104 63823
63976 70705 6 64157 64091 66889 64447 63823 4 95059
78289 63825 72663 12 28 64265 69028 63907 65667 6
70283 63854 64091 69466 97891 73134 6 63847 65283 64472
66876 78 4 4 67342 12 130001 130004 65831 72663
65247 75564 66104 63823 130005]

The two special tokens here:
130001 130004

130001 is the gmask token and 130004 is the bos (<sop>) token.
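
A quick way to double-check those ids yourself, assuming the stock THUDM/chatglm-6b tokenizer (trust_remote_code is needed for its custom tokenizer class):

```python
# Sanity check of the special token ids discussed above; the commented values
# are what the original THUDM/chatglm-6b tokenizer is expected to report.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
print(tokenizer.bos_token_id)    # expected 130004 (<sop>)
print(tokenizer.eos_token_id)    # expected 130005 (<eop>)
print(tokenizer.gmask_token_id)  # expected 130001 ([gMASK])
```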

add special tokens True: [5, 66219, 1389, 64812, 69171, 0, 0]
add special tokens False [5, 66219, 1389, 64812, 69171]

After replacing ice_text.model it works correctly:
add special tokens True: [5, 66219, 1389, 64812, 69171, 130001, 130004]
add special tokens False [5, 66219, 1389, 64812, 69171]
It seems the updated files and the old ones were not fully swapped out, which caused the confusion. Before the update, tokenizer.gmask_token_id printed as 0; after updating everything runs normally with no problems.

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. (Closed automatically by the bot due to inactivity; feel free to ask again if needed.)