airaria / TextBrewer

A PyTorch-based knowledge distillation toolkit for natural language processing

Home Page: http://textbrewer.hfl-rc.com


msra_ner.ipynb: bug when reproducing the code

HXYstudy opened this issue · comments

Training BERT on its own works fine, the same as the Hugging Face example, but in the distillation part, distill complains that train_dataloader is not defined. After adding

train_dataloader = torch.utils.data.DataLoader(tokenized_datasets["train"], batch_size=8)

the following error occurs:
26 with distiller:
---> 27 distiller.train(optimizer, train_dataloader, num_epochs, scheduler_class=scheduler_class, scheduler_args = scheduler_args, callback=None)

8 frames
/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
80 elem_size = len(next(it))
81 if not all(len(elem) == elem_size for elem in it):
---> 82 raise RuntimeError('each element in list of batch should be of equal size')
83 transposed = zip(*batch)
84 return [default_collate(samples) for samples in transposed]

RuntimeError: each element in list of batch should be of equal size
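
For context, PyTorch's default_collate cannot stack variable-length lists, which is exactly what un-padded input_ids are. The following minimal sketch (with made-up data) reproduces the same RuntimeError:

import torch
from torch.utils.data import DataLoader

# Two samples whose input_ids have different lengths, as in the un-padded dataset.
samples = [{"input_ids": [101, 2496, 102]},
           {"input_ids": [101, 2496, 2361, 102]}]

loader = DataLoader(samples, batch_size=2)  # uses default_collate
next(iter(loader))  # RuntimeError: each element in list of batch should be of equal size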

I hope the maintainers or anyone passing by can help explain this; the problem has been bothering me for several days.

The data is a Hugging Face Dataset:
Dataset({
features: ['attention_mask', 'id', 'input_ids', 'labels', 'ner_tags', 'token_type_ids', 'tokens'],
num_rows: 45001
})
The values of a few fields in the first sample are:
input_ids:
[101, 2496, 2361, 3307, 2339, 4923, 3131, 1221, 4638, 4636, 674, 1036, 4997, 2768, 7270, 6629, 3341, 8024, 4906, 3136, 1069, 1744, 5917, 4197, 2768, 7599, 3198, 8024, 791, 1921, 3300, 3119, 5966, 817, 966, 4638, 741, 872, 3766, 743, 8024, 3209, 3189, 2218, 1373, 872, 2637, 679, 2496, 1159, 8013, 102]
tokens:
['当', '希', '望', '工', '程', '救', '助', '的', '百', '万', '儿', '童', '成', '长', '起', '来', ',', '科', '教', '兴', '国', '蔚', '然', '成', '风', '时', ',', '今', '天', '有', '收', '藏', '价', '值', '的', '书', '你', '没', '买', ',', '明', '日', '就', '叫', '你', '悔', '不', '当', '初', '!']
I suspect the samples end up with different lengths because no padding was applied. The tokenizer was already called in the earlier tokenize_and_align_labels function, so should the padding be done per batch instead, since that is how the official Hugging Face documentation handles the data? I'm not sure how to process the Dataset object further from this point. I'd appreciate some guidance, and I'd also like to ask whether the example notebook could be updated and fixed.
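
The intuition about per-batch padding is right: each batch has to be padded to a common length before it can be collated into tensors. Purely as an illustration of the mechanism (the maintainer's reply below gives the simpler library-based solution), a hand-written collate_fn could look roughly like this, assuming the tokenizer and the labels column from tokenize_and_align_labels are available:

import torch

# Rough sketch of per-batch padding for token classification (illustrative only):
# pad input_ids / attention_mask / token_type_ids with the tokenizer,
# and pad labels manually with -100 (the ignore index of the loss).
def collate_fn(features):
    labels = [f["labels"] for f in features]
    batch = tokenizer.pad(
        [{k: f[k] for k in ("input_ids", "attention_mask", "token_type_ids")}
         for f in features],
        return_tensors="pt",
    )
    max_len = batch["input_ids"].shape[1]
    batch["labels"] = torch.tensor(
        [l + [-100] * (max_len - len(l)) for l in labels]
    )
    return batch

train_dataloader = torch.utils.data.DataLoader(
    tokenized_datasets["train"], batch_size=8, collate_fn=collate_fn
)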

commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

commented

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.

Hi, you can follow the official transformers demo run_ner.py.

Use a data collator that pads the sequences so that all sentences in a batch have equal length:

from torch.utils.data import DataLoader
from transformers import DataCollatorForTokenClassification

# Dynamically pads input_ids, attention_mask, token_type_ids, and labels
# to the longest sequence in each batch.
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=32,
)
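
One more step that may be needed here (an assumption based on the dataset printed above, not part of the original reply): with a plain DataLoader, non-model columns such as 'id', 'ner_tags', and 'tokens' are not dropped automatically, so removing them before building the loader avoids collation errors on string fields:

# Keep only the columns the model and the collator actually need.
train_dataset = tokenized_datasets["train"].remove_columns(["id", "ner_tags", "tokens"])

train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    collate_fn=data_collator,
    batch_size=32,
)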