airaria / TextBrewer

A PyTorch-based knowledge distillation toolkit for natural language processing

Home Page: http://textbrewer.hfl-rc.com


msra_ner.ipynb: bug when reproducing the code

HXYstudy opened this issue · comments

Training BERT on its own works fine, the same as the Hugging Face example, but in the distillation part, distill complains that train_dataloader is not defined. After adding

train_dataloader = torch.utils.data.DataLoader(tokenized_datasets["train"], batch_size=8)

the following error occurs:
26 with distiller:
---> 27 distiller.train(optimizer, train_dataloader, num_epochs, scheduler_class=scheduler_class, scheduler_args = scheduler_args, callback=None)

8 frames
/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/collate.py in default_collate(batch)
80 elem_size = len(next(it))
81 if not all(len(elem) == elem_size for elem in it):
---> 82 raise RuntimeError('each element in list of batch should be of equal size')
83 transposed = zip(*batch)
84 return [default_collate(samples) for samples in transposed]

RuntimeError: each element in list of batch should be of equal size
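
For context, PyTorch's default_collate cannot stack variable-length lists, which is exactly what un-padded input_ids are. The following minimal sketch (with made-up data) reproduces the same RuntimeError:

import torch
from torch.utils.data import DataLoader

# Two samples whose input_ids have different lengths, as in the un-padded dataset.
samples = [{"input_ids": [101, 2496, 102]},
           {"input_ids": [101, 2496, 2361, 102]}]

loader = DataLoader(samples, batch_size=2)  # uses default_collate
next(iter(loader))  # RuntimeError: each element in list of batch should be of equal size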

I hope the maintainers or anyone passing by can help explain this; the problem has been bothering me for several days.

The data is a Hugging Face Dataset:
Dataset({
features: ['attention_mask', 'id', 'input_ids', 'labels', 'ner_tags', 'token_type_ids', 'tokens'],
num_rows: 45001
})
The values of a few fields in the first sample are:
input_ids:
[101, 2496, 2361, 3307, 2339, 4923, 3131, 1221, 4638, 4636, 674, 1036, 4997, 2768, 7270, 6629, 3341, 8024, 4906, 3136, 1069, 1744, 5917, 4197, 2768, 7599, 3198, 8024, 791, 1921, 3300, 3119, 5966, 817, 966, 4638, 741, 872, 3766, 743, 8024, 3209, 3189, 2218, 1373, 872, 2637, 679, 2496, 1159, 8013, 102]
tokens:
['当', '希', '望', '工', '程', '救', '助', '的', '百', '万', '儿', '童', '成', '长', '起', '来', ',', '科', '教', '兴', '国', '蔚', '然', '成', '风', '时', ',', '今', '天', '有', '收', '藏', '价', '值', '的', '书', '你', '没', '买', ',', '明', '日', '就', '叫', '你', '悔', '不', '当', '初', '!']
I suspect the samples end up with different lengths because no padding was applied. The tokenizer was already called in the earlier tokenize_and_align_labels function, so should the padding be done per batch instead, since that is how the official Hugging Face documentation handles the data? I'm not sure how to process the Dataset object further from this point. I'd appreciate some guidance, and I'd also like to ask whether the example notebook could be updated and fixed.
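
The intuition about per-batch padding is right: each batch has to be padded to a common length before it can be collated into tensors. Purely as an illustration of the mechanism (the maintainer's reply below gives the simpler library-based solution), a hand-written collate_fn could look roughly like this, assuming the tokenizer and the labels column from tokenize_and_align_labels are available:

import torch

# Rough sketch of per-batch padding for token classification (illustrative only):
# pad input_ids / attention_mask / token_type_ids with the tokenizer,
# and pad labels manually with -100 (the ignore index of the loss).
def collate_fn(features):
    labels = [f["labels"] for f in features]
    batch = tokenizer.pad(
        [{k: f[k] for k in ("input_ids", "attention_mask", "token_type_ids")}
         for f in features],
        return_tensors="pt",
    )
    max_len = batch["input_ids"].shape[1]
    batch["labels"] = torch.tensor(
        [l + [-100] * (max_len - len(l)) for l in labels]
    )
    return batch

train_dataloader = torch.utils.data.DataLoader(
    tokenized_datasets["train"], batch_size=8, collate_fn=collate_fn
)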

commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

commented

Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.

Hi, you can follow the official transformers demo run_ner.py.

Use a data collator that pads the sequences so that all sentences in a batch have equal length:

from torch.utils.data import DataLoader
from transformers import DataCollatorForTokenClassification

# Dynamically pads input_ids, attention_mask, token_type_ids, and labels
# to the longest sequence in each batch.
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=32,
)
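
One more step that may be needed here (an assumption based on the dataset printed above, not part of the original reply): with a plain DataLoader, non-model columns such as 'id', 'ner_tags', and 'tokens' are not dropped automatically, so removing them before building the loader avoids collation errors on string fields:

# Keep only the columns the model and the collator actually need.
train_dataset = tokenized_datasets["train"].remove_columns(["id", "ner_tags", "tokens"])

train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,
    collate_fn=data_collator,
    batch_size=32,
)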