InternLM / InternLM

Official release of InternLM2 7B and 20B base and chat models, with 200K context support.

Home Page: https://internlm.intern-ai.org.cn/

[QA] Questions about InternLM2's recognition and generation of different scripts, and about fine-tuning

timousT opened this issue

Describe the question.

  1. How well does InternLM2 recognize and generate Traditional Chinese? (See the quick tokenizer check after this list.)
  2. If I fine-tune with XTuner, how should I enlarge the tokenizer vocabulary so that Traditional Chinese is supported?
  3. If I don't use XTuner, what tools should I use to fine-tune and enlarge the tokenizer vocabulary to support Traditional Chinese?
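
One quick way to gauge how the current vocabulary handles Traditional Chinese, before deciding whether to extend it, is to inspect the tokenization of a Traditional Chinese sentence. This is a hypothetical check, not from the thread; the model ID and sample text are placeholders:

```python
from transformers import AutoTokenizer

# Load the released InternLM2 tokenizer (trust_remote_code is needed for the custom class).
tok = AutoTokenizer.from_pretrained("internlm/internlm2-chat-7b", trust_remote_code=True)

text = "請用繁體中文介紹臺北的歷史。"
ids = tok(text)["input_ids"]

# Many single-character or byte-level pieces would suggest the base vocabulary
# covers Traditional Chinese only coarsely, which is when extending it helps.
print(len(ids), tok.convert_ids_to_tokens(ids))
```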

This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 7 days if the stale label is not removed or if there is no further response.

@zhangxc11 @lvhan028 @sunpengsdu @gouchangjiang Hello, could you please answer this question? Thanks!

Hi @timousT, sorry for the late reply.
Regarding fine-tuning with XTuner, @pppppM knows it better. I've asked him to reply.

@timousT If you want to run vocabulary-extension training with XTuner, you can follow these steps:

  1. Pick the config template for the corresponding model:

     xtuner list-cfg -p internlm2
     xtuner copy-cfg internlm2_chat_7b_qlora_alpaca_e3 ./internlm2_chat_7b_new_tokenzier_alpaca_e3.py

  2. Point the config at the extended-vocabulary tokenizer (see the preparation sketch after this list):

     vi internlm2_chat_7b_new_tokenzier_alpaca_e3.py
     https://github.com/InternLM/xtuner/blob/56dbdd7610f99c5cd22c7fa59846fe46906370f7/xtuner/configs/internlm/internlm2_chat_7b/internlm2_chat_7b_qlora_alpaca_e3.py#L63

  3. Switch to full-parameter fine-tuning.
     By default XTuner fine-tunes with QLoRA, which does not train the embedding or the final fc layer.
     Just delete `quantization_config` and `lora` from the config:

     vi internlm2_chat_7b_new_tokenzier_alpaca_e3.py
     https://github.com/InternLM/xtuner/blob/main/xtuner/configs/internlm/internlm2_chat_7b/internlm2_chat_7b_qlora_alpaca_e3.py#L75-L90

  4. Launch training with DeepSpeed:

     xtuner train internlm2_chat_7b_new_tokenzier_alpaca_e3.py --deepspeed deepspeed_zero3
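
The steps above assume a tokenizer whose vocabulary has already been extended. Below is a minimal sketch of one way to prepare it, assuming the stock Hugging Face `add_tokens` / `resize_token_embeddings` APIs work with the InternLM2 tokenizer and model classes; the model ID, output paths, and token list are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "internlm/internlm2-chat-7b"
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base, trust_remote_code=True)

# Tokens the base vocabulary splits poorly; in practice, mine these from a
# Traditional Chinese corpus instead of hard-coding a short list.
new_tokens = ["繁體中文", "臺灣", "高雄"]
num_added = tokenizer.add_tokens(new_tokens)  # tokens already in the vocab are skipped
print(f"added {num_added} tokens, new vocab size {len(tokenizer)}")

# Grow the input embedding (and tied output head) to the new vocabulary size
# so that full-parameter fine-tuning can learn the new rows.
model.resize_token_embeddings(len(tokenizer))

tokenizer.save_pretrained("./internlm2_7b_zh_tw_tokenizer")
model.save_pretrained("./internlm2_7b_zh_tw_resized")
```

In the copied config, the tokenizer path (step 2) and `pretrained_model_name_or_path` would then point at these two directories; those variable names are taken from the linked config template, so double-check them against the file you copied.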

If you want to train on your own data, you can prepare the dataset by following the XTuner docs:
https://github.com/InternLM/xtuner/blob/main/docs/zh_cn/user_guides/multi_turn_conversation.md
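
For a Traditional Chinese corpus, a training sample in the multi-turn format the linked guide describes could look roughly like the following. This is a hypothetical sketch (the file name and contents are made up); check the exact field names against the current doc:

```python
import json

# One multi-turn conversation; each turn carries an "input"/"output" pair and
# the first turn may also carry a "system" prompt.
samples = [
    {
        "conversation": [
            {
                "system": "你是一個使用繁體中文回答的助理。",
                "input": "請用繁體中文介紹臺北。",
                "output": "臺北位於臺灣北部,是政治與文化中心……",
            },
            {
                "input": "那高雄呢?",
                "output": "高雄位於臺灣南部,是重要的港口城市……",
            },
        ]
    }
]

with open("zh_tw_multi_turn.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```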

This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 7 days if the stale label is not removed or if there is no further response.

This issue is closed because it has been stale for 7 days. Please open a new issue if you have similar issues or you have any new updates now.