[QA] InternLM2's ability to recognize and generate different scripts, plus related fine-tuning questions
timousT opened this issue
Describe the question.
- How well can InternLM2 recognize and generate Traditional Chinese?
- If I fine-tune with XTuner, how should I fine-tune it and enlarge the tokenizer vocabulary to support Traditional Chinese?
- If I don't use XTuner, what tools should I use to fine-tune it and enlarge the tokenizer vocabulary to support Traditional Chinese?
@zhangxc11 @lvhan028 @sunpengsdu @gouchangjiang Hello, could you please answer this question? Thanks!
@timousT If you want to do vocabulary-expansion training with XTuner, you can follow these steps:
- Pick the config template for the corresponding model

  xtuner list-cfg -p internlm2
  xtuner copy-cfg internlm2_chat_7b_qlora_alpaca_e3 ./internlm2_chat_7b_new_tokenizer_alpaca_e3.py

- Set the expanded tokenizer in the config (the config assumes the expanded tokenizer already exists; one way to produce it is sketched after this list)

  vi internlm2_chat_7b_new_tokenizer_alpaca_e3.py
  https://github.com/InternLM/xtuner/blob/56dbdd7610f99c5cd22c7fa59846fe46906370f7/xtuner/configs/internlm/internlm2_chat_7b/internlm2_chat_7b_qlora_alpaca_e3.py#L63
- Switch to full-parameter fine-tuning

  XTuner uses QLoRA by default, which does not train the embedding layer or the final fc (output) layer; both must be trained when the vocabulary is expanded. Simply delete the quantization_config and lora entries from the config (an illustrative excerpt of the edited config follows this list).

  vi internlm2_chat_7b_new_tokenizer_alpaca_e3.py
  https://github.com/InternLM/xtuner/blob/main/xtuner/configs/internlm/internlm2_chat_7b/internlm2_chat_7b_qlora_alpaca_e3.py#L75-L90
- Launch training with DeepSpeed

  xtuner train internlm2_chat_7b_new_tokenizer_alpaca_e3.py --deepspeed deepspeed_zero3
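For reference, here is a minimal sketch of how an expanded tokenizer could be produced before the second step above. This is not an official XTuner recipe: the token list and save path are hypothetical placeholders, and in practice the new tokens would come from a Traditional Chinese corpus (e.g. a SentencePiece model trained on that corpus and merged into the base vocabulary).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base InternLM2 tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(
    "internlm/internlm2-chat-7b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm2-chat-7b", trust_remote_code=True)

# Hypothetical new Traditional Chinese tokens collected from your corpus
new_tokens = ["繁體中文", "臺灣", "資訊"]

# add_tokens skips tokens that already exist in the vocabulary
num_added = tokenizer.add_tokens(new_tokens)
print(f"added {num_added} tokens, vocab size is now {len(tokenizer)}")

# Resize the embedding and output layers to match the enlarged vocabulary;
# the new rows are randomly initialized, which is why full-parameter
# fine-tuning (rather than QLoRA) is needed afterwards
model.resize_token_embeddings(len(tokenizer))

# Save both; point the XTuner config at this path (hypothetical)
tokenizer.save_pretrained("./internlm2-chat-7b-tw")
model.save_pretrained("./internlm2-chat-7b-tw")
```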
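And a sketch of what the relevant part of the copied config might look like after the two config edits above. The layout follows the linked upstream QLoRA config; exact imports and field names can differ between XTuner versions, so treat it as an illustration rather than a verbatim diff.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.model import SupervisedFinetune

# Point at the expanded tokenizer/model saved earlier (hypothetical path)
pretrained_model_name_or_path = './internlm2-chat-7b-tw'

tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')

# Full-parameter fine-tuning: the quantization_config and lora entries from
# the original QLoRA config are simply omitted here
model = dict(
    type=SupervisedFinetune,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True))
```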
If you want to train on your own data, you can prepare a dataset following the XTuner docs (a minimal sample is sketched below the link):
https://github.com/InternLM/xtuner/blob/main/docs/zh_cn/user_guides/multi_turn_conversation.md
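For reference, a minimal hypothetical sample in the multi-turn conversation format that the linked docs describe (field names per those docs; the dialogue text itself is made up):

```python
import json

# One training sample with two dialogue turns, in the multi-turn format
# described in the linked XTuner docs (verify field names against the docs)
samples = [
    {
        "conversation": [
            {
                "system": "你是一個使用繁體中文回答的助理。",
                "input": "請用繁體中文介紹一下日月潭。",
                "output": "日月潭位於南投縣魚池鄉，是臺灣著名的湖泊景點。"
            },
            {
                "input": "附近還有哪些值得一遊的地方？",
                "output": "可以順道造訪九族文化村與向山遊客中心。"
            }
        ]
    }
]

# Save as JSON for use with XTuner's dataset tooling
with open("traditional_chinese_sft.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```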
This issue is marked as stale because it has been marked as invalid or awaiting response for 7 days without any further response. It will be closed in 7 days if the stale label is not removed or if there is no further response.
This issue is closed because it has been stale for 7 days. Please open a new issue if you have similar issues or you have any new updates now.