The flm tokenizer produces different results from the original tokenizer
yiguanxian opened this issue · comments
Both chatglm2 and baichuan2 have this problem.
- Model conversion

```python
from fastllm_pytools import llm
from transformers import AutoTokenizer, AutoModel

hf_model = "/workspace/chatglm2-6B"
flm_dtype = "int8"
model_name = hf_model.split("/")[-1]
flm_model = f"/workspace/models/{model_name}-fastllm-{flm_dtype}.flm"

tokenizer = AutoTokenizer.from_pretrained(hf_model, trust_remote_code=True)
model = AutoModel.from_pretrained(hf_model, trust_remote_code=True).half().cuda()
model = llm.from_hf(model, tokenizer, dtype=flm_dtype)
model.save(flm_model)
```
- Test code

```python
from transformers import AutoTokenizer

prompt_input = "[Round 1]"
model_path = "/workspace/chatglm2-6B"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
print(f"src prompt: {prompt_input}, token id: {tokenizer.encode(prompt_input)}")  # "Round" -> "R" + "ound"
```

```python
import fastllm

prompt_input = "[Round 1]"
model_path = "/workspace/models/chatglm2-6B-fastllm-int8.flm"
model = fastllm.create_llm(model_path)
input_ids = model.weight.tokenizer.encode(prompt_input)
input_ids = [int(v) for v in input_ids.to_list()]
print(f"fastllm prompt: {prompt_input}, token id: {input_ids}")  # "Round" -> "Ro" + "und"
```
- Test results

The original tokenizer splits the word "Round" into "R" and "ound", while flm splits it into "Ro" and "und". Likewise on baichuan2, for the input "你是可爱" the original tokenizer produces "你是" and "可爱", whereas the converted flm baichuan2 produces "你", "是可", "爱".
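The "R"+"ound" versus "Ro"+"und" discrepancy is characteristic of BPE-style tokenizers whose merge rules (or merge priorities) differ between two implementations. A minimal self-contained sketch with a toy greedy BPE encoder and two invented merge tables (not the real chatglm2 vocabulary) shows how different merge tables produce different splits of the same word:

```python
def bpe_encode(word, merges):
    """Greedy BPE: repeatedly apply the highest-priority merge present.

    `merges` is a list of (left, right) pairs; earlier entries have
    higher priority, as in a typical BPE merges file.
    """
    tokens = list(word)
    while True:
        # Find the best-ranked adjacent pair that appears in the merge table.
        best = None
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            if pair in merges:
                rank = merges.index(pair)
                if best is None or rank < best[0]:
                    best = (rank, i)
        if best is None:
            return tokens
        _, i = best
        tokens = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]

# Two hypothetical merge tables, for illustration only.
merges_a = [("o", "u"), ("ou", "n"), ("oun", "d")]   # builds "ound"
merges_b = [("R", "o"), ("u", "n"), ("un", "d")]     # builds "Ro" and "und"

print(bpe_encode("Round", merges_a))  # ['R', 'ound']
print(bpe_encode("Round", merges_b))  # ['Ro', 'und']
```

So if a converter reconstructs the vocabulary or merge ordering even slightly differently from the original tokenizer, the token IDs diverge exactly as reported above.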
The chatglm3 issue was caused by model.save() not saving the SentencePiece token weights; the problem does not occur when using torch2flm.toFile(). This has been fixed.
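For context on why losing the SentencePiece weights matters: a unigram-model tokenizer chooses the segmentation that maximizes the sum of per-piece scores, so if the scores are dropped or corrupted during conversion, a different segmentation can win. A minimal Viterbi-style sketch with an invented toy vocabulary (the scores are made up for illustration and are not the real baichuan2 values) reproduces the kind of drift reported above:

```python
def segment(text, vocab):
    """Unigram-LM segmentation: Viterbi over per-piece log-prob scores."""
    n = len(text)
    # best[i] = (score, tokens) for the best segmentation of text[:i]
    best = [(0.0, [])] + [(float("-inf"), None)] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - 4), i):  # assume max piece length 4
            piece = text[j:i]
            if piece in vocab and best[j][1] is not None:
                score = best[j][0] + vocab[piece]
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [piece])
    return best[n][1]

# Toy vocabulary with plausible log-prob scores (invented).
scored = {"你": -4.0, "是": -4.0, "可": -4.0, "爱": -4.0,
          "你是": -3.0, "可爱": -3.0, "是可": -5.0}
# The same vocabulary with corrupted scores, as if the real weights
# were not carried over during conversion.
corrupted = dict(scored, **{"你是": -10.0, "可爱": -10.0, "是可": -2.0})

print(segment("你是可爱", scored))     # ['你是', '可爱']
print(segment("你是可爱", corrupted))  # ['你', '是可', '爱']
```

With the intended scores the segmentation matches the original tokenizer ("你是" + "可爱"); with corrupted scores the same Viterbi search returns the "你" + "是可" + "爱" split seen from the converted model.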