ztxz16 / fastllm

A pure C++ cross-platform LLM acceleration library with Python bindings; ChatGLM-6B-class models reach 10000+ tokens/s on a single GPU; supports GLM, Llama, and MOSS base models; runs smoothly on mobile.


flm tokenizer produces different tokenization results from the original tokenizer

yiguanxian opened this issue · comments

Both chatglm2 and baichuan2 have this problem.

1. Model conversion method

    from fastllm_pytools import llm
    from transformers import AutoTokenizer, AutoModel

    hf_model = "/workspace/chatglm2-6B"

    flm_dtype = "int8"
    model_name = hf_model.split("/")[-1]
    flm_model = f"/workspace/models/{model_name}-fastllm-{flm_dtype}.flm"

    # Load the HF model, convert it to fastllm format, then save the .flm file
    tokenizer = AutoTokenizer.from_pretrained(hf_model, trust_remote_code=True)
    model = AutoModel.from_pretrained(hf_model, trust_remote_code=True).half().cuda()
    model = llm.from_hf(model, tokenizer, dtype=flm_dtype)
    model.save(flm_model)
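
For reference, the saved .flm can also be loaded back through fastllm_pytools; a minimal smoke-test sketch, assuming the llm.model() loader shown in the fastllm README:

    from fastllm_pytools import llm

    # Reload the exported file and run a quick sanity check
    model = llm.model("/workspace/models/chatglm2-6B-fastllm-int8.flm")
    print(model.response("你好"))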

2. Test code

    prompt_input = "[Round 1]"

    from transformers import AutoTokenizer
    model_path = "/workspace/chatglm2-6B"
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    print(f"src prompt: {prompt_input}, token id: {tokenizer.encode(prompt_input)}")  # splits "Round" as "R" + "ound"

    import fastllm
    model_path = "/workspace/models/chatglm2-6B-fastllm-int8.flm"
    model = fastllm.create_llm(model_path)
    input_ids = model.weight.tokenizer.encode(prompt_input)
    input_ids = input_ids.to_list()
    input_ids = [int(v) for v in input_ids]
    print(f"fastllm prompt: {prompt_input}, token id: {input_ids}")  # splits "Round" as "Ro" + "und"
3. Test results

The original tokenizer splits the word "Round" into "R" and "ound", while flm splits it into "Ro" and "und". Likewise, with Baichuan2, for the input "你是可爱" the original tokenizer produces "你是" and "可爱", while the converted baichuan2 flm model produces "你", "是可", and "爱".
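A quick way to inspect the split on the HF side is to map the ids back to their surface tokens; a minimal sketch using the standard transformers convert_ids_to_tokens API (the exact token strings printed will depend on the model's vocabulary):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("/workspace/chatglm2-6B", trust_remote_code=True)
    ids = tokenizer.encode("[Round 1]")
    # Prints one surface form per id, making the "R" + "ound" split visible
    print(tokenizer.convert_ids_to_tokens(ids))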

The chatglm3 problem was caused by model.save() not saving the SentencePiece token weights; there is no such problem when going through torch2flm.toFile(). A fix has been made.
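
Until the fix reaches your version, the torch2flm export path mentioned above can be used directly; a sketch, noting that the function is spelled tofile in some fastllm versions and its signature may differ, so check fastllm_pytools/torch2flm.py in your checkout:

    from fastllm_pytools import torch2flm
    from transformers import AutoTokenizer, AutoModel

    hf_model = "/workspace/chatglm2-6B"
    tokenizer = AutoTokenizer.from_pretrained(hf_model, trust_remote_code=True)
    model = AutoModel.from_pretrained(hf_model, trust_remote_code=True).half()

    # Export through torch2flm, which serializes the SentencePiece vocabulary
    # together with its token weights, avoiding the model.save() issue above
    torch2flm.tofile("/workspace/models/chatglm2-6B-fastllm-int8.flm", model, tokenizer, dtype="int8")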