make_input and model.weight.tokenizer.encode produce extra spaces

Question

yiguanxian opened this issue 6 months ago · comments

Model: baichuan2-13B-chat

Issue 1:
Reproduction code:
In [4]: import pyfastllm
In [5]: model = pyfastllm.create_model("baichuan2-int8.flm")
In [6]: prompt = model.make_input("", 0, "你好")
In [7]: prompt
Out[7]: '<FLM_FIX_TOKEN_195> 你好<FLM_FIX_TOKEN_196>'
Problem: after calling make_input, an extra space appears before "你好".
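
One possible string-level workaround, sketched here in plain Python, is to remove the single space that follows a `<FLM_FIX_TOKEN_n>` marker before the prompt is used. This is not part of fastllm: the helper name `drop_space_after_fix_token` is hypothetical, and the sketch assumes the unwanted space always sits directly after such a marker, as in Out[7]. It would also remove a space the user typed deliberately at the start of their input, and whether it changes the ids returned by encode has not been verified.

```python
import re

# Matches a fix-token marker such as <FLM_FIX_TOKEN_195> followed by one space.
_FIX_TOKEN_SPACE = re.compile(r"(<FLM_FIX_TOKEN_\d+>) ")


def drop_space_after_fix_token(prompt: str) -> str:
    """Remove one space directly after each <FLM_FIX_TOKEN_n> marker."""
    return _FIX_TOKEN_SPACE.sub(r"\1", prompt)


# The string reported in Out[7] above.
raw = "<FLM_FIX_TOKEN_195> 你好<FLM_FIX_TOKEN_196>"
fixed = drop_space_after_fix_token(raw)
print(fixed)
```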

Issue 2:
Reproduction code:
In [7]: model = pyfastllm.create_model("baichuan2-int8.flm")
In [8]: prompt = model.make_input("", 0, "你好")
In [9]: final_prompt = "这是pre prompt" + prompt
In [10]: input_id = model.weight.tokenizer.encode(final_prompt)
In [11]: input_id = input_id.to_list()
In [12]: input_id = [int(v) for v in input_id]
In [13]: input_id
Out[13]: [92311, 2691, 4596, 12909, 195, 100030, 92428, 196]
In [14]: model.weight.tokenizer.decode(input_id)
Out[14]: ' 这是pre prompt<reserved_106> 你好<reserved_107>'
In [15]: model.weight.tokenizer.decode([2691])
Out[15]: '这是'
In [16]: model.weight.tokenizer.decode([92311])
Out[16]: ' '
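Problem: after calling make_input, I prepended a custom pre_prompt ("这是pre prompt") to the prompt and encoded the result with model.weight.tokenizer.encode. The encoded ids start with an extra token, 92311, which is a space (the decode output also shows an extra space before "这是").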
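
Until the tokenizer itself is fixed, a token-level workaround could be to drop the leading space token from the encoded ids. The sketch below is plain Python and does not call fastllm. It assumes that 92311 is the standalone space token, as decode([92311]) returning ' ' suggests, and the helper name `strip_leading_space_token` is hypothetical. It would also discard a space that the input really did start with.

```python
from typing import List

# Assumption: 92311 is the standalone space token for this model,
# based on decode([92311]) returning ' ' in the session above.
SPACE_TOKEN_ID = 92311


def strip_leading_space_token(
    ids: List[int], space_token_id: int = SPACE_TOKEN_ID
) -> List[int]:
    """Return a copy of ids without a single leading space token, if present."""
    if ids and ids[0] == space_token_id:
        return ids[1:]
    return list(ids)


# The ids reported in Out[13] above.
input_id = [92311, 2691, 4596, 12909, 195, 100030, 92428, 196]
cleaned = strip_leading_space_token(input_id)
print(cleaned)
```

Only the first token is examined, so later space tokens that belong to the text are left alone.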

buchidanhuanger · Answer 1 · Tue Jan 16 2024 21:36:29 GMT+0800 (China Standard Time)

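Also, here is why I prepend the pre_prompt to the prompt produced by make_input: putting the pre_prompt into the model at conversion time is inconvenient, because every change to the pre_prompt means converting the model again. So I concatenate it at inference time instead. (A model created with pyfastllm.create_model does not expose a pre_prompt attribute, so I cannot reset it and can only concatenate.)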

Zhiwei35 · Answer 2 · Wed Jan 17 2024 16:11:35 GMT+0800 (China Standard Time)

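+1. When I use fastllm's tokenizer encode on its own with an English sentence as input, it also produces an extra space. I am not sure whether this affects the inference results.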

TylunasLi · Answer 3 · Mon Jan 22 2024 23:51:34 GMT+0800 (China Standard Time)

According to the definition in sentencepiece_model.proto:

  // Adds dummy whitespace at the beginning of text in order to
  // treat "world" in "world" and "hello world" in the same way.
  optional bool add_dummy_prefix = 3 [default = true];

The add_dummy_prefix value controls whether a space is added at the beginning of the input sequence.
This value differs between models. For example:

ChatGLM3-6B:

>>> import sentencepiece.sentencepiece_model_pb2 as model
>>> m = model.ModelProto()
>>> m.ParseFromString(open('tokenizer.model', 'rb').read())
1018370
>>> m.normalizer_spec.add_dummy_prefix
True

Baichuan2-7B-Chat:

>>> import sentencepiece.sentencepiece_model_pb2 as model
>>> m = model.ModelProto()
>>> m.ParseFromString(open('Baichuan2-7B-Chat/tokenizer.model', 'rb').read())
2001107
>>> m.normalizer_spec.add_dummy_prefix
False

Currently, fastllm does not support reading this value.
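
To make the effect of add_dummy_prefix concrete, here is a toy illustration in plain Python of the whitespace step only. It is not SentencePiece's actual normalizer, which also applies Unicode normalization and optional whitespace collapsing, and the function name `to_piece_form` is hypothetical. With the flag on, the text gains a leading "▁" (U+2581, the symbol SentencePiece uses for a space), which would be consistent with the extra leading space token reported above if a tokenizer applies the prefix to a model whose tokenizer.model sets it to False.

```python
# U+2581 is the meta symbol SentencePiece uses to represent a space.
WS = "\u2581"


def to_piece_form(text: str, add_dummy_prefix: bool = True) -> str:
    """Toy model of the whitespace handling only, not the real normalizer."""
    if add_dummy_prefix and text:
        text = " " + text  # the dummy whitespace described in the proto comment
    return text.replace(" ", WS)


# Compare both settings on the examples from the proto comment.
for flag in (True, False):
    print(flag, to_piece_form("world", flag), to_piece_form("hello world", flag))
```

With the prefix enabled, "world" takes the same form on its own as it does inside "hello world", which is the motivation given in the proto comment.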