mlc-ai / web-llm

High-performance In-browser LLM Inference Engine

Home Page: https://webllm.mlc.ai

In the Llama-2-7b-chat-hf-q4f32_1-1k model, the number of tokens in the prefill is 36 when inputting 'hello'.

137591 opened this issue

Why does the reported prefill token count differ from the number of tokens actually produced by the tokenizer?
I used the LLaMA 2 tokenizer, and the prompt 'hello' is split into only 2 tokens. However, the prefill count reported by the project is 36 tokens, and experiments confirm that every prefill count is 34 tokens higher than the original prompt's token count. Please explain why.

The extra tokens come from the default system prompt, as shown in mlc-chat-config.json: https://huggingface.co/mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC/blob/main/mlc-chat-config.json#L33-L34, which follows the specification of the official model release.
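
Concretely, the "llama-2" conversation template prepends that default system prompt to every request before tokenization. Here is a rough illustration (hypothetical TypeScript, not the project's actual code; the real system-prompt text lives in the conv_config of mlc-chat-config.json):

  // Hypothetical sketch of how the "llama-2" template assembles the prefill text.
  // The placeholder below stands in for the long default system prompt defined in
  // mlc-chat-config.json; that wrapper plus the system prompt accounts for the
  // ~34 extra prefill tokens on top of the user prompt itself.
  const defaultSystemPrompt = "<long default system prompt from mlc-chat-config.json>";
  const prefillText = `[INST] <<SYS>>\n${defaultSystemPrompt}\n<</SYS>>\n\nhello [/INST]`;
  console.log(prefillText);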

If you'd rather not use a system prompt, try overriding it with an empty string:

  import * as webllm from "@mlc-ai/web-llm";

  const request: webllm.ChatCompletionRequest = {
    messages: [
      // An empty system message overrides the default system prompt from mlc-chat-config.json.
      { role: "system", content: "" },
      { role: "user", content: "Hello" },
    ],
  };
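
For completeness, here is a minimal end-to-end sketch (assuming a recent web-llm release with the OpenAI-style engine API; the model ID and the exact shape of the returned usage field may differ between versions). It builds the same kind of request and prints the reported token usage so you can confirm the prefill shrinks once the default system prompt is overridden:

  import * as webllm from "@mlc-ai/web-llm";

  async function main() {
    // Model ID is an assumption; use whichever Llama-2 variant your web-llm build lists.
    const engine = await webllm.CreateMLCEngine("Llama-2-7b-chat-hf-q4f32_1-1k");

    const reply = await engine.chat.completions.create({
      messages: [
        { role: "system", content: "" },   // empty system prompt
        { role: "user", content: "Hello" },
      ],
    });

    console.log(reply.choices[0].message.content);
    // usage (when present) reports prompt/completion token counts, so the prefill
    // count should now be close to the raw tokenizer count plus the template wrapper.
    console.log(reply.usage);
  }

  main();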

Got it! Thank you very much!