mlc-ai / web-llm

High-performance In-browser LLM Inference Engine

Home Page: https://webllm.mlc.ai

In the Llama-2-7b-chat-hf-q4f32_1-1k model, the number of tokens in the prefill is 36 when inputting 'hello'.

137591 opened this issue

Why does the reported prefill token count differ from the number of tokens actually produced by the tokenizer?
I used the LLaMA 2 tokenizer, and the prompt 'hello' is split into only 2 tokens. However, the prefill count reported by the project is 36 tokens, and experiments confirm that every prefill count is 34 tokens higher than the original prompt's token count. Please explain why.

The extra tokens come from the default system prompt, as shown in mlc-chat-config.json: https://huggingface.co/mlc-ai/Llama-2-7b-chat-hf-q4f16_1-MLC/blob/main/mlc-chat-config.json#L33-L34, which follows the specification of the official model release.
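
Concretely, the "llama-2" conversation template prepends that default system prompt to every request before tokenization. Here is a rough illustration (hypothetical TypeScript, not the project's actual code; the real system-prompt text lives in the conv_config of mlc-chat-config.json):

  // Hypothetical sketch of how the "llama-2" template assembles the prefill text.
  // The placeholder below stands in for the long default system prompt defined in
  // mlc-chat-config.json; that wrapper plus the system prompt accounts for the
  // ~34 extra prefill tokens on top of the user prompt itself.
  const defaultSystemPrompt = "<long default system prompt from mlc-chat-config.json>";
  const prefillText = `[INST] <<SYS>>\n${defaultSystemPrompt}\n<</SYS>>\n\nhello [/INST]`;
  console.log(prefillText);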

If you'd rather not use a system prompt, try overriding it with an empty string:

  import * as webllm from "@mlc-ai/web-llm";

  const request: webllm.ChatCompletionRequest = {
    messages: [
      // An empty system message overrides the default system prompt from mlc-chat-config.json.
      { role: "system", content: "" },
      { role: "user", content: "Hello" },
    ],
  };
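
For completeness, here is a minimal end-to-end sketch (assuming a recent web-llm release with the OpenAI-style engine API; the model ID and the exact shape of the returned usage field may differ between versions). It builds the same kind of request and prints the reported token usage so you can confirm the prefill shrinks once the default system prompt is overridden:

  import * as webllm from "@mlc-ai/web-llm";

  async function main() {
    // Model ID is an assumption; use whichever Llama-2 variant your web-llm build lists.
    const engine = await webllm.CreateMLCEngine("Llama-2-7b-chat-hf-q4f32_1-1k");

    const reply = await engine.chat.completions.create({
      messages: [
        { role: "system", content: "" },   // empty system prompt
        { role: "user", content: "Hello" },
      ],
    });

    console.log(reply.choices[0].message.content);
    // usage (when present) reports prompt/completion token counts, so the prefill
    // count should now be close to the raw tokenizer count plus the template wrapper.
    console.log(reply.usage);
  }

  main();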

Got it! Thank you very much!