mlc-ai / web-llm

High-performance In-browser LLM Inference Engine

Home Page: https://webllm.mlc.ai

Gibberish output with `Llama-2-7b-chat-hf-q4f32_1`

beaufortfrancois opened this issue

Chrome Version: 125.0.6283.3
OS: ChromeOS
GPU: Intel(R) Graphics (ADL GT2) - Intel open-source Mesa driver: Mesa 23.3.0 (git-5cb3f1e4fa)
Dawn Backend: Vulkan

What steps will reproduce the problem?

  1. Go to https://webllm.mlc.ai/#chat-demo
  2. Select Llama-2-7b-chat-hf-q4f32_1
  3. Enter "What color is the dress?"

What is the expected result?
Some text that at least makes sense.

What happens instead?
Some gibberish text appears.
DevTools JavaScript console contains the following logs:

llm_chat.ts:150 Using prefillChunkSize:  1024
llm_chat.ts:180 Using maxWindowLength:  4096
llm_chat.ts:202 Using Paged KVCache
15vkAllocateMemory failed with VK_ERROR_OUT_OF_DEVICE_MEMORY
    at CheckVkOOMThenSuccessImpl (..<URL>)

15vkAllocateMemory failed with VK_ERROR_OUT_OF_DEVICE_MEMORY
    at CheckVkOOMThenSuccessImpl (..<URL>)

Then I enter "What color is the dress?"

97[Invalid Buffer (unlabeled)] is invalid.
 - While validating entries[0] as a Buffer.
Expected entry layout: { type: BufferBindingType::Storage, hasDynamicOffset: 0, minBindingSize: 0 }
 - While validating [BindGroupDescriptor] against [BindGroupLayout (unlabeled)]
 - While calling [Device].CreateBindGroup([BindGroupDescriptor]).

162[Invalid BindGroup (unlabeled)] is invalid.
 - While encoding [ComputePassEncoder (unlabeled)].SetBindGroup(0, [Invalid BindGroup (unlabeled)], 0, ...).

161[Invalid CommandBuffer] is invalid.
 - While calling [Queue].Submit([[Invalid CommandBuffer]])

97[Invalid Buffer (unlabeled)] is invalid.
 - While validating entries[0] as a Buffer.
Expected entry layout: { type: BufferBindingType::Storage, hasDynamicOffset: 0, minBindingSize: 0 }
 - While validating [BindGroupDescriptor] against [BindGroupLayout (unlabeled)]
 - While calling [Device].CreateBindGroup([BindGroupDescriptor]).

65[Invalid Buffer (unlabeled)] is invalid.
 - While validating entries[0] as a Buffer.
Expected entry layout: { type: BufferBindingType::ReadOnlyStorage, hasDynamicOffset: 0, minBindingSize: 0 }
 - While validating [BindGroupDescriptor] against [BindGroupLayout (unlabeled)]
 - While calling [Device].CreateBindGroup([BindGroupDescriptor]).

162[Invalid BindGroup (unlabeled)] is invalid.
 - While encoding [ComputePassEncoder (unlabeled)].SetBindGroup(0, [Invalid BindGroup (unlabeled)], 0, ...).

161[Invalid CommandBuffer] is invalid.
 - While calling [Queue].Submit([[Invalid CommandBuffer]])

65[Invalid Buffer (unlabeled)] is invalid.
 - While validating entries[0] as a Buffer.
Expected entry layout: { type: BufferBindingType::ReadOnlyStorage, hasDynamicOffset: 0, minBindingSize: 0 }
 - While validating [BindGroupDescriptor] against [BindGroupLayout (unlabeled)]
 - While calling [Device].CreateBindGroup([BindGroupDescriptor]).

WebGPU: too many warnings, no more warnings will be reported to the console for this GPUDevice.

Note
It does work properly with the following f16 variants: Llama-2-7b-chat-hf-q4f16_1 and Llama-2-7b-chat-hf-q4f16_1-1k
I can also reproduce the issue with Llama-2-13b-chat-hf-q4f16_1

image

Thanks for reporting the issue; this seems to be an out-of-memory issue (f32 KV cache in one case, 13B params in the other): llama-2-7b-q4f32_1 requires roughly 9 GB, while 13b-q4f16_1 requires roughly 10 GB.

How much RAM does Intel(R) Graphics (ADL GT2) have? Is it 16 GB? It might be a bit hard to catch the OOM error as we've seen earlier in #209.

A similar VK_ERROR_OUT_OF_DEVICE_MEMORY issue was reported in mlc-llm: mlc-ai/mlc-llm#974

I think we should catch GPU out-of-memory errors like we tried previously in #209 (comment).

FYI @greggman, I was not able to catch them with https://chromewebstore.google.com/detail/webgpu-dev-extension/gkeaijopdhfempknmaggbjbedicopjgm either.

EDIT: This is because the extension doesn't support workers.

FYI https://webgpureport.org says the integrated GPU's memoryHeaps entry is [ size: 8269717504, properties: DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT | HOST_UNCACHED | HOST_CACHED ], which suggests, I believe, that this ChromeOS device can use up to 7.7 GB of memory.
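As a quick sanity check (my own arithmetic, not from the report), the heap size above converts to binary gigabytes like this:

```typescript
// Convert the Vulkan memory heap size reported by webgpureport.org
// from bytes to binary gigabytes (GiB).
const heapSizeBytes = 8269717504;
const heapSizeGiB = heapSizeBytes / (1024 ** 3);
console.log(heapSizeGiB.toFixed(1)); // ≈ 7.7
```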

@CharlieFRuan Are out-of-memory errors captured somewhere? In WebLLM or Apache TVM?

I know TVM can capture OOM for other backends (e.g. for Vulkan here). I'm not too sure what would be the case for webgpu. I'll make another attempt this week; thanks for the pointers!

I think web-llm would need its own mechanism. There are a few things to check:

  • First of all, check whether WebGPU buffer creation can throw an error, or be caught by an error scope (based on our previous trial, it seems this is not always the case)
  • One approach might be to have TVM's GPU adapter track the buffers allocated/deallocated, and throw once usage reaches a cap
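The second bullet could look roughly like this (a hypothetical sketch; the name `TrackedAllocator` and the cap value are mine, not TVM's):

```typescript
// Hypothetical sketch: count bytes handed out by createBuffer-like calls
// and fail fast once a configured cap is hit, instead of waiting for the
// driver to return VK_ERROR_OUT_OF_DEVICE_MEMORY.
class TrackedAllocator {
  private allocatedBytes = 0;

  constructor(private readonly capBytes: number) {}

  alloc(sizeBytes: number): void {
    if (this.allocatedBytes + sizeBytes > this.capBytes) {
      throw new Error(
        `GPU memory cap exceeded: ${this.allocatedBytes + sizeBytes} > ${this.capBytes} bytes`
      );
    }
    this.allocatedBytes += sizeBytes;
  }

  free(sizeBytes: number): void {
    this.allocatedBytes -= sizeBytes;
  }

  get used(): number {
    return this.allocatedBytes;
  }
}
```

The cap would presumably be derived from the reported heap size (minus some headroom), so the engine can surface a readable error before the driver ever sees an over-large allocation.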

@CharlieFRuan Did you have a chance to have a look at this?

@beaufortfrancois I tried to catch the error from CreateBuffer() by adding popErrorScope() in the three places it is called in https://github.com/apache/tvm/blob/main/web/src/webgpu.ts -- no luck with that.

So I instead added a 1024 context length version model for llama-2-7b-q4f32_1, and made them the default choices in the demo page. This lowers the VRAM usage by ~3GB for llama-2 q4f32. I also added a note about the -1k models (bottom of screenshot) via #377.

image

@beaufortfrancois I tried to catch the error with CreateBuffer() by adding popErrorScope() in the three places this is called in https://github.com/apache/tvm/blob/main/web/src/webgpu.ts -- no luck with that.

According to the #356 (comment) logs, it looks like the errors happen when validating entries in createBindGroup(), not after createBuffer(). Does that help?

Did you try uncapturederror as well?

device.onuncapturederror = ({ error }) => {
  console.log(error);
};

So I instead added a 1024 context length version model for llama-2-7b-q4f32_1, and made them the default choices in the demo page. This lowers the VRAM for ~3GB for llama-2 q4f32. I also added a note about the -1k models (bottom of screenshot) via #377.

That's useful. Thanks!

(gentle ping)

@CharlieFRuan Did you have a chance to look at this?

Sorry for the delay, will take a look tonight

Quick update: it does seem that the error can be caught! Not sure if I did something wrong earlier or there have been some updates on the WebGPU side.

Since my laptop does not run into OOM for most models, to reproduce the error I set maxTotalSeqLen to an arbitrarily large number (909600) instead of default values like 4k or 1k, which forces the engine to allocate a very large KVCache. I'm not sure whether this is equivalent to loading a model that is too large for the device, but it should be quite similar.
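A back-of-the-envelope estimate shows why that forces OOM (my own arithmetic, assuming Llama-2-7B dimensions: 32 layers, 4096 hidden size, f32 = 4 bytes per element):

```typescript
// Rough f32 KV cache size for Llama-2-7B: K and V each store one
// hidden-size vector per layer per token (assumed dims, see lead-in).
function kvCacheBytes(seqLen: number, layers = 32, hidden = 4096, bytesPerElem = 4): number {
  return 2 * layers * hidden * bytesPerElem * seqLen; // 2 for K and V
}

const GiB = 1024 ** 3;
console.log((kvCacheBytes(4096) / GiB).toFixed(1));   // 4.0 GiB at the 4k default
console.log((kvCacheBytes(909600) / GiB).toFixed(0)); // ≈ 888 GiB — guaranteed OOM
```

This also roughly matches the ~3GB saving reported above for dropping the q4f32 context length from 4096 to 1024.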

Upon finishing loading the model, the engine will allocate the KVCache, and I see:
image

This log corresponds to the push and pop of ErrorScope I added here in tvm/web:
image
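The push/pop pattern referenced above presumably looks something like the standard WebGPU error-scope idiom (a hedged sketch, not the actual tvm/web diff; `DeviceLike` and `createBufferChecked` are my names, typed structurally so the snippet is not tied to browser GPU types):

```typescript
// Sketch: wrap a buffer allocation in a WebGPU error scope so an
// out-of-memory failure surfaces as an explicit error rather than a
// silent invalid buffer. Structural types stand in for the real GPU* types.
interface DeviceLike {
  pushErrorScope(filter: "out-of-memory" | "validation"): void;
  popErrorScope(): Promise<{ message: string } | null>;
  createBuffer(desc: { size: number; usage: number }): unknown;
}

async function createBufferChecked(
  device: DeviceLike,
  desc: { size: number; usage: number }
): Promise<unknown> {
  device.pushErrorScope("out-of-memory");
  const buffer = device.createBuffer(desc);
  const error = await device.popErrorScope(); // resolves null if allocation succeeded
  if (error !== null) {
    throw new Error(`GPU buffer allocation failed: ${error.message}`);
  }
  return buffer;
}
```

Note that popErrorScope() is asynchronous, which is likely why a naive synchronous check around CreateBuffer() missed the error earlier.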

Then, after ignoring the error and starting to chat, we hit the uncaptured error handler you suggested:
image

I will refine the handling and upstream the changes after verifying the errors can indeed be reliably caught. Should have another update by the end of this week. Thank you so much for the help!