mlc-ai / web-llm

High-performance In-browser LLM Inference Engine

Home Page: https://webllm.mlc.ai

Gibberish output with `Llama-2-7b-chat-hf-q4f32_1`

beaufortfrancois opened this issue

Chrome Version: 125.0.6283.3
OS: ChromeOS
GPU: Intel(R) Graphics (ADL GT2) - Intel open-source Mesa driver: Mesa 23.3.0 (git-5cb3f1e4fa)
Dawn Backend: Vulkan

What steps will reproduce the problem?

  1. Go to https://webllm.mlc.ai/#chat-demo
  2. Select Llama-2-7b-chat-hf-q4f32_1
  3. Enter "What color is the dress?"

What is the expected result?
Some text that at least makes sense.

What happens instead?
Some gibberish text appears.
DevTools JavaScript console contains the following logs:

llm_chat.ts:150 Using prefillChunkSize:  1024
llm_chat.ts:180 Using maxWindowLength:  4096
llm_chat.ts:202 Using Paged KVCache
15vkAllocateMemory failed with VK_ERROR_OUT_OF_DEVICE_MEMORY
    at CheckVkOOMThenSuccessImpl (..<URL>)

15vkAllocateMemory failed with VK_ERROR_OUT_OF_DEVICE_MEMORY
    at CheckVkOOMThenSuccessImpl (..<URL>)

Then I enter "What color is the dress?"

97[Invalid Buffer (unlabeled)] is invalid.
 - While validating entries[0] as a Buffer.
Expected entry layout: { type: BufferBindingType::Storage, hasDynamicOffset: 0, minBindingSize: 0 }
 - While validating [BindGroupDescriptor] against [BindGroupLayout (unlabeled)]
 - While calling [Device].CreateBindGroup([BindGroupDescriptor]).

162[Invalid BindGroup (unlabeled)] is invalid.
 - While encoding [ComputePassEncoder (unlabeled)].SetBindGroup(0, [Invalid BindGroup (unlabeled)], 0, ...).

161[Invalid CommandBuffer] is invalid.
 - While calling [Queue].Submit([[Invalid CommandBuffer]])

97[Invalid Buffer (unlabeled)] is invalid.
 - While validating entries[0] as a Buffer.
Expected entry layout: { type: BufferBindingType::Storage, hasDynamicOffset: 0, minBindingSize: 0 }
 - While validating [BindGroupDescriptor] against [BindGroupLayout (unlabeled)]
 - While calling [Device].CreateBindGroup([BindGroupDescriptor]).

65[Invalid Buffer (unlabeled)] is invalid.
 - While validating entries[0] as a Buffer.
Expected entry layout: { type: BufferBindingType::ReadOnlyStorage, hasDynamicOffset: 0, minBindingSize: 0 }
 - While validating [BindGroupDescriptor] against [BindGroupLayout (unlabeled)]
 - While calling [Device].CreateBindGroup([BindGroupDescriptor]).

162[Invalid BindGroup (unlabeled)] is invalid.
 - While encoding [ComputePassEncoder (unlabeled)].SetBindGroup(0, [Invalid BindGroup (unlabeled)], 0, ...).

161[Invalid CommandBuffer] is invalid.
 - While calling [Queue].Submit([[Invalid CommandBuffer]])

65[Invalid Buffer (unlabeled)] is invalid.
 - While validating entries[0] as a Buffer.
Expected entry layout: { type: BufferBindingType::ReadOnlyStorage, hasDynamicOffset: 0, minBindingSize: 0 }
 - While validating [BindGroupDescriptor] against [BindGroupLayout (unlabeled)]
 - While calling [Device].CreateBindGroup([BindGroupDescriptor]).

WebGPU: too many warnings, no more warnings will be reported to the console for this GPUDevice.

Note
It does work properly with the following f16 variants: Llama-2-7b-chat-hf-q4f16_1 and Llama-2-7b-chat-hf-q4f16_1-1k
I can also reproduce the issue with Llama-2-13b-chat-hf-q4f16_1

image

Thanks for reporting the issue; this seems to be an out-of-memory issue (f32 KV cache in one case, 13B params in the other): llama-2-7b-q4f32_1 requires roughly 9 GB, while 13b-q4f16_1 requires roughly 10 GB.

How much RAM does Intel(R) Graphics (ADL GT2) have? Is it 16 GB? It might be a bit hard to catch the OOM error as we've seen earlier in #209.

A similar VK_ERROR_OUT_OF_DEVICE_MEMORY issue was reported in mlc-llm: mlc-ai/mlc-llm#974

I think we should catch GPU out-of-memory errors like we tried previously in #209 (comment).

FYI @greggman, I was not able to catch them with https://chromewebstore.google.com/detail/webgpu-dev-extension/gkeaijopdhfempknmaggbjbedicopjgm either.

EDIT: This is because the extension doesn't support workers.

FYI https://webgpureport.org says the integrated GPU's memoryHeaps entry is [ size: 8269717504, properties: DEVICE_LOCAL | HOST_VISIBLE | HOST_COHERENT | HOST_UNCACHED | HOST_CACHED ], which suggests, I believe, that this ChromeOS device can use up to 7.7 GB of memory.
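As a quick sanity check (my own arithmetic, not from the report), the heap size above converts to binary gigabytes like this:

```typescript
// Convert the Vulkan memory heap size reported by webgpureport.org
// from bytes to binary gigabytes (GiB).
const heapSizeBytes = 8269717504;
const heapSizeGiB = heapSizeBytes / (1024 ** 3);
console.log(heapSizeGiB.toFixed(1)); // ≈ 7.7
```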

@CharlieFRuan Are out-of-memory errors captured somewhere? In WebLLM or Apache TVM?

I know TVM can capture OOM for other backends (e.g. for Vulkan here). I'm not too sure what would be the case for webgpu. I'll make another attempt this week; thanks for the pointers!

I think web-llm would need its own mechanism. There are a few things to check:

  • First of all, check whether WebGPU buffer creation can throw an error, or be caught by an error scope (based on our previous trial, it seems this is not always the case)
  • One approach might be to have TVM's GPU adapter track the buffers allocated/deallocated, and throw once usage reaches a cap
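The second bullet could look roughly like this (a hypothetical sketch; the name `TrackedAllocator` and the cap value are mine, not TVM's):

```typescript
// Hypothetical sketch: count bytes handed out by createBuffer-like calls
// and fail fast once a configured cap is hit, instead of waiting for the
// driver to return VK_ERROR_OUT_OF_DEVICE_MEMORY.
class TrackedAllocator {
  private allocatedBytes = 0;

  constructor(private readonly capBytes: number) {}

  alloc(sizeBytes: number): void {
    if (this.allocatedBytes + sizeBytes > this.capBytes) {
      throw new Error(
        `GPU memory cap exceeded: ${this.allocatedBytes + sizeBytes} > ${this.capBytes} bytes`
      );
    }
    this.allocatedBytes += sizeBytes;
  }

  free(sizeBytes: number): void {
    this.allocatedBytes -= sizeBytes;
  }

  get used(): number {
    return this.allocatedBytes;
  }
}
```

The cap would presumably be derived from the reported heap size (minus some headroom), so the engine can surface a readable error before the driver ever sees an over-large allocation.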

@CharlieFRuan Did you have a chance to have a look at this?

@beaufortfrancois I tried to catch the error from CreateBuffer() by adding popErrorScope() in the three places it is called in https://github.com/apache/tvm/blob/main/web/src/webgpu.ts -- no luck with that.

So I instead added a 1024 context length version model for llama-2-7b-q4f32_1, and made them the default choices in the demo page. This lowers the VRAM usage by ~3GB for llama-2 q4f32. I also added a note about the -1k models (bottom of screenshot) via #377.

image

@beaufortfrancois I tried to catch the error with CreateBuffer() by adding popErrorScope() in the three places this is called in https://github.com/apache/tvm/blob/main/web/src/webgpu.ts -- no luck with that.

According to the #356 (comment) logs, it looks like the errors happen when validating entries in createBindGroup(), not after createBuffer(). Does that help?

Did you try uncapturederror as well?

device.onuncapturederror = ({ error }) => {
  console.log(error);
};

So I instead added a 1024 context length version model for llama-2-7b-q4f32_1, and made them the default choices in the demo page. This lowers the VRAM for ~3GB for llama-2 q4f32. I also added a note about the -1k models (bottom of screenshot) via #377.

That's useful. Thanks!

(gentle ping)

@CharlieFRuan Did you have a chance to look at this?

Sorry for the delay, will take a look tonight

Quick update: it does seem that the error can be caught! Not sure if I did something wrong earlier or there have been some updates on the WebGPU side.

Since my laptop does not run into OOM for most models, to reproduce the error I set maxTotalSeqLen to an arbitrarily large number (909600) instead of default values like 4k or 1k, which forces the engine to allocate a very large KVCache. I'm not sure whether this is equivalent to loading a model that is too large for the device, but it should be quite similar.
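A back-of-the-envelope estimate shows why that forces OOM (my own arithmetic, assuming Llama-2-7B dimensions: 32 layers, 4096 hidden size, f32 = 4 bytes per element):

```typescript
// Rough f32 KV cache size for Llama-2-7B: K and V each store one
// hidden-size vector per layer per token (assumed dims, see lead-in).
function kvCacheBytes(seqLen: number, layers = 32, hidden = 4096, bytesPerElem = 4): number {
  return 2 * layers * hidden * bytesPerElem * seqLen; // 2 for K and V
}

const GiB = 1024 ** 3;
console.log((kvCacheBytes(4096) / GiB).toFixed(1));   // 4.0 GiB at the 4k default
console.log((kvCacheBytes(909600) / GiB).toFixed(0)); // ≈ 888 GiB — guaranteed OOM
```

This also roughly matches the ~3GB saving reported above for dropping the q4f32 context length from 4096 to 1024.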

Upon finishing loading the model, the engine will allocate the KVCache, and I see:
image

This log corresponds to the push and pop of ErrorScope I added here in tvm/web:
image
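The push/pop pattern referenced above presumably looks something like the standard WebGPU error-scope idiom (a hedged sketch, not the actual tvm/web diff; `DeviceLike` and `createBufferChecked` are my names, typed structurally so the snippet is not tied to browser GPU types):

```typescript
// Sketch: wrap a buffer allocation in a WebGPU error scope so an
// out-of-memory failure surfaces as an explicit error rather than a
// silent invalid buffer. Structural types stand in for the real GPU* types.
interface DeviceLike {
  pushErrorScope(filter: "out-of-memory" | "validation"): void;
  popErrorScope(): Promise<{ message: string } | null>;
  createBuffer(desc: { size: number; usage: number }): unknown;
}

async function createBufferChecked(
  device: DeviceLike,
  desc: { size: number; usage: number }
): Promise<unknown> {
  device.pushErrorScope("out-of-memory");
  const buffer = device.createBuffer(desc);
  const error = await device.popErrorScope(); // resolves null if allocation succeeded
  if (error !== null) {
    throw new Error(`GPU buffer allocation failed: ${error.message}`);
  }
  return buffer;
}
```

Note that popErrorScope() is asynchronous, which is likely why a naive synchronous check around CreateBuffer() missed the error earlier.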

Then, after ignoring the error and starting to chat, we hit the uncaptured error handler you suggested:
image

I will refine the handling and upstream the changes after verifying the errors can indeed be reliably caught. Should have another update by the end of this week. Thank you so much for the help!