support >= 4GB SYCL compute buffer size for longer context length
ytliew82 opened this issue
Describe the bug
The SYCL Unified Shared Memory (USM) device memory type has a maximum size constraint of 4 GB. ipex-llm reports an error if the calculated KV cache size is more than 4 GB.
How to reproduce
Computer set up for iGPU-only inference with >= 32 GB RAM, so no allocation issue is expected with a larger context size.
Encountered this issue with the Gemma-3 model.
Steps to reproduce the error:
- Set the -c argument to a small context size.
- Observe the reported SYCL buffer size; it is safe if it is less than 4 GB.
- Increase the -c argument until the expected buffer size is larger than 4 GB. The reported memory allocation error then occurs.
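The steps above amount to growing the context until the KV cache crosses 4 GB. A rough sketch of that arithmetic, assuming a plain full-attention KV cache (one K and one V tensor per layer, f16 elements). The model dimensions below are hypothetical placeholders, not the actual Gemma-3 values, and the real buffer size reported by the backend may differ (for example with sliding-window layers or a quantized cache).

```python
GIB = 1024 ** 3
LIMIT_BYTES = 4 * GIB  # the 4 GB constraint described in this issue

# Hypothetical model dimensions, for illustration only (not Gemma-3's real config).
HYPOTHETICAL_MODEL = dict(n_layers=32, n_kv_heads=8, head_dim=128)


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate the KV cache size: K and V, per layer, per token."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


def max_ctx_under_limit(limit_bytes, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Largest context size whose estimated KV cache is at or below limit_bytes."""
    per_token = kv_cache_bytes(1, n_layers, n_kv_heads, head_dim, bytes_per_elem)
    return limit_bytes // per_token


if __name__ == "__main__":
    for n_ctx in (8192, 16384, 49152, 65536):
        size = kv_cache_bytes(n_ctx, **HYPOTHETICAL_MODEL)
        status = "exceeds 4 GiB" if size > LIMIT_BYTES else "fits"
        print(f"-c {n_ctx}: {size / GIB:.2f} GiB ({status})")
    print("max context:", max_ctx_under_limit(LIMIT_BYTES, **HYPOTHETICAL_MODEL))
```

With a real model, the per-layer values would come from the model metadata, and the -c value at which the error appears can be compared against this estimate.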
Additional context
I am running a Gemma 3 model with llama server, so I expect a similar issue with MoE models.
Declaring multiple SYCL USM device instances might overcome this constraint and allow more than 4 GB of total buffer size for longer context lengths (a few thousand tokens and above, and the case with parallel enabled).
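As a sketch of the bookkeeping behind that suggestion, the snippet below splits a requested buffer size into several pieces that each stay within the limit. This is only the size planning; `plan_chunks` is a hypothetical helper, and whether the ipex-llm SYCL backend can actually address a KV cache spread over several allocations is an open assumption.

```python
GIB = 1024 ** 3


def plan_chunks(total_bytes, max_chunk_bytes):
    """Split total_bytes into chunk sizes that are each <= max_chunk_bytes."""
    if total_bytes < 0 or max_chunk_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and max_chunk_bytes > 0")
    full_chunks, remainder = divmod(total_bytes, max_chunk_bytes)
    chunks = [max_chunk_bytes] * full_chunks
    if remainder:
        chunks.append(remainder)  # last, smaller chunk
    return chunks


if __name__ == "__main__":
    # Example: a 10 GiB KV cache split under a 4 GiB per-allocation limit.
    for i, size in enumerate(plan_chunks(10 * GIB, 4 * GIB)):
        print(f"allocation {i}: {size / GIB:.0f} GiB")
```

In practice a split along layer boundaries would probably be more natural than a raw byte split, since each layer's K and V tensors are accessed independently.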
Hi ytliew82,
We previously encountered the same error with Gemma-3 4B on ARC, while Gemma-3 12B seemed to work fine. Are you using the 4B model in your test?
I tested with Gemma-3 4B and 12B; both give the same error about not fitting into the device buffer.
As a workaround, I currently run with CPU-only inference and limit the -ngl argument to fit into the 4 GB device buffer.
Anyway, based on my understanding, the USM host/device/shared allocation types mostly apply to dGPUs.
https://www.intel.com/content/www/us/en/docs/oneapi/optimization-guide-gpu/2025-0/unified-shared-memory-allocations.html#USM-ALLOCATION
Since the iGPU shares the L3 cache with the CPU, could we optionally use a shared buffer instead of a device buffer, for example when initialized with --device IGPU?
Hi ytliew82,
Thank you for the information! We'll provide updates once it's supported.
+1