Segmentation fault while running Whisper on Arc
Ruoyu-y opened this issue
Configuration:
OS: Ubuntu 24.04
CPU: 12th Gen Intel(R) Core(TM) i9-12900K
Memory: 16G
GPU: 04:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08)
software:
torch 2.1.0a0+cxx11.abi
intel-extension-for-pytorch 2.1.10+xpu
ipex-llm 2.2.0b20250322
bigdl-core-xe-21 2.6.0b20250322
Issue:
Running Whisper with `python ./recognize.py` fails with a segmentation fault.
Logs:
$ python recognize.py
/home/cloud/ruoyu/miniforge3/envs/llm/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
/home/cloud/ruoyu/miniforge3/envs/llm/lib/python3.11/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
2025-03-25 09:18:43,572 - INFO - intel_extension_for_pytorch auto imported
2025-03-25 09:18:43,855 - INFO - PyTorch version 2.1.0a0+cxx11.abi available.
step1:
/home/cloud/ruoyu/miniforge3/envs/llm/lib/python3.11/site-packages/huggingface_hub/file_download.py:797: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
2025-03-25 09:18:46,419 - INFO - Converting the current model to sym_int4 format......
LIBXSMM_VERSION: main_stable-1.17-3651 (25693763)
LIBXSMM_TARGET: adl [12th Gen Intel(R) Core(TM) i9-12900K]
Registry and code: 13 MB
Command: python recognize.py
Uptime: 3.432546 s
Segmentation fault
Any hints on this issue, or a recommended configuration?
Hi,
May I ask whether this segmentation fault occurs only with Whisper, or also when running other models from https://github.com/intel/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM ?
Also, you may use our script to check the environment so that we can better help diagnose the issue: https://github.com/intel/ipex-llm/tree/main/python/llm/scripts#usage
Other LLMs also return the segmentation fault error, but everything works inside a Docker container. Here's the output of the environment check script:
$ bash env-check.sh
-----------------------------------------------------------------
PYTHON_VERSION=3.11.11
-----------------------------------------------------------------
transformers=4.36.2
-----------------------------------------------------------------
torch=2.1.0a0+cxx11.abi
-----------------------------------------------------------------
ipex-llm Version: 2.2.0b20250322
-----------------------------------------------------------------
ipex=2.1.10+xpu
-----------------------------------------------------------------
CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: GenuineIntel
Model name: 12th Gen Intel(R) Core(TM) i9-12900K
CPU family: 6
Model: 151
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 1
Stepping: 2
CPU(s) scaling MHz: 22%
CPU max MHz: 5200.0000
CPU min MHz: 800.0000
-----------------------------------------------------------------
Total CPU Memory: 15.3286 GB
Memory Type: DDR5
-----------------------------------------------------------------
Operating System:
Ubuntu 24.04 LTS \n \l
-----------------------------------------------------------------
Linux cloudgpu 6.8.0-52-generic #53-Ubuntu SMP PREEMPT_DYNAMIC Sat Jan 11 00:06:25 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
Version: 1.2.39.20240906
Build ID: 11f3c29a
Service:
Version: 1.2.39.20240906
Build ID: 11f3c29a
Level Zero Version: 1.17.0
-----------------------------------------------------------------
Driver Version 2023.16.12.0.12_195853.xmain-hotfix
Driver Version 2023.16.12.0.12_195853.xmain-hotfix
-----------------------------------------------------------------
Driver related package version:
ii intel-fw-gpu 2024.17.5-329~22.04 all Firmware package for Intel integrated and discrete GPUs
ii intel-level-zero-gpu 1.3.29735.27-914~22.04 amd64 Intel(R) Graphics Compute Runtime for oneAPI Level Zero.
ii intel-level-zero-gpu-raytracing 1.0.0-60~u22.04 amd64 Level Zero Ray Tracing Support library
-----------------------------------------------------------------
igpu not detected
-----------------------------------------------------------------
xpu-smi is properly installed.
-----------------------------------------------------------------
No device discovered
GPU0 Memory ize=256M
-----------------------------------------------------------------
04:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
Subsystem: Shenzhen Gunnir Technology Development Co., Ltd DG2 [Arc A770]
Flags: bus master, fast devsel, latency 0, IRQ 234, IOMMU group 20
Memory at 86000000 (64-bit, non-prefetchable) [size=16M]
Memory at 4050000000 (64-bit, prefetchable) [size=256M]
Expansion ROM at 87000000 [disabled] [size=2M]
Capabilities: <access denied>
Kernel driver in use: i915
Kernel modules: i915, xe
-----------------------------------------------------------------
Is there anything wrong with the configuration?
To provide more details: on the same machine, I can run the inference service in Docker following the guide https://github.com/intel/ipex-llm/blob/main/docs/mddocs/DockerGuides/vllm_docker_quickstart.md. But I cannot run Whisper or the other LLMs under the python/llm/example/GPU/HuggingFace/LLM folder on my host. I also tried running the Whisper Python file inside a Docker container brought up following the previous guide, and it failed as well. Please help take a look @hkvision, thanks a lot!
Hi, we checked your environment; the following part might indicate an issue:
-----------------------------------------------------------------
No device discovered
GPU0 Memory ize=256M
Could you use `sycl-ls` and `xpu-smi discovery` to confirm whether the Arc device is properly detected? Thanks!
xpu-smi discovery
`xpu-smi discovery` returns "No device discovered", but I can find the Arc card using lspci. I am using the in-tree driver on Ubuntu 24.04; could that be causing the issue? @hkvision
From your lspci result below, the 256M memory size does not look correct; it should be around 16G. Could you check whether the card is set up properly (e.g., Resizable BAR enabled)? Also, is the output of `sycl-ls` as expected on your machine?
04:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
Subsystem: Shenzhen Gunnir Technology Development Co., Ltd DG2 [Arc A770]
Flags: bus master, fast devsel, latency 0, IRQ 234, IOMMU group 20
Memory at 86000000 (64-bit, non-prefetchable) [size=16M]
Memory at 4050000000 (64-bit, prefetchable) [size=256M]
Expansion ROM at 87000000 [disabled] [size=2M]
Capabilities: <access denied>
Kernel driver in use: i915
Kernel modules: i915, xe
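As a quick way to sanity-check Resizable BAR from lspci output like the above: with ReBAR enabled, the 64-bit prefetchable BAR of a 16 GB A770 should report a size around 16G rather than 256M. A minimal, hypothetical parser (not part of any Intel tooling) for `lspci -v` text:

```python
import re

def prefetchable_bar_sizes(lspci_text):
    """Extract the sizes of 64-bit prefetchable BARs from `lspci -v` output."""
    pattern = re.compile(
        r"Memory at [0-9a-f]+ \(64-bit, prefetchable\) \[size=([0-9]+[KMG])\]"
    )
    return pattern.findall(lspci_text)

# Sample taken from the lspci output quoted in this thread.
sample = """\
04:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08)
\tMemory at 86000000 (64-bit, non-prefetchable) [size=16M]
\tMemory at 4050000000 (64-bit, prefetchable) [size=256M]
"""

print(prefetchable_bar_sizes(sample))  # -> ['256M']; a 256M BAR suggests ReBAR is off
```

If this reports 256M on a 16 GB card, enabling Resizable BAR (and Above 4G Decoding) in the BIOS is usually the first thing to try.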
No Arc device shows up in the `sycl-ls` result. How should I fix this? When I previously ran ipex-llm inside Docker, `sycl-ls` could find the Arc.
We suspect this is not an ipex-llm issue but rather a problem with driver-related packages.
You may refer to https://dgpu-docs.intel.com/driver/client/overview.html#installing-client-gpus-on-ubuntu-desktop-24-04-lts for the driver guide.
The environment of the Docker image (Ubuntu 22.04) is here: https://github.com/intel/ipex-llm/blob/main/docker/llm/serving/xpu/docker/Dockerfile
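For reference, the client-GPU guide linked above boils down to roughly the following on Ubuntu 24.04. Treat this as a sketch: the repository and package names are taken from the guide at the time of writing and may change, so consult the guide itself before running anything.

```shell
# Sketch of the Ubuntu 24.04 client dGPU setup from the linked Intel guide.
# Repository/package names may have changed; the guide is authoritative.
sudo apt-get update
sudo apt-get install -y software-properties-common
sudo add-apt-repository -y ppa:kobuk-team/intel-graphics
sudo apt-get update
# User-space compute runtime (OpenCL + Level Zero) and a verification tool:
sudo apt-get install -y intel-opencl-icd libze1 libze-intel-gpu1 clinfo
# Verify the card is visible to the compute runtime:
clinfo | grep "770"
```

If `clinfo` sees the device but `sycl-ls` still does not, the oneAPI environment (e.g. `source /opt/intel/oneapi/setvars.sh`) is worth re-checking as well.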
I followed the guide you provided and reinstalled the driver. Using the command `clinfo | grep "770"` given at the end of the tutorial, I can now see the device. I then installed the other dependencies according to https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md#install-oneapi, and everything seemed fine. But in the end I still hit the segmentation fault. Any other suggestions?
Or is there an example of running Whisper in a Docker container?
Thanks for the guidance. The issue has been resolved.
Synced offline; `pip install trl==0.11.0` solves the problem.
Feel free to tell us if there are further issues later :)
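Since the fix came down to pinning a package version, a small sanity check can confirm an environment matches the versions reported working in this thread. The helper below is hypothetical (not part of ipex-llm); the `EXPECTED` pins are just the one from this thread and should be adjusted to your setup:

```python
from importlib.metadata import version, PackageNotFoundError

# Version reported working in this thread; extend for your own setup.
EXPECTED = {
    "trl": "0.11.0",
}

def check_pins(expected):
    """Return {package: (expected, installed)} for every mismatch.

    Installed version is None if the package is not installed at all.
    """
    mismatches = {}
    for pkg, want in expected.items():
        try:
            have = version(pkg)
        except PackageNotFoundError:
            have = None
        if have != want:
            mismatches[pkg] = (want, have)
    return mismatches

if __name__ == "__main__":
    bad = check_pins(EXPECTED)
    for pkg, (want, have) in bad.items():
        print(f"{pkg}: expected {want}, found {have}")
    if not bad:
        print("all pins match")
```

Running this before filing an issue makes it easy to spot a dependency that drifted from a known-good combination.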
Shall we update the example readme?
Sure :)

