RuntimeError: 0 <= device.index() && device.index() < static_cast<c10::DeviceIndex>(device_ready_queues_.size()) INTERNAL ASSERT FAILED at "/build/pytorch/torch/csrc/autograd/engine.cpp":1418

Question

RuntimeError: 0 <= device.index() && device.index() < static_cast<c10::DeviceIndex>(device_ready_queues_.size()) INTERNAL ASSERT FAILED at "/build/pytorch/torch/csrc/autograd/engine.cpp":1418

SoldierWz opened this issue 2 months ago · comments

Describe the bug

When this problem occurred, I tried to disable the CPU core, and then I could run normally, but the running results were very poor, the accuracy dropped sharply and the training time became longer. I have submitted this issue #565. Then when I restored the CPU core, the above error occurred.
Here is the part where the problem occurs.
device = 'xpu'
for train_idx, test_idx in kf.split(X_tensor):
X_train, X_test = X_tensor[train_idx], X_tensor[test_idx]
y_train, y_test = y_tensor[train_idx], y_tensor[test_idx]

train_dataset = CustomDataset(X_train, y_train)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

model = MLP(X_train.shape[1]) 
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

model = model.to("xpu")
criterion = criterion.to("xpu")
model, optimizer = ipex.optimize(model, optimizer=optimizer)
for epoch in range(1000):
    model.train() 
    for features, labels in train_loader:
        features, labels = features.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(features)
        loss = criterion(outputs, labels)
        **loss.backward()**
        optimizer.step()

Versions

wget https://raw.githubusercontent.com/intel/intel-extension-for-pytorch/master/scripts/collect_env.py

For security purposes, please check the contents of collect_env.py before running it.

python collect_env.py

Jiong Gong · Answer 1 · Tue Mar 26 2024 19:50:13 GMT+0800 (China Standard Time)

May I know what you mean by "disable CPU core"? It sounds like no GPU was found according to the error message. But we should report more meaningful error messages. cc @gujinghui

SoldierWz · Answer 2 · Wed Mar 27 2024 16:26:38 GMT+0800 (China Standard Time)

May I know what you mean by "disable CPU core"? It sounds like no GPU was found according to the error message. But we should report more meaningful error messages. cc @gujinghui

I edited the GRUB configuration file
Change GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
Changed to GRUB_CMDLINE_LINUX_DEFAULT="nohz=off"
There is another line which is GRUB_CMDLINE_LINUX="i915.enable_hangcheck=0" which I did not change.
After editing like this, the GPU can be used
But I just tried and a new problem occurred. The error is reported below.

ImportError Traceback (most recent call last)
Cell In[2], line 8
6 import modin.pandas as pd
7 import numpy as np
----> 8 import torch
9 import intel_extension_for_pytorch as ipex
10 import torch.nn as nn

File ~/mambaforge/envs/pytorch-arc/lib/python3.11/site-packages/torch/init.py:235
233 if USE_GLOBAL_DEPS:
234 _load_global_deps()
--> 235 from torch._C import * # noqa: F403
237 # Appease the type checker; ordinarily this binding is inserted by the
238 # torch._C module initialization code in C
239 if TYPE_CHECKING:

ImportError: /home/wangzhen/mambaforge/envs/pytorch-arc/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so: undefined symbol: iJIT_NotifyEvent