TorchMoE / MoE-Infinity

A PyTorch library for cost-effective, fast, and easy serving of MoE models.

RuntimeError: CUDA error: invalid device ordinal. When I run script.py, I get the error below.

Tingberer opened this issue · comments

Fetching 267 files: 100% 267/267 [00:00<00:00, 7.56it/s]
[WARNING] FlashAttention is not available in the current environment. Using default attention.
Time to load prefetch op: 3.0545926094055176 seconds
Creating model from scratch ...
Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
No modifications detected for re-loaded extension module prefetch, skipping build step...
Loading extension module prefetch...
Loading checkpoint files: 0%| | 0/257 [00:00<?, ?it/s]

RuntimeError Traceback (most recent call last)
in <cell line: 17>()
15 }
16
---> 17 model = MoE(checkpoint, config)
18
19 input_text = "translate English to German: How old are you?"

1 frames
/usr/local/lib/python3.10/dist-packages/moe_infinity/runtime/model_offload.py in archer_from_pretrained(cls, *args, **kwargs)
405 # convert all tensors in state_dict to self.dtype
406 for k, v in state_dict.items():
--> 407 state_dict[k] = v.to(self.dtype).to("cpu")
408
409 self._offload_state_dict(state_dict, empty_state_dict)

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
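
For context, "invalid device ordinal" means a tensor or stream was addressed to a GPU index that is not visible to the process, e.g. when `CUDA_VISIBLE_DEVICES` hides devices or a config assumes more GPUs than the machine has. A minimal diagnostic sketch (generic CUDA/PyTorch behavior, not MoE-Infinity-specific API):

```python
import os
import torch

# "invalid device ordinal" is raised when code touches a GPU index that the
# current process cannot see. Print what CUDA actually exposes first.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPU count:", torch.cuda.device_count())

# Any remapping must happen before the first CUDA call in the process.
# "0" is only an example value; list the ordinals you actually want exposed.
# os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```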

Is it possible to provide the script to reproduce this? If it is one of the examples, please specify which one you ran. Providing your hardware settings would also be helpful.

I have fixed it. But when I run readme_example.py, I run into the problem below.
My hardware is 4× RTX 3090.

/home/admin/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/transformers/generation/utils.py:1249: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
/home/admin/anaconda3/envs/moe-infinity/lib/python3.9/site-packages/transformers/generation/utils.py:1797: UserWarning: You are calling .generate() with the `input_ids` being on a device type different than your model's device. `input_ids` is on cuda, whereas the model is on cpu. You may experience unexpected behaviors or slower generation. Please make sure that you have put `input_ids` to the correct device by calling for example input_ids = input_ids.to('cpu') before running `.generate()`.
  warnings.warn(
Model create:  20%|████████████████████▋                                                                                 | 930/4578 [00:16<00:01, 2992.65it/s]
translate English to German: You are Germany?

# Translate English to German


ArcherTaskPool destructor
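
Both UserWarnings above are generic transformers hints rather than MoE-Infinity errors. A hedged sketch of how the generate call in readme_example.py could address them, assuming `checkpoint` and `model` are the objects built earlier (whether the offloaded model reports `cpu` or a GPU as its device depends on the environment):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # same checkpoint passed to MoE(...)
inputs = tokenizer("translate English to German: How old are you?", return_tensors="pt")

# First warning: bound generation explicitly with max_new_tokens instead of
# relying on the default max_length=20.
# Second warning: keep input_ids on whichever device the model reports;
# falling back to "cpu" follows the warning text (assumption, not MoE-Infinity API).
device = getattr(model, "device", "cpu")
output_ids = model.generate(inputs.input_ids.to(device), max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```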