MilaNLProc / simple-generation

A Python package to run inference with HuggingFace language and vision-language checkpoints, wrapping many convenient features.


RuntimeError: CUDA error: Unspecified Launch Failure During Generation

donya-rooein opened this issue · comments

When attempting to generate text with a CUDA-enabled model, a CUDA "unspecified launch failure" error occurs. The error halts the generation process, leading to incomplete or failed batches.

checkpoint = "meta-llama/Llama-2-70b-chat-hf"
responses = generator(texts, apply_chat_template=True, skip_prompt=True, batch_size="auto", temperature=0.0, max_new_tokens=256)

output:
Generation:   0%|          | 0/96 [00:00<?, ?it/s]
Error CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Generation failed. Skipping batch.
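As the error message suggests, setting CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so the Python stack trace points at the operation that actually failed. A minimal debugging sketch (the variable must be set before torch initializes CUDA, e.g. at the very top of the script):

import os

# Force synchronous CUDA kernel launches so the reported stack trace is accurate.
# Set this before importing torch, i.e. before CUDA is initialized.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after the env var so the setting takes effect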

Hello Donya! This looks like an error related to DDP. How are you executing the script? Can you also show me the run command? Is it a python ..., accelerate ..., or torchrun ... invocation?

Also, I notice you are loading the model without any quantization or dtype specified. That means you are loading the weights at full precision and are probably running out of memory (OOM). Can you try adding, for example, torch_dtype=torch.bfloat16 to the loader?
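For reference, a minimal sketch of the suggested fix, assuming the generator is a simple_generation SimpleGenerator and that extra loader keyword arguments such as torch_dtype are forwarded to the underlying from_pretrained call (the exact constructor options may differ across versions):

import torch
from simple_generation import SimpleGenerator

checkpoint = "meta-llama/Llama-2-70b-chat-hf"
texts = ["Hello, how are you?"]  # example prompts

# Load the weights in bfloat16 (half the memory of fp32) to reduce the chance
# of exhausting GPU memory with a 70B-parameter checkpoint.
generator = SimpleGenerator(checkpoint, torch_dtype=torch.bfloat16)

responses = generator(
    texts,
    apply_chat_template=True,
    skip_prompt=True,
    batch_size="auto",
    temperature=0.0,
    max_new_tokens=256,
)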

Hello @donya-rooein, did any of the fixes work for you?