Error invalid device ordinal at line 393
matt-seb-ho opened this issue · comments
Hi, thanks for releasing and supporting this package! The results are super impressive, so I'm trying to get the quantization benefits in my own projects by running QLoRA on Llama-7B. Using a slightly modified finetune.sh script, I'm hitting the following error:
...
Adding special tokens.
adding LoRA modules...
loaded model
Splitting train dataset in train and validation according to `eval_dataset_size`
Found cached dataset json (/mnt/hdd/msho/.cache/huggingface/datasets/json/default-449d839f7091c29e/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)
100%|████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 494.09it/s]
trainable params: 79953920.0 || all params: 3660328960 || trainable: 2.1843370056007205
torch.float32 422326272 0.11537932153507864
torch.uint8 3238002688 0.8846206784649213
0%| | 0/10000 [00:00<?, ?it/s]
Error invalid device ordinal at line 393 in file /mmfs1/gscratch/zlab/timdettmers/git/bitsandbytes/csrc/pythonInterface.c
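For what it's worth, my understanding is that "invalid device ordinal" is CUDA's error for addressing a GPU index outside the set of visible devices, and that CUDA remaps whatever CUDA_VISIBLE_DEVICES lists down to ordinals 0..n-1. A minimal sketch of that remapping (hypothetical helper, not the bitsandbytes code):

```python
import os

def visible_ordinals(env=None):
    """Return the valid device ordinals implied by CUDA_VISIBLE_DEVICES.

    CUDA renumbers the visible devices starting at 0, so a process launched
    with CUDA_VISIBLE_DEVICES=2,3 must address them as ordinals 0 and 1;
    asking for ordinal 2 there raises "invalid device ordinal".
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        # Unset: all physical devices are visible under their real indices.
        return None
    ids = [v for v in value.split(",") if v.strip()]
    return list(range(len(ids)))

# With two visible devices, only ordinals 0 and 1 are valid.
print(visible_ordinals({"CUDA_VISIBLE_DEVICES": "2,3"}))  # [0, 1]
```

So one thing I'm checking is whether my launch script exports a CUDA_VISIBLE_DEVICES that leaves fewer devices visible than the device index being requested.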
I'm working on a server with CUDA 11.7 (top of the nvidia-smi readout):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
and I'm fairly certain I have the right dependencies. My pip freeze matches requirements.txt everywhere except bitsandbytes, where I'm using a slightly newer version (0.41.1) because 0.40 fails on import for me (it detects CUDA 10 for some reason):
bitsandbytes==0.41.1
transformers==4.31.0
peft==0.4.0
accelerate==0.21.0
einops==0.6.1
evaluate==0.4.0
scikit-learn==1.2.2
sentencepiece==0.1.99
wandb==0.15.3
I'm aware there was a similar issue in the past (#3), but that one seems to have been resolved, so I'm not sure why I'm hitting the same error.
Any suggestions?
Thanks in advance!