TimDettmers / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.

Home Page: https://huggingface.co/docs/bitsandbytes/main/en/index

NVIDIA TX2 + JetPack 5 + Ubuntu 20.04: CUDA Setup failed despite GPU being available

qxpBlog opened this issue

System Info

cuda 11.4
_openmp_mutex 4.5 2_gnu https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
accelerate 0.28.0
aiofiles 23.2.1
aiohttp 3.9.3
aiosignal 1.3.1
altair 5.2.0
annotated-types 0.6.0
antlr4-python3-runtime 4.9.3
anyio 4.3.0
arrow 1.3.0
async-timeout 4.0.3
attrs 23.2.0
bitsandbytes 0.42.0
bzip2 1.0.8 hf897c2e_4 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
ca-certificates 2022.9.24 h4fd8a4c_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
certifi 2024.2.2
chardet 5.2.0
charset-normalizer 3.3.2
click 8.1.7
codecarbon 2.3.4
colorama 0.4.6
cycler 0.12.1
DataProperty 1.0.1
datasets 2.18.0
dill 0.3.8
docker-pycreds 0.4.0
exceptiongroup 1.2.0
fastapi 0.110.0
ffmpy 0.3.2
filelock 3.13.1
frozenlist 1.4.1
fsspec 2024.2.0
gitdb 4.0.11
GitPython 3.1.42
gradio 4.23.0
gradio_client 0.14.0
h11 0.14.0
httpcore 1.0.4
httpx 0.27.0
huggingface-hub 0.21.4
idna 3.6
importlib-resources 5.13.0
Jinja2 3.1.3
joblib 1.3.2
jsonlines 4.0.0
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
kiwisolver 1.4.5
ld_impl_linux-aarch64 2.39 ha75b1e8_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libffi 3.4.2 h3557bc0_5 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libgcc-ng 12.2.0 h607ecd0_19 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libgomp 12.2.0 h607ecd0_19 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libnsl 2.0.0 hf897c2e_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libuuid 2.32.1 hf897c2e_1000 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libzlib 1.2.13 h4e544f5_4 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
markdown-it-py 3.0.0
mbstrdecoder 1.1.3
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
ncurses 6.4 h419075a_0
networkx 3.1
nltk 3.8.1
numexpr 2.8.6
numpy 1.24.4
omegaconf 2.3.0
openssl 3.0.7 h4e544f5_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
orjson 3.9.15
packaging 24.0
pandas 2.0.3
pathvalidate 3.2.0
peft 0.10.0
pillow 10.2.0
pip 22.3.1 pyhd8ed1ab_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
pkgutil_resolve_name 1.3.10
portalocker 2.8.2
prometheus_client 0.20.0
psutil 5.9.8
ptflops 0.7.2.2
py-cpuinfo 9.0.0
pyarrow 15.0.2
pyarrow-hotfix 0.6
pycountry 23.12.11
pydantic 2.6.4
pydantic_core 2.16.3
pydub 0.25.1
Pygments 2.17.2
pynvml 11.5.0
pyparsing 3.1.2
pytablewriter 1.2.0
python 3.8.13 h92ab765_0_cpython https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
python-dateutil 2.9.0.post0
python-multipart 0.0.9
pytz 2024.1
PyYAML 6.0.1
rapidfuzz 3.7.0
readline 8.1.2 h38e3740_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
referencing 0.34.0
regex 2023.12.25
requests 2.31.0
rich 13.7.1
rouge-score 0.1.2
rpds-py 0.18.0
ruff 0.3.4
sacrebleu 1.5.0
safetensors 0.4.2
scikit-learn 1.3.2
semantic-version 2.10.0
sentencepiece 0.2.0
sentry-sdk 1.43.0
setproctitle 1.3.3
setuptools 65.5.1 pyhd8ed1ab_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
shellingham 1.5.4
six 1.16.0
smmap 5.0.1
smoothquant 0.0.1.dev0
sniffio 1.3.1
sqlite 3.41.2 h998d150_0
sqlitedict 2.1.0
starlette 0.36.3
sympy 1.12
tabledata 1.3.3
tcolorpy 0.1.4
threadpoolctl 3.4.0
tiktoken 0.6.0
tk 8.6.12 hd8af866_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
tokenizers 0.15.2
tomlkit 0.12.0
toolz 0.12.1
torch 2.0.0+nv23.5
transformers 4.38.2
typepy 1.3.2
typer 0.10.0
types-python-dateutil 2.9.0.20240316
typing_extensions 4.10.0
tzdata 2024.1
tzdata 2022f h191b570_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
urllib3 2.2.1
uvicorn 0.29.0
wandb 0.16.5
websockets 11.0.3
wheel 0.38.4 pyhd8ed1ab_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
xxhash 3.4.1
xz 5.4.6 h998d150_0
yarl 1.9.4
zlib 1.2.13 h4e544f5_4 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge

Reproduction

When I run the following code:

import os
import sys
import argparse
import accelerate
from accelerate.utils import BnbQuantizationConfig
import torch
import numpy as np
import time
import transformers 
from transformers import GenerationConfig, LlamaForCausalLM, LlamaTokenizer, AutoModel, AutoTokenizer, AutoModelForCausalLM, GPTQConfig
from codecarbon import track_emissions, EmissionsTracker
from LLMPruner.utils.logger import LoggerWithDepth
from transformers.models.opt.modeling_opt import OPTAttention, OPTDecoderLayer, OPTForCausalLM
from ptflops import get_model_complexity_info
from ptflops.pytorch_ops import bn_flops_counter_hook, pool_flops_counter_hook
from LLMPruner.evaluator.ppl import PPLMetric, test_latency_energy
from LLMPruner.models.hf_llama.modeling_llama import LlamaForCausalLM, LlamaRMSNorm, LlamaAttention, LlamaMLP
from LLMPruner.peft import PeftModel
if torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
torch_version = int(torch.__version__.split('.')[1])

def LlamaAttention_counter_hook(module, input, output):
    # (1) Ignore past-key values
    # (2) Assume there is no attention mask
    # Input can be empty in some PyTorch versions; use output here since input.shape == output.shape
    flops = 0
    q_len = output[0].shape[1]
    linear_dim = output[0].shape[-1]
    num_heads = module.num_heads
    head_dim = module.head_dim

    rotary_flops = 2 * (q_len * num_heads * head_dim) * 2
    attention_flops = num_heads * (q_len * q_len * head_dim + q_len * q_len + q_len * q_len * head_dim) #QK^T + softmax + AttentionV
    linear_flops = 4 * (q_len * linear_dim * num_heads * head_dim) # 4 for q, k, v, o. 
    flops += rotary_flops + attention_flops + linear_flops
    module.__flops__ += int(flops)

def rmsnorm_flops_counter_hook(module, input, output):
    input = input[0]

    batch_flops = np.prod(input.shape)
    batch_flops *= 2
    module.__flops__ += int(batch_flops)
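
# A note on how these hooks are used (illustrative only; the exact call site is
# assumed to live inside test_latency_energy rather than in this file): ptflops
# accepts per-module hooks through the custom_modules_hooks argument of
# get_model_complexity_info, roughly along these lines:
#
#   macs, params = get_model_complexity_info(
#       model, (1, args.max_seq_len),
#       input_constructor=lambda shape: {"input_ids": torch.ones(shape, dtype=torch.long, device=device)},
#       as_strings=True, print_per_layer_stat=True,
#       custom_modules_hooks={
#           LlamaAttention: LlamaAttention_counter_hook,
#           LlamaRMSNorm: rmsnorm_flops_counter_hook,
#       },
#   )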


# @track_emissions()
def main(args):
    
    if args.test_mod == 'tuned':
        # Evaluate the latency and energy consumption of the fine-tuned model
        pruned_dict = torch.load(args.ckpt, map_location='cpu')
        tokenizer, model = pruned_dict['tokenizer'], pruned_dict['model']
        model = PeftModel.from_pretrained(
            model,
            args.lora_ckpt,
            torch_dtype=torch.float16,
        )
    elif args.test_mod == 'pruned':
        # Evaluate the latency and energy consumption of the pruned model
        pruned_dict = torch.load(args.ckpt, map_location='cpu')
        tokenizer, model = pruned_dict['tokenizer'], pruned_dict['model']
    elif args.test_mod == 'base':
        model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-0.5B", torch_dtype="auto", trust_remote_code=True)
        tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B", trust_remote_code=True)
    
        
        
    model.to(device)      
    # torch.save({
    #     'model': model, 
    #     'tokenizer': tokenizer,
    # }, "/home/iotsc01/xinpengq/LLM-Pruner-main/prune_log/quant/pytorch_model.bin")    
    
      
    print(model.device)
    # model.config.pad_token_id = tokenizer.pad_token_id = 0 
    # model.config.bos_token_id = 1
    # model.config.eos_token_id = 2

    model.eval()
    
    after_pruning_parameters = sum(p.numel() for p in model.parameters())
    print("#parameters: {}".format(after_pruning_parameters))
    
    ppl = test_latency_energy(model, tokenizer, ['wikitext2', 'ptb'], args.max_seq_len, device=device)
    print("PPL after pruning: {}".format(ppl))
    print("Memory Requirement: {} MiB\n".format(torch.cuda.memory_allocated() / 1024 / 1024))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Tuning Pruned LLaMA (huggingface version)')
    
    parser.add_argument('--base_model', type=str, default="llama-7b-hf", help='base model name')
    parser.add_argument('--ckpt', type=str, default=None)
    parser.add_argument('--lora_ckpt', type=str, default=None)
    parser.add_argument('--max_seq_len', type=int, default=128, help='max sequence length')
    parser.add_argument('--test_mod', type=str, default="tuned", help='choose from [pruned, tuned, base]')
    args = parser.parse_args()

    main(args)
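
For reference, an invocation for the base case would look something like this (the filename comes from the traceback below; the exact flags are an assumption based on the argparse defaults above):

python test_latency_energy.py --test_mod base --max_seq_len 128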

I set the argument test_mod to base, but the following issues occurred:

/home/jetson/.local/lib/python3.8/site-packages/torchvision-0.13.0-py3.8-linux-aarch64.egg/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: /home/jetson/.local/lib/python3.8/site-packages/torchvision-0.13.0-py3.8-linux-aarch64.egg/torchvision/image.so: undefined symbol: _ZNK3c1010TensorImpl36is_contiguous_nondefault_policy_implENS_12MemoryFormatE
  warn(f"Failed to load image Python extension: {e}")
/home/jetson/archiconda3/envs/llm/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:167: UserWarning: Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes


  warn(msg)
/home/jetson/archiconda3/envs/llm/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:167: UserWarning: /home/jetson/archiconda3/envs/llm did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/home/jetson/archiconda3/envs/llm/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:167: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda-11.4/lib64/libcudart.so.11.0'), PosixPath('/usr/local/cuda-11.4/lib64/libcudart.so')}.. We select the PyTorch default libcudart.so, which is {torch.version.cuda},but this might missmatch with the CUDA version that is needed for bitsandbytes.To override this behavior set the BNB_CUDA_VERSION=<version string, e.g. 122> environmental variableFor example, if you want to use the CUDA version 122BNB_CUDA_VERSION=122 python ...OR set the environmental variable in your .bashrc: export BNB_CUDA_VERSION=122In the case of a manual override, make sure you set the LD_LIBRARY_PATH, e.g.export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.2
  warn(msg)
/home/jetson/archiconda3/envs/llm/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:167: UserWarning: /opt/ros/noetic/lib:/usr/local/cuda-11.4/lib64 did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
  warn(msg)
/home/jetson/archiconda3/envs/llm/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:167: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0'), PosixPath('/usr/local/cuda/lib64/libcudart.so')}.. We select the PyTorch default libcudart.so, which is {torch.version.cuda},but this might missmatch with the CUDA version that is needed for bitsandbytes.To override this behavior set the BNB_CUDA_VERSION=<version string, e.g. 122> environmental variableFor example, if you want to use the CUDA version 122BNB_CUDA_VERSION=122 python ...OR set the environmental variable in your .bashrc: export BNB_CUDA_VERSION=122In the case of a manual override, make sure you set the LD_LIBRARY_PATH, e.g.export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.2
  warn(msg)
False

===================================BUG REPORT===================================
================================================================================
The following directories listed in your path were found to be non-existent: {PosixPath('https'), PosixPath('//hf-mirror.com')}
The following directories listed in your path were found to be non-existent: {PosixPath('//localhost'), PosixPath('http'), PosixPath('11311')}
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...
DEBUG: Possible options found for libcudart.so: {PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0'), PosixPath('/usr/local/cuda/lib64/libcudart.so')}
CUDA SETUP: PyTorch settings found: CUDA_VERSION=114, Highest Compute Capability: 8.7.
CUDA SETUP: To manually override the PyTorch CUDA version please see:https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
CUDA SETUP: Loading binary /home/jetson/archiconda3/envs/llm/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda114.so...
/home/jetson/archiconda3/envs/llm/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda114.so: cannot open shared object file: No such file or directory
CUDA SETUP: Something unexpected happened. Please compile from source:
git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=114 make cuda11x
python setup.py install
Traceback (most recent call last):
  File "/home/jetson/llm-mian/LLM-Pruner-main/test_latency_energy.py", line 18, in <module>
    from LLMPruner.peft import PeftModel
  File "/home/jetson/llm-mian/LLM-Pruner-main/LLMPruner/peft/__init__.py", line 22, in <module>
    from .mapping import MODEL_TYPE_TO_PEFT_MODEL_MAPPING, PEFT_TYPE_TO_CONFIG_MAPPING, get_peft_config, get_peft_model
  File "/home/jetson/llm-mian/LLM-Pruner-main/LLMPruner/peft/mapping.py", line 16, in <module>
    from .peft_model import (
  File "/home/jetson/llm-mian/LLM-Pruner-main/LLMPruner/peft/peft_model.py", line 31, in <module>
    from .tuners import AdaLoraModel, LoraModel, PrefixEncoder, PromptEmbedding, PromptEncoder
  File "/home/jetson/llm-mian/LLM-Pruner-main/LLMPruner/peft/tuners/__init__.py", line 20, in <module>
    from .lora import LoraConfig, LoraModel
  File "/home/jetson/llm-mian/LLM-Pruner-main/LLMPruner/peft/tuners/lora.py", line 40, in <module>
    import bitsandbytes as bnb
  File "/home/jetson/archiconda3/envs/llm/lib/python3.8/site-packages/bitsandbytes/__init__.py", line 6, in <module>
    from . import cuda_setup, utils, research
  File "/home/jetson/archiconda3/envs/llm/lib/python3.8/site-packages/bitsandbytes/research/__init__.py", line 1, in <module>
    from . import nn
  File "/home/jetson/archiconda3/envs/llm/lib/python3.8/site-packages/bitsandbytes/research/nn/__init__.py", line 1, in <module>
    from .modules import LinearFP8Mixed, LinearFP8Global
  File "/home/jetson/archiconda3/envs/llm/lib/python3.8/site-packages/bitsandbytes/research/nn/modules.py", line 8, in <module>
    from bitsandbytes.optim import GlobalOptimManager
  File "/home/jetson/archiconda3/envs/llm/lib/python3.8/site-packages/bitsandbytes/optim/__init__.py", line 6, in <module>
    from bitsandbytes.cextension import COMPILED_WITH_CUDA
  File "/home/jetson/archiconda3/envs/llm/lib/python3.8/site-packages/bitsandbytes/cextension.py", line 20, in <module>
    raise RuntimeError('''
RuntimeError: 
        CUDA Setup failed despite GPU being available. Please run the following command to get more information:

        python -m bitsandbytes

        Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
        to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
        and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues

Expected behavior

@kashif @stephenroller @akx @jbn I want to know how to solve this problem, and whether bitsandbytes supports the TX2 at all. Looking forward to your reply.

Duplicate of #1151. There has not been a bitsandbytes release built for aarch64 yet.
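
For completeness, the compile-from-source route that the CUDA SETUP output above suggests would look roughly like this on a CUDA 11.4 Jetson (a sketch only: the CUDA_VERSION=114 value and the /usr/local/cuda-11.4 path are taken from the logs above, and it is not guaranteed that 0.42.0 actually builds on aarch64):

git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=114 make cuda11x
python setup.py install
# if the library is still not found at import time, point the loader at CUDA 11.4 explicitly
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.4/lib64
export BNB_CUDA_VERSION=114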

thanks