llama_inference 4-bit error
gjm441 opened this issue · comments
When I run the quantization script:
CUDA_VISIBLE_DEVICES=0 python llama.py ${MODEL_DIR} c4 --wbits 4 --true-sequential --act-order --groupsize 128 --eval --save llama7b-4bit-128g.pt &>baseline.txt &
I get the same perplexity as reported in the README. But when I run inference with the saved int4 weights:
CUDA_VISIBLE_DEVICES=0 python llama_inference.py decapoda-research/llama-7b-hf --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama"
I get the following error:
Loading model ...
Found 3 unique KN Linear values.
Warming up autotune cache ...
0%| | 0/12 [00:00<?, ?it/s]/usr/bin/ld: cannot find -lcuda
collect2: error: ld returned 1 exit status
0%| | 0/12 [00:00<?, ?it/s]
Traceback (most recent call last):
File "", line 21, in matmul_248_kernel
KeyError: ('2-.-0-.-0-37ce7529e37ca1a0b8a47b63bc5fd4b0-d6252949da17ceb5f3a278a70250af13-3b85c7bef5f0a641282f3b73af50f599-2d732a2488b7ed996facc3e641ee56bf-3498c340fd4b6ee7805fd54b882a04f5-e1f133f98d04093da2078dfc51c36b72-b26258bf01f839199e39d64851821f26-d7c06e3b46e708006c15224aac7a1378-f585402118c8a136948ce0a49cfe122c', (torch.float16, torch.int32, torch.float16, torch.float16, torch.int32, torch.int32, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (16, 256, 32, 8), (True, True, True, True, True, True, (False, True), (True, False), (True, False), (False, False), (False, False), (True, False), (False, True), (True, False), (False, True), (True, False), (False, True), (True, False), (True, False)))
During handling of the above exception, another exception occurred:
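A common cause of the `/usr/bin/ld: cannot find -lcuda` failure during Triton's autotune warmup is that the linker cannot locate the CUDA driver library. One frequently suggested workaround is to point `LIBRARY_PATH` at the driver stubs shipped with the CUDA toolkit before launching the script. This is a hedged sketch, not a confirmed fix for this issue; the `/usr/local/cuda` location is an assumption and should be adjusted to match your install.

```shell
# Workaround sketch for "/usr/bin/ld: cannot find -lcuda":
# expose the CUDA driver stub library (libcuda.so) to the linker.
# CUDA_HOME / /usr/local/cuda is an assumed install path - adjust as needed.
CUDA_STUBS="${CUDA_HOME:-/usr/local/cuda}/lib64/stubs"

# Prepend the stubs directory, preserving any existing LIBRARY_PATH.
export LIBRARY_PATH="${CUDA_STUBS}${LIBRARY_PATH:+:$LIBRARY_PATH}"
echo "$LIBRARY_PATH"
```

After exporting this, rerun the `llama_inference.py` command in the same shell. If the stubs directory does not exist on your system, an alternative people report is symlinking the real `libcuda.so.1` from the driver install into a directory the linker already searches.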