TimDettmers / bitsandbytes

Accessible large language models via k-bit quantization for PyTorch.

Home Page: https://huggingface.co/docs/bitsandbytes/main/en/index

Tensor support for quantize_nf4

YLGH opened this issue

Feature request

Hi,

Is it possible to quantize to nf4 using a blocksize of x.numel()? The largest I found was 4096 (but this actually gives me a CUDA error):

import torch
import bitsandbytes

x = torch.randn(16, device="cuda").half()
print(x)
x_nf4, state_nf4 = bitsandbytes.functional.quantize_nf4(x, blocksize=4096)
x_serdes = bitsandbytes.functional.dequantize_nf4(x_nf4, state_nf4)
print(x_serdes)

Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/functional.py", line 933, in quantize_nf4
return quantize_4bit(A, absmax, out, blocksize, compress_statistics, 'nf4', quant_storage)
File "/usr/local/lib/python3.10/dist-packages/bitsandbytes/functional.py", line 981, in quantize_4bit
absmax = torch.zeros((blocks,), device=A.device, dtype=torch.float32)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
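
For contrast, here is a minimal sketch of the same round trip with one of the supported block sizes (assuming a CUDA GPU and a recent bitsandbytes build); the error print at the end is just illustrative:

import torch
import bitsandbytes

# Same quantize/dequantize round trip, but with a blocksize from the supported list.
x = torch.randn(4096, device="cuda").half()
x_nf4, state_nf4 = bitsandbytes.functional.quantize_nf4(x, blocksize=256)
x_back = bitsandbytes.functional.dequantize_nf4(x_nf4, state_nf4)
# Report the worst-case reconstruction error instead of eyeballing printed tensors.
print((x - x_back).abs().max().item())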

Motivation

I want to see how quantization performs when the blocksize equals the tensor size.

Your contribution

I can help with testing or a PR if it's easy to support.

blocksize must be in [64, 128, 256, 512, 1024, 2048, 4096].
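
For what it's worth, a hypothetical pre-check along these lines (the helper name and the hard-coded list are my own, not library API) would surface the constraint as a Python error instead of an opaque CUDA failure:

import bitsandbytes

# Block sizes listed in the comment above; not pulled from the library itself.
SUPPORTED_BLOCKSIZES = (64, 128, 256, 512, 1024, 2048, 4096)

def quantize_nf4_checked(x, blocksize):
    # Reject unsupported block sizes up front rather than letting a kernel fail later.
    if blocksize not in SUPPORTED_BLOCKSIZES:
        raise ValueError(f"blocksize must be one of {SUPPORTED_BLOCKSIZES}, got {blocksize}")
    return bitsandbytes.functional.quantize_nf4(x, blocksize=blocksize)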

I am able to reproduce this with blocksize=4096 but not any of the other options. I should have a PR ready to fix that tomorrow after a little more testing.
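
If it helps with testing, a rough sketch that exercises every documented blocksize (assuming a CUDA GPU; this is not the actual PR's test) might look like the following. With the bug described above, the blocksize=4096 iteration is expected to hit the CUDA error:

import torch
import bitsandbytes

x = torch.randn(8192, device="cuda").half()
for bs in (64, 128, 256, 512, 1024, 2048, 4096):
    # Round-trip at each documented blocksize and report the max reconstruction error.
    x_nf4, state = bitsandbytes.functional.quantize_nf4(x, blocksize=bs)
    x_back = bitsandbytes.functional.dequantize_nf4(x_nf4, state)
    print(bs, (x - x_back).abs().max().item())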