Issue running inference with a Hugging Face GPT-J 4-bit model
webpolis opened this issue · comments
This is a follow-up to #371 (comment)
After converting a 4-bit GPT-J model to ggml with the convert-h5-to-ggml.py script, inference fails with the following:
main: seed = 1695659205
gptj_model_load: loading model from 'ggml-model-f16.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: ftype = 1
gptj_model_load: qntvr = 0
gptj_model_load: ggml ctx size = 12438.93 MB
gptj_model_load: memory_size = 896.00 MB, n_mem = 57344
gptj_model_load: tensor 'transformer.h.0.attn.k_proj.weight' has wrong size in model file
main: failed to load model from 'ggml-model-f16.bin'
gptj_model_load:
Apparently ggml expects twice as many elements as the tensor in the file contains. I added some verbosity here to confirm it:
https://github.com/ggerganov/ggml/blob/master/examples/gpt-j/main.cpp#L333
The tensor in the file has 8388608 elements, while ggml expects 16777216:
gptj_model_load: tensor 'transformer.h.0.attn.k_proj.weight' has wrong size in model file (16777216, 8388608)
main: failed to load model from 'ggml-model-f16.bin'
gptj_model_load:
I assume this is related to the model being 4-bit, but I'm not yet sure what to change.
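The factor of two between the two counts is what 4-bit packing would produce. The sketch below only redoes the arithmetic from the log; it assumes the quantized weight is stored packed, two 4-bit values per byte, which is not confirmed anywhere in the output above.

```python
# Sanity check of the two element counts printed in the error message.
n_embd = 4096  # from the loader log above

# ggml allocates k_proj as an n_embd x n_embd matrix.
expected_elements = n_embd * n_embd

# Assumption: a 4-bit quantized weight is stored packed, two values per byte,
# so the stored tensor exposes half as many elements as the logical matrix.
packed_elements = expected_elements // 2

print(expected_elements, packed_elements)
```

If that assumption holds, the weights need to be dequantized back to a full-size matrix before they are written to the ggml file.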
I partially solved this, but now the model only generates a stream of "A" tokens:
❯ ./build/bin/gpt-j -m ./ggml-model-f16.bin -p "A continuación hay una instrucción que describe una tarea. Proporciona una respuesta que complete adecuadamente la solicitud.\n\n### Instrucción:\nEscribe un poema de 4 versos\n\n### Respuesta:\n" -n 512 --top_p 0.8 --temp 0.2
main: seed = 1696200850
gptj_model_load: loading model from './ggml-model-f16.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: ftype = 1
gptj_model_load: qntvr = 0
gptj_model_load: ggml ctx size = 12438.93 MB
gptj_model_load: memory_size = 896.00 MB, n_mem = 57344
gptj_model_load: ............................ done
gptj_model_load: model size = 11540.60 MB / num tensors = 229
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: number of tokens in prompt = 68
A continuación hay una instrucción que describe una tarea. Proporciona una respuesta que complete adecuadamente la solicitud.\n\n### Instrucción:\nEscribe un poema de 4 versos\n\n### Respuesta:\n
A
A
A
A
A
A
A
A
A
A
My current implementation:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# load quantized version of https://huggingface.co/bertin-project/bertin-gpt-j-6B-alpaca
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # llm_int8_enable_fp32_cpu_offload=True
)
model = AutoModelForCausalLM.from_pretrained(
    'bertin-project/bertin-gpt-j-6B-alpaca',
    low_cpu_mem_usage=True,
    device_map=device_map,  # split between 2 GPUs (device_map is defined elsewhere)
    torch_dtype='auto',
    quantization_config=bnb_config,
    use_cache=False
)
Then dequantize and export:
import copy
import struct
import sys

import bitsandbytes as bnb
import bitsandbytes.functional as F  # F is assumed to be bitsandbytes.functional
import numpy as np

# fout, ftype and list_vars come from the rest of the conversion script
cls = bnb.nn.Linear4bit

def write_data(name, d):
    orig_data_shape = d.shape
    orig_data_size = sys.getsizeof(d)
    d = d.to(torch.float16).squeeze().to('cpu').numpy()
    n_dims = len(d.shape)
    print("Writing: " + name + " with shape: ", d.shape)
    print('Original shape: ', orig_data_shape)
    print((orig_data_size, sys.getsizeof(d)))
    ftype_cur = 0
    if ftype != 0:
        # only 2-D weight matrices are stored as float16
        if name[-7:] == ".weight" and n_dims == 2:
            print(" Converting to float16")
            d = d.astype(np.float16)
            ftype_cur = 1
        else:
            print(" Converting to float32")
            d = d.astype(np.float32)
            ftype_cur = 0
    else:
        if d.dtype != np.float32:
            print(" Converting to float32")
            d = d.astype(np.float32)
            ftype_cur = 0
    # header: number of dims, name length, type, then dims in reverse order
    name_bytes = name.encode('utf-8')  # renamed from `str` to avoid shadowing the builtin
    fout.write(struct.pack("iii", n_dims, len(name_bytes), ftype_cur))
    for i in range(n_dims):
        fout.write(struct.pack("i", d.shape[n_dims - 1 - i]))
    fout.write(name_bytes)
    # write file
    d.tofile(fout)

# dequantize (if required) and export modules
with torch.no_grad():
    for orig_name, module in model.named_modules():
        if orig_name.endswith("attn.masked_bias") or orig_name.endswith(".attn.bias"):
            print(" Skipping variable: " + orig_name)
            continue
        if isinstance(module, cls):
            # quantized linear layer: rebuild the full-size weight before writing.
            # Only the weight is written in this branch; a bias on the same
            # module, if present, is not exported.
            name = f'{orig_name}.weight'
            print(f"Dequantizing `{orig_name}`...")
            quant_state = copy.deepcopy(module.weight.quant_state)
            # quant_state.dtype = torch.bfloat16
            weight_deq = F.dequantize_4bit(
                module.weight.data, quant_state=quant_state, quant_type="nf4").to(torch.bfloat16)
            write_data(name, weight_deq)
        elif f'{orig_name}.weight' in list_vars or \
                f'{orig_name}.bias' in list_vars:
            # non-quantized module: write weight and bias as they are
            if hasattr(module, 'weight'):
                name = f'{orig_name}.weight'
                data = module.weight.data
                write_data(name, data)
            if hasattr(module, 'bias'):
                name = f'{orig_name}.bias'
                data = module.bias.data
                write_data(name, data)
fout.close()
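To check what actually ends up in the file, the per-tensor records can be read back. This is a minimal sketch based only on the record layout visible in write_data above (three ints, the dims in reverse order, the name, then the raw data). The helper names are hypothetical, and the file-level header that precedes the tensor records (magic, hyperparameters, vocabulary) is not handled here.

```python
import io
import struct

# Bytes per element for the two ftype_cur values used by write_data.
ELEM_SIZE = {0: 4, 1: 2}  # 0 = float32, 1 = float16


def write_tensor_record(f, name, shape, ftype_cur, payload):
    """Write one record in the same layout as write_data (hypothetical helper)."""
    encoded = name.encode("utf-8")
    f.write(struct.pack("iii", len(shape), len(encoded), ftype_cur))
    for i in range(len(shape)):
        f.write(struct.pack("i", shape[len(shape) - 1 - i]))  # dims in reverse order
    f.write(encoded)
    f.write(payload)


def read_tensor_record(f):
    """Read one record; return (name, shape, ftype_cur), or None at end of data."""
    header = f.read(12)
    if len(header) < 12:
        return None
    n_dims, name_len, ftype_cur = struct.unpack("iii", header)
    dims = struct.unpack(f"{n_dims}i", f.read(4 * n_dims))
    name = f.read(name_len).decode("utf-8")
    n_elements = 1
    for d in dims:
        n_elements *= d
    f.seek(n_elements * ELEM_SIZE[ftype_cur], io.SEEK_CUR)  # skip the tensor data
    return name, tuple(reversed(dims)), ftype_cur


# Round trip on an in-memory buffer with zero-filled payloads.
buf = io.BytesIO()
write_tensor_record(buf, "a.weight", (2, 3), 1, bytes(2 * 3 * 2))
write_tensor_record(buf, "a.bias", (3,), 0, bytes(3 * 4))
buf.seek(0)

records = []
while (rec := read_tensor_record(buf)) is not None:
    records.append(rec)
print(records)
```

Listing the names this way makes it easy to compare the exported tensors against the ones the loader expects.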
Somehow the embeddings or the tokenization seem to be messed up, and I can't find the reason.
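One thing that may be worth checking is the tensor count the loader reports (num tensors = 229). The rough count below assumes the standard Hugging Face GPT-J layout, where the attention projections have no bias and the MLP layers, layer norms and lm_head do. That layout is an assumption, not something read from the file.

```python
# Rough tensor count for a GPT-J checkpoint under the assumed standard layout.
n_layer = 28    # from the loader log
reported = 229  # "num tensors" from the loader log

per_layer = {
    "ln_1": 2,           # weight + bias
    "attn.q_proj": 1,    # weight only
    "attn.k_proj": 1,
    "attn.v_proj": 1,
    "attn.out_proj": 1,
    "mlp.fc_in": 2,      # weight + bias
    "mlp.fc_out": 2,     # weight + bias
}
top_level = {"wte": 1, "ln_f": 2, "lm_head": 2}

expected = sum(top_level.values()) + n_layer * sum(per_layer.values())
missing_per_layer, remainder = divmod(expected - reported, n_layer)

print(expected, reported, missing_per_layer, remainder)
```

Under that assumption the file is short by exactly two tensors per layer. That would match the fc_in and fc_out biases, since the Linear4bit branch of the export loop writes only the weight. Whether this explains the repeated "A" output is not confirmed.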