ggerganov / ggml

Tensor library for machine learning

Issue inferencing HuggingFace's GPT-J 4-bit model

webpolis opened this issue · comments

This is a follow up of #371 (comment)

After converting a 4-bit GPT-J model to ggml using the convert-h5-to-ggml.py script, inference fails with the following:

main: seed = 1695659205
gptj_model_load: loading model from 'ggml-model-f16.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: ftype   = 1
gptj_model_load: qntvr   = 0
gptj_model_load: ggml ctx size = 12438.93 MB
gptj_model_load: memory_size =   896.00 MB, n_mem = 57344
gptj_model_load: tensor 'transformer.h.0.attn.k_proj.weight' has wrong size in model file
main: failed to load model from 'ggml-model-f16.bin'
gptj_model_load: 

Apparently, the expected size is double the tensor's actual size; I added some verbosity here:

https://github.com/ggerganov/ggml/blob/master/examples/gpt-j/main.cpp#L333

The tensor stored in the file has 8388608 elements, while ggml expects 16777216:

gptj_model_load: tensor 'transformer.h.0.attn.k_proj.weight' has wrong size in model file (16777216, 8388608)
main: failed to load model from 'ggml-model-f16.bin'
gptj_model_load:

I assume this might be related to the model being 4-bit, but I'm not yet sure what to change.
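
For reference, 16777216 is exactly n_embd × n_embd (4096 × 4096) for k_proj, and 8388608 is half of that, which is what you would expect if the 4-bit weights are still stored packed, two values per byte. That is only my assumption about the mismatch, not something I have verified in the loader; the arithmetic is just:

# size bookkeeping (assumption: the nf4 weight is still packed, two 4-bit values per byte)
n_embd = 4096
expected_elements = n_embd * n_embd          # 16777216 fp16/fp32 values for k_proj
packed_nf4_bytes = expected_elements // 2    # 8388608 bytes of packed storage
print(expected_elements, packed_nf4_bytes)   # 16777216 8388608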

I partially solved this, but now it just generates a stream of A's:

❯ ./build/bin/gpt-j -m ./ggml-model-f16.bin -p "A continuación hay una instrucción que describe una tarea. Proporciona una respuesta que complete adecuadamente la solicitud.\n\n### Instrucción:\nEscribe un poema de 4 versos\n\n### Respuesta:\n" -n 512 --top_p 0.8 --temp 0.2
main: seed = 1696200850
gptj_model_load: loading model from './ggml-model-f16.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: ftype   = 1
gptj_model_load: qntvr   = 0
gptj_model_load: ggml ctx size = 12438.93 MB
gptj_model_load: memory_size =   896.00 MB, n_mem = 57344
gptj_model_load: ............................ done
gptj_model_load: model size = 11540.60 MB / num tensors = 229
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: number of tokens in prompt = 68

A continuación hay una instrucción que describe una tarea. Proporciona una respuesta que complete adecuadamente la solicitud.\n\n### Instrucción:\nEscribe un poema de 4 versos\n\n### Respuesta:\n
A
A
A
A
A
A
A
A
A
A

My current implementation:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# load quantized version of https://huggingface.co/bertin-project/bertin-gpt-j-6B-alpaca
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    # llm_int8_enable_fp32_cpu_offload=True
)
model = AutoModelForCausalLM.from_pretrained(
    'bertin-project/bertin-gpt-j-6B-alpaca',
    low_cpu_mem_usage=True,
    device_map=device_map, # split between 2 GPUs
    torch_dtype='auto',
    quantization_config=bnb_config,
    use_cache=False
)
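
To confirm which layers actually ended up as 4-bit modules (and therefore need dequantizing before export), something along these lines should work with the setup above; this is just a sketch on my side, not part of the convert script:

import bitsandbytes as bnb

# list the modules that bitsandbytes replaced with Linear4bit;
# for GPT-J these should be the attention and MLP projections
quantized = [n for n, m in model.named_modules() if isinstance(m, bnb.nn.Linear4bit)]
print(len(quantized))
print(quantized[:4])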

Then dequantize (where needed) and export the tensors:

import copy
import struct
import sys

import numpy as np
import torch
import bitsandbytes as bnb
import bitsandbytes.functional as F

# fout, ftype and list_vars are defined earlier in the script (not shown)
cls = bnb.nn.Linear4bit


def write_data(name, d):
    orig_data_shape = d.shape
    orig_data_size = sys.getsizeof(d)
    d = d.to(torch.float16).squeeze().to('cpu').numpy()
    n_dims = len(d.shape)

    print("Writting: " + name + " with shape: ", d.shape)
    print('Original shape: ', orig_data_shape)
    print((orig_data_size, sys.getsizeof(d)))

    ftype_cur = 0
    if ftype != 0:
        if name[-7:] == ".weight" and n_dims == 2:
            print("  Converting to float16")
            d = d.astype(np.float16)
            ftype_cur = 1
        else:
            print("  Converting to float32")
            d = d.astype(np.float32)
            ftype_cur = 0
    else:
        if d.dtype != np.float32:
            print("  Converting to float32")
            d = d.astype(np.float32)
            ftype_cur = 0

    # header
    str = name.encode('utf-8')
    fout.write(struct.pack("iii", n_dims, len(str), ftype_cur))
    for i in range(n_dims):
        fout.write(struct.pack("i", d.shape[n_dims - 1 - i]))
    fout.write(str)

    # write file
    d.tofile(fout)

# dequantize (if required) and export modules
with torch.no_grad():
    for orig_name, module in model.named_modules():
        if orig_name.endswith("attn.masked_bias") or orig_name.endswith(".attn.bias"):
            print("  Skipping variable: " + orig_name)
            continue

        if isinstance(module, cls):
            name = f'{orig_name}.weight'
            print(f"Dequantizing `{orig_name}`...")

            quant_state = copy.deepcopy(module.weight.quant_state)
            # quant_state.dtype = torch.bfloat16
            weight_deq = F.dequantize_4bit(
                module.weight.data, quant_state=quant_state, quant_type="nf4").to(torch.bfloat16)

            write_data(name, weight_deq)
        elif f'{orig_name}.weight' in list_vars or \
                f'{orig_name}.bias' in list_vars:
            if hasattr(module, 'weight'):
                name = f'{orig_name}.weight'
                data = module.weight.data

                write_data(name, data)

            if hasattr(module, 'bias'):
                name = f'{orig_name}.bias'
                data = module.bias.data

                write_data(name, data)

fout.close()
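
To narrow down whether the broken output comes from the dequantization itself or from the file format, one thing I would try is comparing a single dequantized tensor against the same weight loaded unquantized. A rough sketch, assuming there is enough CPU RAM for the fp16 checkpoint:

import torch
import bitsandbytes.functional as F
from transformers import AutoModelForCausalLM

# reference copy of one weight, loaded without quantization on the CPU
ref_model = AutoModelForCausalLM.from_pretrained(
    'bertin-project/bertin-gpt-j-6B-alpaca',
    torch_dtype=torch.float16,
    device_map={'': 'cpu'},
)
ref = ref_model.transformer.h[0].attn.k_proj.weight.float()

# same weight, dequantized from the nf4 model in the same way as the export loop above
q = model.transformer.h[0].attn.k_proj.weight
deq = F.dequantize_4bit(q.data, quant_state=q.quant_state,
                        quant_type="nf4").float().cpu()

print(ref.shape, deq.shape)        # both should be (4096, 4096)
print((ref - deq).abs().mean())    # should be small if the dequantization is sane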

Somehow the embeddings or the tokenization get messed up, and I can't find the reason.
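
It may also be worth ruling out the export path entirely by generating with the 4-bit model directly in transformers; if that output already looks wrong, the problem is upstream of ggml. A sketch of what I mean, assuming the model and prompt from above:

import torch
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('bertin-project/bertin-gpt-j-6B-alpaca')
prompt = "### Instrucción:\nEscribe un poema de 4 versos\n\n### Respuesta:\n"
inputs = tok(prompt, return_tensors='pt').to(model.device)

# if this produces reasonable Spanish text, the quantized weights are fine
# and the issue is in the conversion/export, not in the model itself
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=True,
                         top_p=0.8, temperature=0.2)
print(tok.decode(out[0], skip_special_tokens=True))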