paolorechia / learn-langchain

Specifying a local model

unoriginalscreenname opened this issue

I really like the download model feature that oobabooga uses and wanted to test out linking directly to another model using your model_path and checkpoint_path in the config. However, I can't seem to get it to work. I downloaded TheBloke's 13B model via the oobabooga downloader and linked to the directory and the file using the config variables, but I get this error:

learn-langchain\gptq_for_llama\quant\quant_linear.py", line 267, in matmul248
matmul_248_kernel[grid](input, qweight, output, scales, qzeros, g_idx, input.shape[0], qweight.shape[1], input.shape[1], bits, maxq, input.stride(0), input.stride(1), qweight.stride(0),
NameError: name 'matmul_248_kernel' is not defined

I honestly find all of these "model" files super confusing. There are safetensors files, bin files, .pt files. It's a real mess. Do you have any help or tips here? Could you provide an example of linking to a local model file? I think what happens by default is that your code is using the cached hugging face model download.

Yeah, even after doing a git clone of the suggested model I still get the same error. I thought this was working last night, so I'm not sure what's happening. If I just go with the default 7B configuration, it works. Any idea?

Hey, it's very likely that local model loading is broken for quantized models, as I've never tested this specific combination.

Or does it also not work with standard unquantized HF models?

It could also be that the whole local model loading is broken, as I stopped using it over the last few days while setting up the quantized version.

Another thing to keep in mind is that my code assumes the quant model uses group size 128.

Also, can you share exactly which 13B model you downloaded? I could try it out on my end.

OK, so loading a local HF model was definitely broken; I committed a fix here:

78683a7

export MODEL_PATH=vicuna-7b-full-hf/ && uvicorn servers.vicuna_server:app
Using config:  {'base_model_size': '7b', 'use_4bit': False, 'use_fine_tuned_lora': False, 'lora_weights': None, 'device': 'cuda', 'model_path': 'vicuna-7b-full-hf/', 'checkpoint_path': None}
Loading checkpoint shards:   0%| | 0/2 [00:00<?, ?it/s]

I'm going to try loading the quantized version locally next.

  1. Loading a quant model should now work; I fixed a bug there. Here's an example I've added to the repo: https://github.com/paolorechia/learn-langchain/blob/main/run_server.sh

  2. Keep in mind this only works for group_size=128; if you need support for other models, send me the model link.

  3. Another thing to keep in mind: if you're running in the same terminal session, the environment variables from previous runs might conflict. If things get weird, start a new session or call something like unset MODEL_PATH (or the equivalent in your Windows terminal); see the sketch right after this list.
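
To make the env-var point concrete, here is a minimal sketch of env-var-driven config in Python. It is an illustration only, not the repo's actual loading code, and every variable name besides MODEL_PATH is an assumption based on the config dump shown earlier. Because exported variables persist for the whole shell session, a value left over from a previous run silently overrides whatever you set up next.

```python
# Illustrative sketch only (assumed names, not the repo's actual config code).
# Exported environment variables outlive a single run, so a stale MODEL_PATH
# from an earlier export still wins in the next invocation.
import os

config = {
    "base_model_size": os.getenv("BASE_MODEL_SIZE", "7b"),         # assumed name
    "use_4bit": os.getenv("USE_4BIT", "false").lower() == "true",  # assumed name
    "model_path": os.getenv("MODEL_PATH"),            # None unless exported
    "checkpoint_path": os.getenv("CHECKPOINT_PATH"),  # assumed name
}
print("Using config:", config)
```

On Linux/macOS, unset MODEL_PATH clears it; in cmd.exe, set MODEL_PATH= does the same, and in PowerShell it's Remove-Item Env:MODEL_PATH.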

Regarding your questions:

I honestly find all of these "model" files super confusing. There are safetensors files, bin files, .pt files. It's a real mess. Do you have any help or tips here? Could you provide an example of linking to a local model file? I think what happens by default is that your code is using the cached hugging face model download.

  1. Safetensors is the newer format supported by GPTQ-For-LLaMa; apparently it doesn't work on all GPUs.
  2. .pt is just a regular PyTorch checkpoint, also used by older versions of the GPTQ-For-LLaMa library.
  3. The .bin files, I believe, are what the Hugging Face format uses (see the loading sketch below the list).
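
For reference, here is a rough sketch of how each of these formats is typically loaded in Python. These are generic library calls, not this repo's loading code, and the file paths are placeholders.

```python
# Generic illustrations of the three formats (paths are placeholders).
import torch
from safetensors.torch import load_file        # pip install safetensors
from transformers import AutoModelForCausalLM

# 1. .safetensors: a flat tensor container, loaded as a plain state dict.
state_dict = load_file("model.safetensors")

# 2. .pt: a regular pickled PyTorch checkpoint (for GPTQ files this is
#    usually also just a state dict).
state_dict = torch.load("model.pt", map_location="cpu")

# 3. pytorch_model*.bin: sharded Hugging Face weights inside a model
#    directory; transformers stitches the shards together for you.
model = AutoModelForCausalLM.from_pretrained("path/to/hf-model-dir")
```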

Hey, I saw you added some changes. I updated and cleared out the existing models I had downloaded. I downloaded the 13B file as described from https://huggingface.co/TheBloke/vicuna-7B-1.1-GPTQ-4bit-128g. I'm turning on use_13b_model and use_4bit. It did download all the model files to the right directory.

I'm still continuing to get this error:

learn-langchain\venv\lib\site-packages\gptq_for_llama\quant\quant_linear.py", line 267, in matmul248
matmul_248_kernel[grid](input, qweight, output, scales, qzeros, g_idx, input.shape[0], qweight.shape[1], input.shape[1], bits, maxq, input.stride(0), input.stride(1), qweight.stride(0),
NameError: name 'matmul_248_kernel' is not defined

Here's exactly what's getting printed to the console:

Using config: {'base_model_size': '13b', 'use_4bit': True, 'use_fine_tuned_lora': False, 'lora_weights': None, 'device': 'cuda', 'model_path': None, 'model_checkpoint': None}
trioton not installed.
triton not installed.
Loading model vicuna-13B-1.1-GPTQ-4bit-128g checkpoint vicuna-13B-1.1-GPTQ-4bit-128g\vicuna-13B-1.1-GPTQ-4bit-128g.safetensors
Loading model ...
storage = cls(wrap_storage=untyped_storage)
Found 3 unique KN Linear values.
Warming up autotune cache ...
0%| | 0/12 [00:00<?, ?it/s]
Traceback (most recent call last):

...then the error

I should be doing everything right here, no? If I don't try to load the 13B model and I turn off 4bit, it all works properly.

Oh, I think this is because I don't have triton.

@triton.jit
def matmul_248_kernel
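
For context, the failure mode looks roughly like the pattern below (a simplified sketch, not the actual quant_linear.py): the kernel is only defined when the triton import succeeds, so on a machine without triton the later call fails with the NameError instead of a clearer message.

```python
# Simplified sketch of the pattern behind the error (not the actual file).
try:
    import triton

    @triton.jit
    def matmul_248_kernel(a_ptr, b_ptr, c_ptr):
        # placeholder body; the real kernel does the quantized matmul
        pass

except ImportError:
    # this matches the "triton not installed." line in the console output
    print("triton not installed.")


def matmul248(inputs, qweight):
    # If the import above failed, matmul_248_kernel was never defined, so
    # this raises: NameError: name 'matmul_248_kernel' is not defined
    return matmul_248_kernel[(1,)](inputs, qweight, None)
```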

Hey, sorry for misunderstanding it the first time.

So it only happens with the 13B GPTQ model, and you seem to have found that it's related to triton.

Why don't you have triton available? Are you on Windows, or did my setup miss a dependency?

Sorry, I forgot the details.

Also if you find more about it, feel free to open a new issue.

Yeah, I'll continue to investigate. I'm using Windows, as I imagine a lot of other folks are, so this might be a good thing to figure out.

The Oobabooga (https://github.com/oobabooga/text-generation-webui) repo gets this to work with a forked version of GPTQ somehow. I'm able to load all these files using their system without a problem, so maybe I can figure out how they're doing it.

Ah, it appears that they use an older CUDA branch of that library: https://github.com/oobabooga/text-generation-webui/blob/main/docs/GPTQ-models-(4-bit-mode).md

I wonder if I can figure out how to swap that in on Windows.

This may be related to #9

I've partially implemented something to help with this: the requirements.txt now points to a fork of GPTQ-For-LLaMa instead of a local directory copy:

gptq-for-llama @ git+https://github.com/paolorechia/GPTQ-for-LLaMa@cadbacf0dcc18f7c56db54561ad53ba0f8db878c

One way I can think of to address this is to also create a fork of the older, compatible oobabooga version, and then change this line in the requirements to use the compatible version of the library.

The forked version would need to be modified to include a setup.py, plus a few code changes, like I did in my GPTQ-For-LLaMa fork; a rough sketch of such a setup.py is below.
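
Roughly, what I mean by "include a setup.py" is something like the sketch below, so pip can install the fork straight from a git URL in requirements.txt. The package name, version, and dependencies here are placeholders, not the actual values from my fork.

```python
# Minimal setup.py sketch for packaging a GPTQ-for-LLaMa fork
# (name, version, and dependencies are placeholders).
from setuptools import setup, find_packages

setup(
    name="gptq-for-llama",
    version="0.1.0",
    packages=find_packages(),
    install_requires=[
        "torch",
        "transformers",
        "safetensors",
    ],
)
```

The requirements.txt entry would then point at that fork with a git+https URL pinned to a commit, just like the existing line above.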

I might be able to implement it sometime in the future; I'm not planning more coding for the next few days. Any chance you would be able to fork it yourself?

I really like how you've got everything set up in your code base, and I'd love to get this part working on Windows. I'll try and figure it out, but I'm not really proficient at this stuff. I have been known to figure things out from time to time, though, so we'll see! I would love to contribute something if I can.