ml-explore / mlx-swift-examples

Examples using MLX Swift

LLMEval not loading Qwen1.5-0.5B model into memory

mobile-appz opened this issue

When trying to load Qwen1.5, the model downloads fully but doesn't appear to load into memory on macOS or iOS. After typing a prompt, the error output is: Failed: unhandledKeys(base: "Embedding", keys: ["biases", "scales"])

Using MLX 0.11.0

The other linked models work as per the repo code, but this is the smallest one, which looks like the best fit for older devices with less RAM, so it would be great to get it working.

Right, we changed quantization in MLX core so now the embedding layer is quantized. We'll need to update Swift to do the same.

Thanks for the info. I was totally unsure of the cause of this error message. To update: I tried to load this model in the LLM tool in mlx-swift-examples and that failed with the same error. I then tried to run the Python code in mlx-examples and the model did load and process a prompt, although the output wasn't really useful for anything, probably because the model is so small.

Those are the commits. Sorry that broke more stuff than I was expecting. Basically the embeddings are quantized by default now, so when we quantize for MLX in Python the model is not usable in Swift, because Swift doesn't support quantized embeddings.

The medium-term solution is to update Swift to quantize embeddings (this is a Swift-only change, nothing is needed from core). But as a temporary patch, we could also upload models without the embedding layers quantized.
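For context, a quantized embedding keeps the lookup table as packed low-bit integers plus the per-group "scales" and "biases" seen in the error above, and dequantizes rows on lookup. A rough sketch of what that could look like on the Swift side (assuming MLX Swift exposes dequantized the way MLX core does; the type name and plumbing here are illustrative, not the actual implementation):

```swift
import MLX

// Illustrative sketch only, not the shipped layer.
struct QuantizedEmbeddingSketch {
    var weight: MLXArray   // packed, quantized lookup table
    var scales: MLXArray   // per-group scales
    var biases: MLXArray   // per-group biases
    var groupSize: Int = 64
    var bits: Int = 4

    // Gather the packed rows for the requested tokens, then dequantize just
    // those rows. Assumes dequantized(_:scales:biases:groupSize:bits:) is
    // available in MLX Swift, mirroring mlx.core.dequantize.
    func callAsFunction(_ indices: MLXArray) -> MLXArray {
        dequantized(
            weight[indices],
            scales: scales[indices],
            biases: biases[indices],
            groupSize: groupSize,
            bits: bits
        )
    }
}
```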

At least one small model without quantized embedding layers, that can run on an older iOS 17-compatible iPhone, would be really useful for experimentation purposes. Thanks.

If we make this change, will it break other models that don't have the quantized embeddings (all the models we have been using to date)? I wonder if we need some way to detect and switch between these modes?

Right, so this is what solves that problem in MLX: https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/utils.py#L336-L346

It's actually really useful because it handles heterogeneously quantized models very cleanly, which is a problem we've had in the past (e.g. old models with unquantized gate matrices, or unquantized LM heads from before we supported more sizes).

Aha, I didn't implement that -- we have just been using the load safetensors function and the update parameters method.

  • implement load_model with quantization support (here)
  • implement embedding quantization (here)
  • adopt in mlx-swift-examples (here)

we have just been using the load safetensors function and the update parameters method.

But how do you know if it's a quantized model or not? Presumably there are some lines of code somewhere that quantize the model based on the config (prior to loading the safetensors)?

The config file indicates it -- I am pretty sure this is how the mlx_lm code (or maybe its predecessor) worked and I just copied that, but perhaps it has moved forward since.

This is what I'm referring to:

https://github.com/ml-explore/mlx-swift-examples/blob/main/Libraries/LLM/Load.swift#L58-L60
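In other words, the loader keys off a quantization block in config.json, roughly like this (a minimal sketch assuming the config layout written by mlx_lm convert; the type names are made up for illustration):

```swift
import Foundation

// config.json for an mlx_lm-converted model contains a block like:
//   "quantization": { "group_size": 64, "bits": 4 }
// and its presence is what tells the loader to build the quantized model.
struct QuantizationSketch: Codable {
    let groupSize: Int
    let bits: Int

    enum CodingKeys: String, CodingKey {
        case groupSize = "group_size"
        case bits
    }
}

struct BaseConfigSketch: Codable {
    // nil for models that were not quantized
    let quantization: QuantizationSketch?
}

func readQuantization(configURL: URL) throws -> QuantizationSketch? {
    let data = try Data(contentsOf: configURL)
    return try JSONDecoder().decode(BaseConfigSketch.self, from: data).quantization
}
```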

MLX LM has always had something like that. It builds the quantized model based on the config. The premise didn't change much. Only two things really:

  1. Quantize all Linear and Embedding modules by default
  2. Of those, only quantize the modules which have a "scales" parameter in their weights
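In Swift terms, point 2 amounts to a predicate along these lines (just a sketch; the namedModules() enumeration and the flat weights dictionary are assumptions about the API shape, not the actual loader code):

```swift
import MLX
import MLXNN

// Decide, per module, whether it should be built as a quantized module by
// checking whether the serialized weights actually carry quantization
// parameters for it. `weights` is assumed to be the flat [String: MLXArray]
// dictionary loaded from the safetensors files, and namedModules() is assumed
// to enumerate (path, module) pairs.
func quantizableModulePaths(
    model: Module,
    weights: [String: MLXArray]
) -> Set<String> {
    var paths = Set<String>()
    for (path, module) in model.namedModules() {
        // 1. Linear and Embedding are candidates by default
        let isCandidate = module is Linear || module is Embedding
        // 2. ...but only the ones whose weights include "scales" get quantized
        let hasScales = weights["\(path).scales"] != nil
        if isCandidate && hasScales {
            paths.insert(path)
        }
    }
    return paths
}
```

That also covers the heterogeneous cases mentioned above: a module with no "scales" in the checkpoint (e.g. an unquantized LM head) is simply left as a regular layer.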

It looks like you added some edge case handling already in there (e.g. https://github.com/ml-explore/mlx-swift-examples/blob/main/Libraries/LLM/Load.swift#L97-L108). The update to MLX LM simplified that kind of stuff a bit.

Yeah, that is actually a port of the Python code, so I must have caught it somewhere in the middle.

The load_model method probably should have been implemented from the start but I never used it and it just got lost.

Now I think we have a good idea of what needs to be done here.

Is there a temporary solution to this? I'm running into the same issue with OpenELM, but that doesn't seem to be supported for <0.11.0.

Perhaps this will help (below)? I think it fixes the error with OpenELM; you may need to update to the latest versions of MLX and get the latest version of mlx-swift-examples.

#63

It does support the unquantized model, but it breaks when using the quantized model with the same error as above (Failed: unhandledKeys(base: "Embedding", keys: ["biases", "scales"])), and downgrading doesn't seem to work either because OpenELM wasn't supported yet back then. Are there any patches for this embedding quantization mismatch?

Not yet, but it is next on my list.

#76 should fix this

Thank you very much for fixing this, it's working now