huggingface / exporters

Export Hugging Face models to Core ML and TensorFlow Lite

Export Phi-2

miguel-arrf opened this issue · comments

Hi!

I'm converting Microsoft's Phi-2 model to use it with swift-transformers.

The conversion process is actually very seamless:

from transformers import AutoTokenizer, AutoModelForCausalLM
from exporters.coreml import CoreMLConfig
from exporters.coreml import export

model = "microsoft/phi-2"

# Load tokenizer and PyTorch weights from the Hub
tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=True)
pt_model = AutoModelForCausalLM.from_pretrained(model, trust_remote_code=True, torchscript=True)

class Phi2CoreMLConfig(CoreMLConfig):
    modality = "text"


coreml_config = Phi2CoreMLConfig(pt_model.config, task="text-generation")
mlmodel = export(tokenizer, pt_model, coreml_config)
mlmodel.save("Phi2.mlpackage")

Note that, by default, the export function uses float32.

Then I'm using the swift-chat repo to run the model, with the Llama-2 tokenizer. It works almost perfectly out of the box: the only issue is a missing 'space' (' ') token, but apart from that it works.

The issue is that inference is extremely slow (I have an M1 MacBook Pro with 16 GB of RAM), and the model uses close to 11 GB of memory. Although inference is slow, the output makes sense.

Given that it is so slow, I converted the model using float16:

mlmodel = export(tokenizer, pt_model, coreml_config, quantize="float16")

The model is now 5 GB, but inference produces gibberish: the output used to make sense, and now it's just a string of exclamation marks. I also copied the 5 GB model to my iPhone 14 Pro; after a few seconds, while it is still loading, the app simply closes itself.

  1. How can I further decrease the model size? Can we quantize the model more aggressively with Core ML?
  2. Why is inference so slow (with the default float32)?
  3. Why is the quantize="float16" model basically instantaneous, but producing gibberish?

Thank you so much for the help!

Hello @miguel-arrf!

Thanks a lot for the detailed report, much appreciated 🙌 I agree that Phi-2 is a very exciting model to try! There are additional quantization techniques that we could apply, but I'd suggest we debug float16 first. Let me try to retrace your steps and I'll get back to you soon :)
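For reference, once float16 works we should be able to shrink the model further by quantizing the weights of the exported package directly with coremltools. This is only a sketch, assuming coremltools 7.x and the default ML Program export saved as Phi2.mlpackage; we'd still have to evaluate output quality (and palettization is another option worth trying):

import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

# Load the previously exported ML Program package
mlmodel = ct.models.MLModel("Phi2.mlpackage")

# Quantize weights to 8-bit; roughly halves the size of a float16 model
config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(mode="linear_symmetric")
)
mlmodel_int8 = linear_quantize_weights(mlmodel, config=config)
mlmodel_int8.save("Phi2-int8.mlpackage")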

Regarding speed in float32, it could be for a variety of reasons: perhaps some layers are being scheduled to run on CPU, or perhaps the model is using too much memory and your system is swapping. I'll take a look too. In addition, there are some performance optimization techniques for LLMs (KV caching, in particular) that we are currently working on and that should help a lot. I'll keep you posted about that as well.
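In the meantime, if you want to see how compute unit placement affects speed, you can load the package with different compute units and time a prediction. A rough sketch below; note that the input name and shape are placeholders, the real ones can be read from mlmodel.get_spec().description.input:

import time
import numpy as np
import coremltools as ct

# Try CPU_ONLY, CPU_AND_GPU, CPU_AND_NE, or ALL to compare scheduling
mlmodel = ct.models.MLModel("Phi2.mlpackage", compute_units=ct.ComputeUnit.CPU_AND_GPU)

# Placeholder input: check the model spec for the actual input names and shapes
dummy = {"input_ids": np.ones((1, 64), dtype=np.int32)}

start = time.time()
mlmodel.predict(dummy)
print(f"Prediction took {time.time() - start:.2f}s")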

Finally, if you used the latest version of exporters, I believe that the tokenizer should have been picked up automatically by swift-transformers / swift-chat. I'll check that out too.

I wanted to know if this Swift Transformers conversion of Phi-2 is available on the Hugging Face Hub.

Hi @omkar806: not yet, but soon. We found some problems during conversion of the model. As @miguel-arrf described, float16 inference does not work after conversion; we probably need to keep some layers in float32. I haven't had time to debug in depth, but I want to do it soon. We'll post here when it's done.
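For the record, if we end up converting with coremltools directly instead of the exporters wrapper, keeping selected ops in float32 can be done with an op_selector at conversion time. The sketch below only illustrates the mechanism; the tracing step, input shapes, and the choice of which ops to keep in float32 (softmax here, purely as an example) are placeholders, not a verified fix for Phi-2:

import numpy as np
import torch
import coremltools as ct
from transformers import AutoModelForCausalLM

pt_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", trust_remote_code=True, torchscript=True
)
pt_model.eval()

# Trace with a dummy input (sequence length chosen arbitrarily for illustration)
example_input = torch.ones((1, 64), dtype=torch.int64)
traced = torch.jit.trace(pt_model, example_input)

# Keep the selected ops in float32, convert everything else to float16
precision = ct.transform.FP16ComputePrecision(
    op_selector=lambda op: op.op_type != "softmax"
)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input_ids", shape=example_input.shape, dtype=np.int32)],
    compute_precision=precision,
    convert_to="mlprogram",
)
mlmodel.save("Phi2-mixed.mlpackage")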