huggingface / exporters

Export Hugging Face models to Core ML and TensorFlow Lite

Export Phi-2

miguel-arrf opened this issue · comments

Hi!

I'm converting Microsoft's Phi-2 model to use it with swift-transformers.

The conversion process is actually very seamless:

from transformers import AutoTokenizer, AutoModelForCausalLM
from exporters.coreml import CoreMLConfig
from exporters.coreml import export

model = "microsoft/phi-2"

# Load tokenizer and PyTorch weights from the Hub
tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=True)
pt_model = AutoModelForCausalLM.from_pretrained(model, trust_remote_code=True, torchscript=True)

class Phi2CoreMLConfig(CoreMLConfig):
    modality = "text"


coreml_config = Phi2CoreMLConfig(pt_model.config, task="text-generation")
mlmodel = export(tokenizer, pt_model, coreml_config)
mlmodel.save("Phi2.mlpackage")

Note that, by default, the export function uses float32.

Then I'm using the swift-chat repo to run the model, with the Llama-2 tokenizer. It works almost perfectly out of the box: the only issue is a missing 'space' (' ') token, but apart from that it works.

The issue is that inference is extremely slow (I have an M1 MacBook Pro with 16 GB of RAM), and the model uses close to 11 GB of memory. Although inference is slow, the output makes sense.

Given that it is so slow, I converted the model using float16:

mlmodel = export(tokenizer, pt_model, coreml_config, quantize="float16")

The model is now 5 GB, but inference produces gibberish: the output used to make sense, and now it's just a string of exclamation marks. I also copied the 5 GB model to my iPhone 14 Pro; after a few seconds, while it is still loading, the app simply closes itself.

  1. How can I further decrease the model size? Can we quantize the model more aggressively with Core ML?
  2. Why is inference so slow (with the default float32)?
  3. Why is the quantize="float16" model basically instantaneous, but producing gibberish?

Thank you so much for the help!

Hello @miguel-arrf!

Thanks a lot for the detailed report, much appreciated 🙌 I agree that Phi-2 is a very exciting model to try! There are additional quantization techniques that we could apply, but I'd suggest we debug float16 first. Let me try to retrace your steps and I'll get back to you soon :)
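For reference, once float16 works we should be able to shrink the model further by quantizing the weights of the exported package directly with coremltools. This is only a sketch, assuming coremltools 7.x and the default ML Program export saved as Phi2.mlpackage; we'd still have to evaluate output quality (and palettization is another option worth trying):

import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

# Load the previously exported ML Program package
mlmodel = ct.models.MLModel("Phi2.mlpackage")

# Quantize weights to 8-bit; roughly halves the size of a float16 model
config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(mode="linear_symmetric")
)
mlmodel_int8 = linear_quantize_weights(mlmodel, config=config)
mlmodel_int8.save("Phi2-int8.mlpackage")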

Regarding speed in float32, it could be for a variety of reasons: perhaps some layers are being scheduled to run on CPU, or perhaps the model is using too much memory and your system is swapping. I'll take a look too. In addition, there are some performance optimization techniques for LLMs (KV caching, in particular) that we are currently working on and that should help a lot. I'll keep you posted about that as well.
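In the meantime, if you want to see how compute unit placement affects speed, you can load the package with different compute units and time a prediction. A rough sketch below; note that the input name and shape are placeholders, the real ones can be read from mlmodel.get_spec().description.input:

import time
import numpy as np
import coremltools as ct

# Try CPU_ONLY, CPU_AND_GPU, CPU_AND_NE, or ALL to compare scheduling
mlmodel = ct.models.MLModel("Phi2.mlpackage", compute_units=ct.ComputeUnit.CPU_AND_GPU)

# Placeholder input: check the model spec for the actual input names and shapes
dummy = {"input_ids": np.ones((1, 64), dtype=np.int32)}

start = time.time()
mlmodel.predict(dummy)
print(f"Prediction took {time.time() - start:.2f}s")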

Finally, if you used the latest version of exporters, I believe that the tokenizer should have been picked up automatically by swift-transformers / swift-chat. I'll check that out too.

I wanted to know if this Swift Transformers conversion of Phi-2 is available on the Hugging Face Hub.

Hi @omkar806: not yet, but soon. We found some problems during conversion of the model. As @miguel-arrf described, float16 inference does not work after conversion; we probably need to keep some layers in float32. I haven't had time to debug in depth, but I want to do it soon. We'll post here when it's done.
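For the record, if we end up converting with coremltools directly instead of the exporters wrapper, keeping selected ops in float32 can be done with an op_selector at conversion time. The sketch below only illustrates the mechanism; the tracing step, input shapes, and the choice of which ops to keep in float32 (softmax here, purely as an example) are placeholders, not a verified fix for Phi-2:

import numpy as np
import torch
import coremltools as ct
from transformers import AutoModelForCausalLM

pt_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", trust_remote_code=True, torchscript=True
)
pt_model.eval()

# Trace with a dummy input (sequence length chosen arbitrarily for illustration)
example_input = torch.ones((1, 64), dtype=torch.int64)
traced = torch.jit.trace(pt_model, example_input)

# Keep the selected ops in float32, convert everything else to float16
precision = ct.transform.FP16ComputePrecision(
    op_selector=lambda op: op.op_type != "softmax"
)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input_ids", shape=example_input.shape, dtype=np.int32)],
    compute_precision=precision,
    convert_to="mlprogram",
)
mlmodel.save("Phi2-mixed.mlpackage")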