microsoft / onnxruntime-genai

Generative AI extensions for onnxruntime

[Feature Request] phi-3-small-128k-onnx-cpu model

Ben-Epstein opened this issue

The onnx-gpu model for phi-3-small-128k is great, a perfect balance of quality and speed. Is there a plan to support a cpu version?

Thanks!

The Phi-Small model contains the SparseAttention operator and requires the kernel to be defined and implemented in ONNX Runtime. As of now, we only have the kernel implemented for CUDA.
We intend to add a CPU kernel as well in the near future. Once that is added, we will be able to support Phi-Small on CPU as well.

> We intend to add a CPU kernel as well in the near future. Once that is added, we will be able to support Phi-Small on CPU as well.

Interesting... according to these two pages, "Run Phi-3 language models with the ONNX Runtime generate() API" and "Run the Phi-3 vision model with the ONNX Runtime generate() API", they can run on CPU. Am I missing something?

Just an FYI, I'm new to Python/CUDA/PyTorch/etc. and this ecosystem.

> Interesting... according to these two pages, "Run Phi-3 language models with the ONNX Runtime generate() API" and "Run the Phi-3 vision model with the ONNX Runtime generate() API", they can run on CPU. Am I missing something?

All Phi-3 family models except phi-3-small can run on CPU. The linked documentation doesn't mention the phi-3-small model; maybe we should explicitly call that out in the doc.
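For the variants that do run on CPU (e.g. phi-3-mini), a minimal sketch with the onnxruntime-genai Python package might look like the following. The model directory path, the `max_length` value, and the function name are assumptions for illustration, and the generator API surface has changed between package versions, so treat this as a sketch rather than an official example:

```python
def run_phi3_cpu(model_dir: str, prompt: str, max_length: int = 256) -> str:
    """Generate text from a Phi-3 ONNX model on CPU.

    model_dir is assumed to point at a CPU export, e.g. the
    cpu_and_mobile/cpu-int4-rtn-block-32 folder of a Phi-3-mini
    ONNX release; this is a sketch, not an official sample.
    """
    # Imported lazily so the module loads even without the package.
    import onnxruntime_genai as og  # pip install onnxruntime-genai

    model = og.Model(model_dir)        # a CPU export runs on the CPU EP
    tokenizer = og.Tokenizer(model)

    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length)

    generator = og.Generator(model, params)
    generator.append_tokens(tokenizer.encode(prompt))

    # Token-by-token decode loop until max_length or an end token.
    while not generator.is_done():
        generator.generate_next_token()

    return tokenizer.decode(generator.get_sequence(0))
```

With a CPU-only build of the package installed, the same code would fail to load a phi-3-small export, since its SparseAttention operator has no CPU kernel yet, which is the limitation discussed above.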