intel / intel-npu-acceleration-library

Intel® NPU Acceleration Library

Profiling and Allocation Control for Partially Offloaded Models

BICHENG opened this issue

Problem Description
While testing my diffuser model with the Intel NPU Acceleration Library, I noticed that the model is sometimes not fully offloaded to the NPU. Instead, a significant portion of it remains on the CPU without any errors being reported, which can lead to unexpectedly poor performance.
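
For reference, the way I currently try to spot this is to compile the model and then walk its submodules to see which ones were actually replaced. This is only a heuristic sketch: it assumes the compile(model, dtype=...) entry point from the README and that offloaded layers are replaced with modules from the library's own namespace.

```python
import torch
import intel_npu_acceleration_library as npu_lib

# Placeholder network standing in for the real diffuser sub-model.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.GELU(),
    torch.nn.Linear(512, 512),
)

compiled = npu_lib.compile(model, dtype=torch.float16)

# Heuristic: modules whose class still comes from torch.nn presumably stayed
# on the CPU, while replaced ones come from the acceleration library.
for name, module in compiled.named_modules():
    print(f"{name or '<root>'}: {type(module).__name__} ({type(module).__module__})")
```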

Desired Solution
To address this issue, I suggest introducing a profiler that identifies the most time-consuming operations within the model as well as the "endpoints/breakpoints" where the transformation fails. Based on the profiling results, users could then be given a way to manually decompose the model.
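
As a concrete starting point, the standard torch.profiler already exposes per-operator timings; a minimal sketch of what I mean (the model and input are placeholders):

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

def profile_forward(model: torch.nn.Module, example_input: torch.Tensor, top: int = 15) -> None:
    """Print the most time-consuming operators for a single forward pass."""
    model.eval()
    with torch.no_grad(), profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        with record_function("model_forward"):
            model(example_input)
    # Sorting by total CPU time highlights operators that were not offloaded
    # (or that remain expensive even after offloading).
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=top))
```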

I've noticed that there are quite a few NumPy operations during the compilation process, which don't seem to be part of torch's own compilation.

Additionally, I acknowledge that this request might further complicate the existing issue: #26

Alternative Approaches
An alternative approach could be to improve the existing offloading mechanism to ensure that the entire model is progressively decomposed into complete blocks and offloaded to the NPU whenever possible.

The remaining parts that cannot be offloaded should be left with their original operators and kept on their original device.
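
A rough sketch of what such progressive, per-block offloading could look like from the user's side. It assumes compile() also accepts submodules, which may not be the case; it is only meant to illustrate the fallback behaviour I have in mind, not a proposed implementation.

```python
import warnings

import torch
import intel_npu_acceleration_library as npu_lib

def offload_blockwise(model: torch.nn.Module, dtype=torch.float16) -> torch.nn.Module:
    """Try to offload each top-level block; keep failing blocks unchanged on their original device."""
    for name, child in model.named_children():
        try:
            # Assumption: compile() accepts submodules, not only the full model.
            setattr(model, name, npu_lib.compile(child, dtype=dtype))
        except Exception as exc:
            # At minimum, the user should be told which block stayed behind.
            warnings.warn(f"Block '{name}' could not be offloaded to the NPU: {exc}")
    return model
```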

Additional Context
This requirement might significantly increase your workload, but the feature would be useful for large models or scenarios where NPU offloading is not entirely successful. It would allow fine-grained control over the execution of individual operations, which can be crucial for optimizing performance.

At the very least, users should be informed when the transformation fails for part of the model. That way they can work around this "feature" rather than unknowingly expecting the NPU to handle low-performance operators.

As a wild idea, maybe you could achieve an iGPU+NPU combo by directly using IPEX (intel/intel-extension-for-pytorch) in some magical way?🤔
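
Purely to illustrate that idea, the split might look roughly like the sketch below. Everything here is speculative: it assumes an IPEX build that exposes the "xpu" device and that an iGPU stage and an NPU-compiled stage can actually coexist in one pipeline, which is exactly the open question.

```python
import torch
import intel_extension_for_pytorch as ipex  # provides the "xpu" device on Intel GPUs
import intel_npu_acceleration_library as npu_lib

# Hypothetical two-stage pipeline: the first stage runs on the iGPU via IPEX,
# the second stage is compiled for the NPU.
stage_gpu = ipex.optimize(torch.nn.Linear(512, 512).eval().to("xpu"))
stage_npu = npu_lib.compile(torch.nn.Linear(512, 512), dtype=torch.float16)

def forward(x: torch.Tensor) -> torch.Tensor:
    y = stage_gpu(x.to("xpu"))
    # Move activations back to host memory before handing them to the NPU path.
    return stage_npu(y.to("cpu"))
```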

Also, I'm very curious about the connection between this work and OpenVINO.

Profiling is already implemented, as we support torch.profiler (for an example implementation, look at the profile_llm script).
I agree we should provide more control to the users, both in quantization (we now support the Neural Compressor API, which should give the user the ability to select a quantization scheme) and in general model compilation.
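
For reference, a minimal sketch of the precision selection available today, assuming the compile(model, dtype=...) argument shown in the README (selecting a full Neural Compressor quantization scheme would need more than this):

```python
import torch
import intel_npu_acceleration_library as npu_lib

def make_model() -> torch.nn.Module:
    return torch.nn.Linear(512, 512)  # placeholder for the real network

# The precision/quantization choice is made explicitly by the user
# through the dtype argument.
model_fp16 = npu_lib.compile(make_model(), dtype=torch.float16)
model_int8 = npu_lib.compile(make_model(), dtype=torch.int8)
```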

Also, I'm very curious about the connection between this work and OpenVINO.

OpenVINO is used as the backend for NPU operations. For more info, please tune in to the webinar I'll be doing next Wednesday about this:

I will continue to look for ways to identify the parts of the model that need to be "cut out".
Thank you for the recent updates, great work!