[Performance] Replace MatMul with FullyConnected
kory opened this issue · comments
Kory Watson commented
Describe the issue
QNN's implementation of MatMul is much slower than the equivalent FullyConnected implementation.
The QNN execution provider (EP) should insert Reshape ops around 3D/4D MatMul nodes to flatten the input to 2D, then replace every MatMul with a FullyConnected layer.
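For illustration, the proposed rewrite can be sketched in NumPy. This is a minimal sketch and not the QNN EP's code: it assumes the second MatMul operand is a plain 2D matrix (no batch dimensions) and that FullyConnected takes its weights as [out_features, in_features], which should be checked against the QNN op definition. The names `fully_connected` and `matmul_via_fully_connected` are hypothetical helpers.

```python
import numpy as np


def fully_connected(x2d, weight, bias=None):
    # Reference FullyConnected on a 2D input of shape [rows, K].
    # Assumption: weight is laid out as [N, K] (out_features, in_features).
    out = x2d @ weight.T
    if bias is not None:
        out = out + bias
    return out


def matmul_via_fully_connected(a, b):
    # Compute a @ b as Reshape -> FullyConnected -> Reshape.
    # Only valid when b is 2D; a batched second operand needs a different rewrite.
    if b.ndim != 2:
        raise ValueError("second operand must be 2D for this rewrite")
    k, n = b.shape
    if a.ndim < 2 or a.shape[-1] != k:
        raise ValueError("inner dimensions do not match")

    lead = a.shape[:-1]             # e.g. (batch, seq) for a 3D input
    flat = a.reshape(-1, k)         # flatten all leading dims into rows
    out = fully_connected(flat, b.T)  # b.T matches the assumed [N, K] layout
    return out.reshape(*lead, n)    # restore the original leading dims


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((2, 3, 4, 5))  # 4D activation
    b = rng.standard_normal((5, 7))        # 2D weight
    assert np.allclose(matmul_via_fully_connected(a, b), a @ b)
```

When the weight is a constant initializer, the transpose in this sketch could be folded into the weights ahead of time, leaving only the two reshapes at runtime.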
To reproduce
See attached models.
Urgency
No response
Platform
Linux
OS Version
Ubuntu 22.04
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
Latest
ONNX Runtime API
Python
Architecture
ARM64
Execution Provider
Other / Unknown
Execution Provider Library Version
QNN 2.20
Model File
Is this a quantized model?
No
Adrian Lizarraga commented
Hi @kory, QNN EP currently translates ONNX's 2D MatMul to QNN's MatMul op. Should QNN EP always translate ONNX's 2D MatMul to QNN's FullyConnected op instead?