microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator

Home Page: https://onnxruntime.ai

[Performance] Replace MatMul with FullyConnected

kory opened this issue

Describe the issue

QNN's implementation of MatMul is much slower than the equivalent FullyConnected implementation.

The QNN EP should insert Reshape ops around 3D/4D MatMuls to flatten the inputs to 2D, then replace each such MatMul with a FullyConnected layer.
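
For reference, a minimal NumPy sketch of the equivalence being proposed, assuming the second MatMul input is a constant 2D weight (higher-rank second inputs would need different handling): flattening the leading dimensions to 2D, multiplying, and reshaping back gives the same result as the original 3D MatMul.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 64)).astype(np.float32)  # 3D activation [batch, seq, k]
w = rng.standard_normal((64, 32)).astype(np.float32)    # 2D weight [k, n]

# Original 3D MatMul.
y_matmul = x @ w                                         # (2, 8, 32)

# Proposed rewrite: Reshape -> 2D MatMul (FullyConnected-friendly) -> Reshape.
x2d = x.reshape(-1, x.shape[-1])                         # (16, 64)
y2d = x2d @ w                                            # (16, 32)
y_rewritten = y2d.reshape(x.shape[0], x.shape[1], w.shape[1])

assert np.allclose(y_matmul, y_rewritten)
```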

To reproduce

See attached models.

Urgency

No response

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

Latest

ONNX Runtime API

Python

Architecture

ARM64

Execution Provider

Other / Unknown

Execution Provider Library Version

QNN 2.20

Model File

MatMul_Repros.zip

Is this a quantized model?

No

Hi @kory, QNN EP currently translates ONNX's 2D MatMul to QNN's MatMul op. Should QNN EP always translate ONNX's 2D MatMul to QNN FullyConnected?
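
For the 2D case, a minimal sketch of the lowering being discussed, assuming a FullyConnected op that stores its weight as [out_features, in_features] and computes out = in · weightsᵀ (the weight from the ONNX MatMul would then be transposed once when building the QNN graph):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)   # 2D activation [m, k]
w = rng.standard_normal((64, 32)).astype(np.float32)  # ONNX MatMul weight [k, n]

# ONNX 2D MatMul.
y_matmul = x @ w

# FullyConnected-style lowering (assumed weight layout [n, k]):
# the ONNX weight is transposed once at graph-build time,
# and the op computes x @ weight.T.
w_fc = np.ascontiguousarray(w.T)                      # [32, 64]
y_fc = x @ w_fc.T

assert np.allclose(y_matmul, y_fc)
```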