microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator

Home Page: https://onnxruntime.ai

[Performance] Replace MatMul with FullyConnected

kory opened this issue

Describe the issue

QNN's implementation of MatMul is much slower than the equivalent FullyConnected implementation.

The QNN EP should insert Reshape ops around 3D/4D MatMuls to flatten the inputs to 2D, then replace each such MatMul with a FullyConnected layer.
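
For reference, a minimal NumPy sketch of the equivalence being proposed, assuming the second MatMul input is a constant 2D weight (higher-rank second inputs would need different handling): flattening the leading dimensions to 2D, multiplying, and reshaping back gives the same result as the original 3D MatMul.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 64)).astype(np.float32)  # 3D activation [batch, seq, k]
w = rng.standard_normal((64, 32)).astype(np.float32)    # 2D weight [k, n]

# Original 3D MatMul.
y_matmul = x @ w                                         # (2, 8, 32)

# Proposed rewrite: Reshape -> 2D MatMul (FullyConnected-friendly) -> Reshape.
x2d = x.reshape(-1, x.shape[-1])                         # (16, 64)
y2d = x2d @ w                                            # (16, 32)
y_rewritten = y2d.reshape(x.shape[0], x.shape[1], w.shape[1])

assert np.allclose(y_matmul, y_rewritten)
```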

To reproduce

See attached models.

Urgency

No response

Platform

Linux

OS Version

Ubuntu 22.04

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

Latest

ONNX Runtime API

Python

Architecture

ARM64

Execution Provider

Other / Unknown

Execution Provider Library Version

QNN 2.20

Model File

MatMul_Repros.zip

Is this a quantized model?

No

Hi @kory, QNN EP currently translates ONNX's 2D MatMul to QNN's MatMul op. Should QNN EP always translate ONNX's 2D MatMul to QNN FullyConnected?
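
For the 2D case, a minimal sketch of the lowering being discussed, assuming a FullyConnected op that stores its weight as [out_features, in_features] and computes out = in · weightsᵀ (the weight from the ONNX MatMul would then be transposed once when building the QNN graph):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)   # 2D activation [m, k]
w = rng.standard_normal((64, 32)).astype(np.float32)  # ONNX MatMul weight [k, n]

# ONNX 2D MatMul.
y_matmul = x @ w

# FullyConnected-style lowering (assumed weight layout [n, k]):
# the ONNX weight is transposed once at graph-build time,
# and the op computes x @ weight.T.
w_fc = np.ascontiguousarray(w.T)                      # [32, 64]
y_fc = x @ w_fc.T

assert np.allclose(y_matmul, y_fc)
```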