KhronosGroup / MoltenVK

MoltenVK is a Vulkan Portability implementation. It layers a subset of the high-performance, industry-standard Vulkan graphics and compute API over Apple's Metal graphics framework, enabling Vulkan applications to run on macOS, iOS and tvOS.

Repository from GitHub: https://github.com/KhronosGroup/MoltenVK

request: report integerDotProduct4x8BitPackedSignedAccelerated support for the llama.cpp Vulkan backend

oscarbg opened this issue

Hi,
I just built llama.cpp on a Mac with the Vulkan backend enabled..
./llama-cli --list-devices
shows no integer dot product usage (int dot: 0 below):

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Apple M4 (MoltenVK) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 32768 | **int dot**: 0 | matrix cores: none
Available devices:
  Metal: Apple M4 (10922 MiB, 10922 MiB free)

Searching the code, it seems this is because, for it to be enabled, integerDotProduct4x8BitPackedSignedAccelerated also needs to be supported (reported as true)..

See the code of the check:
https://github.com/ggml-org/llama.cpp/blob/79c1160b073b8148a404f3dd2584be1606dccc66/ggml/src/ggml-vulkan/ggml-vulkan.cpp#L4007
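
For context, this is roughly how an application can query that property from the driver; a minimal sketch, not taken from llama.cpp itself, assuming a valid VkPhysicalDevice and Vulkan 1.3 (where VK_KHR_shader_integer_dot_product is core):

```cpp
// Sketch: query whether the driver advertises the accelerated packed
// signed int8 dot product. "physicalDevice" is assumed to be a valid
// VkPhysicalDevice from an instance targeting Vulkan 1.3.
#include <vulkan/vulkan.h>
#include <cstdio>

void printIntDotSupport(VkPhysicalDevice physicalDevice) {
    VkPhysicalDeviceShaderIntegerDotProductProperties dotProps{};
    dotProps.sType =
        VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_INTEGER_DOT_PRODUCT_PROPERTIES;

    VkPhysicalDeviceProperties2 props2{};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &dotProps;

    vkGetPhysicalDeviceProperties2(physicalDevice, &props2);

    std::printf("integerDotProduct4x8BitPackedSignedAccelerated = %u\n",
                dotProps.integerDotProduct4x8BitPackedSignedAccelerated);
}
```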

The MoltenVK report at
https://vulkan.gpuinfo.org/displayreport.php?id=40783#properties_extensions shows
integerDotProduct4x8BitPackedSignedAccelerated = false

EDIT: really just asking for this to be enabled in case we can expect performance gains from using this feature; if not, it's OK as it is..

EDIT2: according to vkpeak, int8 dot product throughput on the M4 is excellent:

device       = Apple M4

fp32-scalar  = 3947.38 GFLOPS
fp32-vec4    = 3797.12 GFLOPS

fp16-scalar  = 3937.71 GFLOPS
fp16-vec4    = 4008.02 GFLOPS

int32-scalar = 1009.51 GIOPS
int32-vec4   = 1009.63 GIOPS

int16-scalar = 1009.56 GIOPS
int16-vec4   = 1009.73 GIOPS

int8-dotprod = 14325.40 GIOPS

To my knowledge, there's no special support for the dot() function in MSL for integral types, not even in Metal 4. I presume from this that none of the AGX* instruction sets have an instruction specifically for performing a dot product of packed int8_t vectors, which is what that property indicates. Nor is there any function for performing a horizontal add on integer vectors, which is step two of the dot product operation. (hadd() is Half ADD, i.e. (x + y) / 2.) On the other hand, that number is quite impressive. Maybe I'm wrong, and the Metal compiler is smart enough to optimize our writing it out the long way to the specific instruction. Do you have numbers for other GPUs that do support integerDotProduct4x8BitPackedSignedAccelerated?

  • AGX: Apple Graphics aCCelerator, Apple's internal name for their own GPU architecture.
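
To illustrate what "writing it out the long way" means above, here is a hedged sketch in plain C++ of the operation that integerDotProduct4x8BitPackedSignedAccelerated promises to accelerate as a single instruction; a shader would do the equivalent lane-by-lane unpack, multiply, and accumulate. The function name and structure are illustrative only, not MoltenVK or llama.cpp code:

```cpp
// Sketch: emulate a packed signed 4x8-bit dot product with accumulate,
// i.e. the operation that integerDotProduct4x8BitPackedSignedAccelerated
// reports as hardware-accelerated. Illustrative only.
#include <cstdint>

int32_t dot4x8PackedSignedAcc(uint32_t a, uint32_t b, int32_t acc) {
    for (int i = 0; i < 4; ++i) {
        // Extract byte i of each operand and sign-extend it to 32 bits.
        int32_t ai = static_cast<int8_t>((a >> (8 * i)) & 0xFFu);
        int32_t bi = static_cast<int8_t>((b >> (8 * i)) & 0xFFu);
        acc += ai * bi;  // lane-wise multiply, then horizontal accumulate
    }
    return acc;
}
```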

@cdavis5e thanks for the detailed response.. I somehow missed thanking you earlier for your very informative answer..
I have vkpeak results for an NVIDIA RTX 4070, which has integerDotProduct4x8BitPackedSignedAccelerated and exposes int dot: 1 in llama.cpp (this can also be seen in the vulkan.gpuinfo.org reports)..
Curiously, given what you say, NVIDIA's dot product performance is similar:
(nihui/vkpeak#7 (comment))

RTX 4070
int8-dotprod = 15665.38 GIOPS

on this Ada GPU..
It seems like a HW bug, as the Blackwell 5060 Ti is much faster:
nihui/vkpeak#25 (comment)

device = NVIDIA GeForce RTX 5060 Ti
int8-dotprod = 99007.88 GIOPS