corsix / amx

Apple AMX Instruction Set

A15/M2 Performance

philipturner opened this issue

After analyzing the die shots and speculating about performance, I believe I've found a major change to the AMX architecture. Would you mind reading through the README of amx-benchmarks and helping me test the hypothesis? You don't need to rent an M2 from the cloud; I can test on my A15.

permalink for the hypothesis in question

I'm afraid I don't follow exactly what your hypothesis is. I'm also not convinced by a number of things in your README.

I'll try to describe where I'm at. First, some terminology:

  • P CPU cluster: 1-4 performance CPU cores (I think my P CPU cluster is what you're calling a P block?)
  • E CPU cluster: 1-4 efficiency CPU cores
  • P AMX cluster: an AMX coprocessor associated with a P CPU cluster
  • E AMX cluster: an AMX coprocessor associated with an E CPU cluster
  • AMX cell: if you imagine the entire AMX computation grid as a 64-byte by 64-byte grid, a cell is an aligned 8-byte by 8-byte sub-grid, in which FMA/MAC operations are performed (a cell is approximately a PE in many Apple patents)

There's a 1:1 correspondence between CPU clusters and AMX clusters, and on die shots you'll see them colocated, along with a bunch of L2. Note that the clock speed of the AMX cluster needn't equal the clock speed of the associated CPU cluster.

To satisfy the needs of the ISA, each AMX cluster needs to contain:

  • X register file, which is at least 512 bytes per CPU core in the associated CPU cluster
  • Y register file, which is at least 512 bytes per CPU core in the associated CPU cluster
  • Z register file, which is at least 4096 bytes per CPU core in the associated CPU cluster
  • Some number of AMX cells, in some grid arrangement. This could be an 8x8 grid, or 8x2, or 4x4, or anything else of the form 1/2/4/8 by 1/2/4/8 (possibly what you're calling an AMX block is what I'm calling an 8x1 grid of AMX cells, meaning 8 blocks gives you an 8x8 grid, or 2 blocks gives you an 8x2 grid?)
  • Some control logic

The X and Y register files (or a combined X and Y register file) will be separate from the AMX cells, but the Z register file is likely split up and colocated in the AMX cells. The more AMX cells there are in a cluster, the smaller the amount of Z register file in each cell. If you had an 8x8 grid of cells, then you'd only need 64 bytes of Z (per CPU core) per cell. If you had an 8x2 grid of cells, then you'd need 256 bytes of Z (per CPU core) per cell. In particular, this could manifest itself as E AMX clusters having fewer cells than P AMX clusters, but each E cell being slightly larger than a P cell.
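To put concrete numbers on that split, here's a quick sketch (plain C; the grid shapes are just the illustrative examples from above):

```c
#include <stdio.h>

// Bytes of Z register file each AMX cell would need to hold, per CPU core,
// if the 4096-byte Z register file is split evenly across a rows x cols grid.
static unsigned z_bytes_per_cell(unsigned rows, unsigned cols) {
    return 4096 / (rows * cols);
}

int main(void) {
    printf("8x8 grid: %u bytes of Z per cell\n", z_bytes_per_cell(8, 8)); // 64
    printf("8x2 grid: %u bytes of Z per cell\n", z_bytes_per_cell(8, 2)); // 256
    printf("4x4 grid: %u bytes of Z per cell\n", z_bytes_per_cell(4, 4)); // 256
    return 0;
}
```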

If you had an 8x8 grid of cells, then an entire AMX matrix instruction could be dispatched at once, whereas an 8x2 or 4x4 grid would require that each matrix instruction be split up into four pieces. If considering vector instructions, a 4x4 grid would require that vector instructions be split up into two pieces, whereas no splitting would be required for an 8x2 grid.

Each cell contains an F64 FMA circuit. Said circuit can be split up and used as four separate F32 FMA circuits, or split up in other ways for F16/BF16/integer-multiply-accumulate. That circuit might be split into four pipeline stages (e.g. 4 cycle compute latency), or at least the path from addend input to FMA output is four cycles. To fully saturate this circuit, each cell needs to be tasked with an F64 FMA per cycle (or four F32 FMAs, or ...). Given that the latency is 4 cycles, four distinct ranges of Z are required. If a matrix instruction refers to all 4096 bytes of Z, then multiple CPU cores need to be in play. This is why my performance tables have threads on one axis, and Z Accumulators per thread on the other axis, as both are routes to getting more Z in play, and sometimes you're constrained by Z before you're constrained by FMA circuits. Of course, adding threads can also put more AMX clusters in play (though you're at the whims of the scheduler as to whether your threads end up on different clusters or not).
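As a worked example of that saturation argument (a sketch only; the 8x8 grid and the implied clock are assumptions and estimates, not measurements):

```c
#include <stdio.h>

int main(void) {
    // Hypothetical P AMX cluster: an 8x8 grid of cells, each retiring one FP64
    // FMA per cycle once 4 Z accumulators cover the 4-cycle FMA latency.
    const int cells = 8 * 8;
    const int flops_per_fma = 2;                           // multiply + add
    const double flops_per_cycle = cells * flops_per_fma;  // 128

    // Single-thread M1 Max measurement with >= 4 Z accumulators (table below).
    const double measured_gflops = 370.8;
    printf("Implied AMX clock: %.2f GHz\n", measured_gflops / flops_per_cycle);
    // ~2.90 GHz. Splitting each cell into four F32 FMAs would then predict
    // roughly 4 x 371 = ~1483 GFLOPS, matching the single-thread FP32 numbers
    // further down.
    return 0;
}
```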

The CPU cluster only sends instructions to the AMX cluster, not data. The ALUs and register files on the CPU cores are basically irrelevant to the AMX cluster. Data has to move via memory, and in particular via L2. The interesting question is how much bandwidth there is between the AMX cluster's register files and L2. A secondary question is how quickly CPU cores can enqueue AMX instructions - for M1 that means store ports, of which there are 2 per P core, and 1 per E core. Instruction fusion might let you enqueue two AMX instructions per port per cycle (again note that the dequeue needn't happen in the same clock domain).
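A minimal sketch of that instruction/data split, assuming the AMX_* wrapper macros from this repository's aarch64.h (operand bit layouts are simplified here; a zero operand selects register 0 / offset 0, and the buffers are purely illustrative):

```c
#include <stdint.h>
#include "aarch64.h"  // AMX_SET, AMX_LDX, AMX_LDY, AMX_FMA64, AMX_STZ, AMX_CLR

static double x_buf[8], y_buf[8], z_buf[8];  // 64-byte rows living in memory/L2

void outer_product_once(void) {
    AMX_SET();                  // enable AMX state for this thread
    AMX_LDX((uint64_t)x_buf);   // pull 64 bytes from memory into X register 0
    AMX_LDY((uint64_t)y_buf);   // pull 64 bytes from memory into Y register 0
    AMX_FMA64(0);               // Z += outer product of X and Y (all offsets 0)
    AMX_STZ((uint64_t)z_buf);   // push Z row 0 back out to memory
    AMX_CLR();                  // release AMX state
}
```

The CPU core only enqueues these instructions (through its store ports); the data itself moves between L2 and the AMX cluster's X/Y/Z register files.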

If there's a major change in AMX between M1 and M2, I'd expect it to be in the layout of the E cells in the E AMX cluster, potentially switching from a 4x4 layout to an 8x2 layout (note that this is a logical layout, and needn't directly or exactly correspond to the physical layout). This wouldn't change the number of E cells in the cluster, but would improve performance for vector operations. P AMX may well be 8x8 cells in both M1 and M2, with no major change there (except that said 8x8 might now be 4 copies of 8x2 rather than 4 copies of 4x4). Performance in most other ways would be mostly unchanged.

Thanks for the explanation! As stated at the top of my README, I think my current explanation is misleading - I just haven't had the time to fix it. I still have some questions.

Have you seen the M2 Pro die shot? The P-AMX looks physically very different from M1's, almost double the area. That made me suspect Apple doubled performance with that generation (3.3 TFLOPS FP32 -> 7.3 TFLOPS FP32), and that my M1 Max was seriously behind the M2 Max. Hopefully this is not true.

Screenshot 2023-03-15 at 10 17 16 PM

162 x 75 pixels = 12150 pixels^2, 4 rectangles

Screenshot 2023-03-15 at 10 17 26 PM

120 x 185 pixels x 8/9 = 19733 pixels^2, 8 rectangles

Second, the vector throughput. M2 supports performing a vector instruction on 4 registers at once, while M1 only supports 1 register. Does the M2 have quadruple the vector throughput for non-GEMM-like operations? Or is it just an ISA optimization with no physical performance implications?

Third, clock speed. I know that if you activate more CPU cores, the entire block's max clock speed throttles. Does this throttling affect the AMX too, as if the AMX coprocessor were a fifth CPU core?

Finally, BF16 performance. Is there any inherent reason why FP16 FLOPS cannot exceed 2x FP32 FLOPS? Perhaps the Z registers exceed capacity because too many products are formed. Since FP32 data consumes 2x the space of FP16 data, that would explain the following pattern:

  • M1 Max FP16xFP16=FP16 (13.2 TFLOPS) would be without register bottleneck
  • M1 Max FP16xFP16=FP16 (6.6 TFLOPS) actual
  • M1 Max FP16xFP16=FP32 (13.2 TFLOPS) would be without register bottleneck
  • M1 Max FP16xFP16=FP32 (3.3 TFLOPS) actual = 1/2 of the 16x16=16 rate <- special emphasis on this
  • M1 Max FP32xFP32=FP32 (3.3 TFLOPS) actual

Real-life code usually cannot reach 100% ALU utilization; 50-80% is common. Designing the AMX2 to hard-code 50% FP16 utilization would make sense if effective TFLOPS were:

  • M1 Max FP16xFP16=FP16 (6.6-10.6 TFLOPS) would be without register bottleneck
  • M1 Max FP16xFP16=FP32 (6.6-10.6 TFLOPS) would be without register bottleneck
  • M1 Max FP32xFP32=FP32 (1.7-2.6 TFLOPS) would be without register bottleneck

And as a final touch, simply remove half of the FP16xFP16=FP16 multipliers (M1 design). Redirect FP16xFP16=FP32 to the FP32xFP32=FP32 path and you use fewer transistors (M1 design). I hypothesize that Apple redesigned the A15/M2/M2 Pro (again, please look at the die shot) to (a) fix some Z-register bottleneck and (b) improve effective TFLOPS through ISA improvements. Then it would make practical sense to architect BF16 as 4x FP32, not 2x.
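The Z-capacity side of this can be made concrete with a little arithmetic, consistent with the per-accumulator byte counts in the tables below (this is just bookkeeping, not a claim about the hardware):

```c
#include <stdio.h>

// Bytes of Z consumed by one matfp accumulator: a full outer product of two
// 64-byte operands, with 'in_bytes' per input element and 'out_bytes' per
// accumulator element. The results match the "(N bytes)" labels in the tables.
static void accum(const char *name, int in_bytes, int out_bytes) {
    int lanes = 64 / in_bytes;              // elements per 64-byte operand
    int bytes = lanes * lanes * out_bytes;  // one square accumulator
    printf("%-16s %4d bytes/accumulator, %d fit in 4096 bytes of Z\n",
           name, bytes, 4096 / bytes);
}

int main(void) {
    accum("FP64xFP64=FP64", 8, 8);  //  512 bytes -> 8 accumulators
    accum("FP32xFP32=FP32", 4, 4);  // 1024 bytes -> 4 accumulators
    accum("FP16xFP16=FP16", 2, 2);  // 2048 bytes -> 2 accumulators
    accum("FP16xFP16=FP32", 2, 4);  // 4096 bytes -> 1 accumulator
    return 0;
}
```

Under this hypothesis, the formats that fit fewer independent accumulators in Z would be the ones most exposed to a Z-register bottleneck.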

For matfp FP64xFP64=FP64, I'm seeing the following on M1 Max:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (512 bytes) per thread 92.8 GFLOPS 185.5 GFLOPS 214.0 GFLOPS 285.0 GFLOPS 381.3 GFLOPS 394.0 GFLOPS
2 (1024 bytes) per thread 185.4 GFLOPS 370.7 GFLOPS 337.4 GFLOPS 474.0 GFLOPS 658.9 GFLOPS 610.7 GFLOPS
3 (1536 bytes) per thread 278.0 GFLOPS 556.5 GFLOPS 472.8 GFLOPS 572.9 GFLOPS 646.6 GFLOPS 718.9 GFLOPS
4 (2048 bytes) per thread 370.8 GFLOPS 742.0 GFLOPS 610.4 GFLOPS 730.0 GFLOPS 747.2 GFLOPS 772.0 GFLOPS
5 (2560 bytes) per thread 370.9 GFLOPS 742.1 GFLOPS 610.6 GFLOPS 745.9 GFLOPS 700.9 GFLOPS 731.3 GFLOPS
6 (3072 bytes) per thread 371.0 GFLOPS 741.3 GFLOPS 608.8 GFLOPS 730.9 GFLOPS 727.4 GFLOPS 735.6 GFLOPS
7 (3584 bytes) per thread 370.7 GFLOPS 740.9 GFLOPS 610.4 GFLOPS 752.7 GFLOPS 700.4 GFLOPS 769.8 GFLOPS
8 (4096 bytes) per thread 370.9 GFLOPS 741.2 GFLOPS 651.1 GFLOPS 745.1 GFLOPS 796.2 GFLOPS 780.4 GFLOPS

On M2, the same thing is:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (512 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 246.9 GFLOPS 231.2 GFLOPS 288.9 GFLOPS 318.3 GFLOPS
2 (1024 bytes) per thread 204.6 GFLOPS 252.5 GFLOPS 378.7 GFLOPS 354.2 GFLOPS 384.7 GFLOPS 436.2 GFLOPS
3 (1536 bytes) per thread 306.9 GFLOPS 351.4 GFLOPS 434.3 GFLOPS 421.4 GFLOPS 438.6 GFLOPS 464.0 GFLOPS
4 (2048 bytes) per thread 409.2 GFLOPS 452.2 GFLOPS 468.8 GFLOPS 472.5 GFLOPS 476.5 GFLOPS 479.4 GFLOPS
5 (2560 bytes) per thread 409.2 GFLOPS 452.2 GFLOPS 468.8 GFLOPS 472.6 GFLOPS 476.6 GFLOPS 479.4 GFLOPS
6 (3072 bytes) per thread 409.2 GFLOPS 452.2 GFLOPS 468.8 GFLOPS 472.6 GFLOPS 476.5 GFLOPS 479.4 GFLOPS
7 (3584 bytes) per thread 409.2 GFLOPS 452.1 GFLOPS 468.8 GFLOPS 472.6 GFLOPS 476.6 GFLOPS 479.4 GFLOPS
8 (4096 bytes) per thread 409.2 GFLOPS 452.2 GFLOPS 468.8 GFLOPS 472.5 GFLOPS 476.4 GFLOPS 479.3 GFLOPS

The "1 Thread" column sees a ~10% uplift in performance, consistent with M2 clocks being 10% higher than M1. M1 Max gets ~100% improvement going from 1 thread to 2, which is consistent with M1 Max having two P clusters. The dip when going to 3 threads is likely a consequence of the same thing, with there being no good scheduling of 3 threads onto two AMX clusters. M2 sees only small gains from increasing thread count, as a single thread is able to almost saturate the entire AMX cluster.


Going down to matfp FP32xFP32=FP32, I'm seeing this on M1 Max:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (1024 bytes) per thread 370.9 GFLOPS 741.8 GFLOPS 857.0 GFLOPS 1264.1 GFLOPS 1425.0 GFLOPS 1354.2 GFLOPS
2 (2048 bytes) per thread 742.5 GFLOPS 1484.2 GFLOPS 1349.1 GFLOPS 1800.1 GFLOPS 2249.9 GFLOPS 2389.8 GFLOPS
3 (3072 bytes) per thread 1112.9 GFLOPS 2224.5 GFLOPS 1891.6 GFLOPS 2521.1 GFLOPS 2591.9 GFLOPS 2879.2 GFLOPS
4 (4096 bytes) per thread 1482.9 GFLOPS 2967.4 GFLOPS 2442.5 GFLOPS 3102.8 GFLOPS 2806.2 GFLOPS 3007.9 GFLOPS

And on M2:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (1024 bytes) per thread 409.2 GFLOPS 658.5 GFLOPS 987.5 GFLOPS 925.6 GFLOPS 1155.2 GFLOPS 1272.9 GFLOPS
2 (2048 bytes) per thread 818.4 GFLOPS 1009.8 GFLOPS 1514.6 GFLOPS 1419.8 GFLOPS 1542.8 GFLOPS 1743.8 GFLOPS
3 (3072 bytes) per thread 1227.5 GFLOPS 1405.5 GFLOPS 1737.0 GFLOPS 1687.2 GFLOPS 1755.1 GFLOPS 1856.0 GFLOPS
4 (4096 bytes) per thread 1636.9 GFLOPS 1808.5 GFLOPS 1874.9 GFLOPS 1889.7 GFLOPS 1907.3 GFLOPS 1916.7 GFLOPS

FP32 is getting 4x the performance of FP64. Other than that, all the previous remarks apply here basically verbatim.


Going down to FP16xFP16=FP32, M1 Max:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (4096 bytes) per thread 1483.7 GFLOPS 2967.8 GFLOPS 2272.4 GFLOPS 2457.5 GFLOPS 2808.7 GFLOPS 2586.7 GFLOPS

M2:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (4096 bytes) per thread 1636.8 GFLOPS 1639.4 GFLOPS 1638.6 GFLOPS 1397.1 GFLOPS 1703.9 GFLOPS 1704.0 GFLOPS

Similar performance to FP32xFP32=FP32.

On M2, we can also do BF16xBF16=FP32:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (4096 bytes) per thread 1636.7 GFLOPS 1639.5 GFLOPS 1638.5 GFLOPS 1397.2 GFLOPS 1703.8 GFLOPS 1704.0 GFLOPS

Same performance as FP16.


Going down to FP16xFP16=FP16, M1 Max:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (2048 bytes) per thread 1484.0 GFLOPS 2965.0 GFLOPS 2693.9 GFLOPS 3599.1 GFLOPS 4543.6 GFLOPS 5124.9 GFLOPS
2 (4096 bytes) per thread 2967.1 GFLOPS 5926.1 GFLOPS 4881.3 GFLOPS 6189.6 GFLOPS 6072.7 GFLOPS 5337.5 GFLOPS

M2:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (2048 bytes) per thread 1637.0 GFLOPS 2026.2 GFLOPS 3029.0 GFLOPS 2867.8 GFLOPS 3384.8 GFLOPS 3408.3 GFLOPS
2 (4096 bytes) per thread 3272.8 GFLOPS 3614.3 GFLOPS 3751.6 GFLOPS 2905.1 GFLOPS 3394.9 GFLOPS 3407.8 GFLOPS

Twice the performance of FP32.

On M2, we can also do BF16xBF16=BF16:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (2048 bytes) per thread 1636.8 GFLOPS 2026.1 GFLOPS 3028.9 GFLOPS 2865.6 GFLOPS 3385.4 GFLOPS 3407.3 GFLOPS
2 (4096 bytes) per thread 3273.6 GFLOPS 3613.7 GFLOPS 3750.4 GFLOPS 2915.7 GFLOPS 3395.3 GFLOPS 3406.4 GFLOPS

Same performance as FP16.

For vecfp FP64xFP64=FP64, I'm seeing the following on M1 Max:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (64 bytes) per thread 11.6 GFLOPS 23.2 GFLOPS 26.7 GFLOPS 39.4 GFLOPS 44.3 GFLOPS 52.0 GFLOPS
2 (128 bytes) per thread 23.2 GFLOPS 46.4 GFLOPS 53.5 GFLOPS 71.2 GFLOPS 88.9 GFLOPS 102.9 GFLOPS
3 (192 bytes) per thread 34.7 GFLOPS 69.5 GFLOPS 80.1 GFLOPS 107.0 GFLOPS 125.4 GFLOPS 122.5 GFLOPS
4 (256 bytes) per thread 46.3 GFLOPS 92.7 GFLOPS 106.6 GFLOPS 138.8 GFLOPS 165.0 GFLOPS 147.8 GFLOPS
5 (320 bytes) per thread 58.0 GFLOPS 115.9 GFLOPS 120.3 GFLOPS 147.7 GFLOPS 180.3 GFLOPS 175.4 GFLOPS
6 (384 bytes) per thread 69.5 GFLOPS 138.8 GFLOPS 135.9 GFLOPS 182.6 GFLOPS 192.8 GFLOPS 193.8 GFLOPS
7 (448 bytes) per thread 81.1 GFLOPS 162.2 GFLOPS 152.1 GFLOPS 192.3 GFLOPS 198.8 GFLOPS 202.2 GFLOPS
8 (512 bytes) per thread 92.8 GFLOPS 185.1 GFLOPS 168.4 GFLOPS 201.3 GFLOPS 201.1 GFLOPS 211.2 GFLOPS
9 (576 bytes) per thread 89.0 GFLOPS 177.3 GFLOPS 162.3 GFLOPS 198.6 GFLOPS 199.9 GFLOPS 206.6 GFLOPS
10 (640 bytes) per thread 91.9 GFLOPS 181.3 GFLOPS 165.6 GFLOPS 199.3 GFLOPS 200.7 GFLOPS 205.9 GFLOPS
11 (704 bytes) per thread 91.4 GFLOPS 181.6 GFLOPS 166.4 GFLOPS 202.1 GFLOPS 199.2 GFLOPS 198.5 GFLOPS
12 (768 bytes) per thread 92.7 GFLOPS 185.0 GFLOPS 167.8 GFLOPS 203.2 GFLOPS 201.7 GFLOPS 208.8 GFLOPS
13 (832 bytes) per thread 92.8 GFLOPS 185.3 GFLOPS 168.3 GFLOPS 187.4 GFLOPS 200.6 GFLOPS 208.3 GFLOPS
14 (896 bytes) per thread 92.8 GFLOPS 184.1 GFLOPS 168.6 GFLOPS 203.1 GFLOPS 201.4 GFLOPS 209.6 GFLOPS
15 (960 bytes) per thread 92.7 GFLOPS 185.4 GFLOPS 167.1 GFLOPS 202.7 GFLOPS 200.6 GFLOPS 208.1 GFLOPS
16 (1024 bytes) per thread 92.7 GFLOPS 185.0 GFLOPS 168.6 GFLOPS 202.4 GFLOPS 204.0 GFLOPS 209.0 GFLOPS

And on M2:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (64 bytes) per thread 12.8 GFLOPS 20.6 GFLOPS 30.9 GFLOPS 39.3 GFLOPS 49.1 GFLOPS 58.9 GFLOPS
2 (128 bytes) per thread 25.6 GFLOPS 41.1 GFLOPS 61.7 GFLOPS 78.6 GFLOPS 87.1 GFLOPS 85.2 GFLOPS
3 (192 bytes) per thread 38.4 GFLOPS 61.7 GFLOPS 89.6 GFLOPS 95.2 GFLOPS 108.1 GFLOPS 107.5 GFLOPS
4 (256 bytes) per thread 51.2 GFLOPS 82.3 GFLOPS 118.1 GFLOPS 115.8 GFLOPS 129.3 GFLOPS 132.0 GFLOPS
5 (320 bytes) per thread 63.9 GFLOPS 102.9 GFLOPS 139.6 GFLOPS 132.4 GFLOPS 143.0 GFLOPS 145.3 GFLOPS
6 (384 bytes) per thread 76.7 GFLOPS 113.1 GFLOPS 149.8 GFLOPS 143.2 GFLOPS 151.6 GFLOPS 154.9 GFLOPS
7 (448 bytes) per thread 89.5 GFLOPS 123.5 GFLOPS 150.3 GFLOPS 146.0 GFLOPS 152.0 GFLOPS 154.2 GFLOPS
8 (512 bytes) per thread 102.3 GFLOPS 134.8 GFLOPS 150.8 GFLOPS 150.6 GFLOPS 154.0 GFLOPS 155.1 GFLOPS
9 (576 bytes) per thread 102.3 GFLOPS 135.2 GFLOPS 151.5 GFLOPS 149.6 GFLOPS 153.0 GFLOPS 153.9 GFLOPS
10 (640 bytes) per thread 102.3 GFLOPS 135.3 GFLOPS 151.6 GFLOPS 150.6 GFLOPS 154.0 GFLOPS 154.5 GFLOPS
11 (704 bytes) per thread 102.2 GFLOPS 138.3 GFLOPS 154.9 GFLOPS 150.6 GFLOPS 153.6 GFLOPS 154.8 GFLOPS
12 (768 bytes) per thread 102.3 GFLOPS 135.2 GFLOPS 151.5 GFLOPS 150.7 GFLOPS 154.4 GFLOPS 154.9 GFLOPS
13 (832 bytes) per thread 102.3 GFLOPS 137.8 GFLOPS 154.3 GFLOPS 150.5 GFLOPS 153.6 GFLOPS 155.4 GFLOPS
14 (896 bytes) per thread 102.3 GFLOPS 135.1 GFLOPS 151.5 GFLOPS 150.5 GFLOPS 154.5 GFLOPS 155.4 GFLOPS
15 (960 bytes) per thread 102.3 GFLOPS 137.8 GFLOPS 154.2 GFLOPS 150.6 GFLOPS 153.7 GFLOPS 154.8 GFLOPS
16 (1024 bytes) per thread 102.3 GFLOPS 135.2 GFLOPS 151.5 GFLOPS 150.8 GFLOPS 154.1 GFLOPS 155.4 GFLOPS

No surprises hiding here.

M2 can dispatch 2 iterations at once:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (128 bytes) per thread 25.6 GFLOPS 41.1 GFLOPS 61.7 GFLOPS 78.8 GFLOPS 98.3 GFLOPS 117.6 GFLOPS
2 (256 bytes) per thread 51.1 GFLOPS 82.3 GFLOPS 123.4 GFLOPS 157.5 GFLOPS 174.2 GFLOPS 169.9 GFLOPS
3 (384 bytes) per thread 76.7 GFLOPS 123.4 GFLOPS 156.6 GFLOPS 169.2 GFLOPS 177.8 GFLOPS 176.7 GFLOPS
4 (512 bytes) per thread 102.3 GFLOPS 164.5 GFLOPS 175.6 GFLOPS 176.7 GFLOPS 178.0 GFLOPS 177.6 GFLOPS
5 (640 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.7 GFLOPS 176.7 GFLOPS 179.4 GFLOPS 177.5 GFLOPS
6 (768 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.7 GFLOPS 176.7 GFLOPS 179.3 GFLOPS 177.6 GFLOPS
7 (896 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.7 GFLOPS 176.7 GFLOPS 179.4 GFLOPS 178.1 GFLOPS
8 (1024 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.7 GFLOPS 178.0 GFLOPS 178.6 GFLOPS 177.7 GFLOPS
9 (1152 bytes) per thread 102.3 GFLOPS 164.5 GFLOPS 175.6 GFLOPS 177.9 GFLOPS 179.4 GFLOPS 178.4 GFLOPS
10 (1280 bytes) per thread 102.3 GFLOPS 164.5 GFLOPS 175.8 GFLOPS 177.8 GFLOPS 179.4 GFLOPS 179.2 GFLOPS
11 (1408 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.6 GFLOPS 176.7 GFLOPS 179.3 GFLOPS 178.3 GFLOPS
12 (1536 bytes) per thread 102.3 GFLOPS 164.5 GFLOPS 175.7 GFLOPS 176.7 GFLOPS 179.3 GFLOPS 177.9 GFLOPS
13 (1664 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.7 GFLOPS 176.7 GFLOPS 179.3 GFLOPS 177.6 GFLOPS
14 (1792 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.6 GFLOPS 176.8 GFLOPS 179.3 GFLOPS 177.6 GFLOPS
15 (1920 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.7 GFLOPS 176.6 GFLOPS 179.3 GFLOPS 178.9 GFLOPS
16 (2048 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.7 GFLOPS 176.7 GFLOPS 179.3 GFLOPS 178.7 GFLOPS

No gains to peak FLOPS to be found here, nor for 4 iterations at once:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (256 bytes) per thread 51.1 GFLOPS 82.5 GFLOPS 123.5 GFLOPS 93.2 GFLOPS 126.2 GFLOPS 152.9 GFLOPS
2 (512 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.7 GFLOPS 102.7 GFLOPS 163.2 GFLOPS 159.9 GFLOPS
3 (768 bytes) per thread 102.3 GFLOPS 163.5 GFLOPS 175.8 GFLOPS 102.9 GFLOPS 163.2 GFLOPS 162.3 GFLOPS
4 (1024 bytes) per thread 102.3 GFLOPS 164.5 GFLOPS 175.7 GFLOPS 104.9 GFLOPS 136.8 GFLOPS 163.1 GFLOPS
5 (1280 bytes) per thread 102.3 GFLOPS 163.5 GFLOPS 175.7 GFLOPS 103.9 GFLOPS 161.8 GFLOPS 163.9 GFLOPS
6 (1536 bytes) per thread 102.3 GFLOPS 163.4 GFLOPS 175.7 GFLOPS 102.9 GFLOPS 163.3 GFLOPS 163.4 GFLOPS
7 (1792 bytes) per thread 102.3 GFLOPS 163.5 GFLOPS 175.7 GFLOPS 104.9 GFLOPS 137.1 GFLOPS 158.5 GFLOPS
8 (2048 bytes) per thread 102.3 GFLOPS 164.5 GFLOPS 175.8 GFLOPS 102.8 GFLOPS 163.3 GFLOPS 162.1 GFLOPS
9 (2304 bytes) per thread 102.3 GFLOPS 164.4 GFLOPS 175.7 GFLOPS 104.0 GFLOPS 163.3 GFLOPS 162.9 GFLOPS
10 (2560 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.7 GFLOPS 102.5 GFLOPS 163.3 GFLOPS 163.1 GFLOPS
11 (2816 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.6 GFLOPS 104.0 GFLOPS 163.3 GFLOPS 162.3 GFLOPS
12 (3072 bytes) per thread 102.3 GFLOPS 164.5 GFLOPS 175.7 GFLOPS 103.7 GFLOPS 163.3 GFLOPS 160.8 GFLOPS
13 (3328 bytes) per thread 102.3 GFLOPS 164.5 GFLOPS 175.6 GFLOPS 102.8 GFLOPS 163.2 GFLOPS 162.7 GFLOPS
14 (3584 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 176.1 GFLOPS 103.9 GFLOPS 161.6 GFLOPS 162.9 GFLOPS
15 (3840 bytes) per thread 102.3 GFLOPS 164.4 GFLOPS 175.7 GFLOPS 103.3 GFLOPS 162.8 GFLOPS 162.7 GFLOPS
16 (4096 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.5 GFLOPS 102.6 GFLOPS 163.1 GFLOPS 162.5 GFLOPS

Going down to FP32xFP32=FP32, M1 Max:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (64 bytes) per thread 23.2 GFLOPS 46.4 GFLOPS 53.3 GFLOPS 81.1 GFLOPS 89.0 GFLOPS 104.1 GFLOPS
2 (128 bytes) per thread 46.4 GFLOPS 92.7 GFLOPS 106.5 GFLOPS 141.3 GFLOPS 176.8 GFLOPS 206.5 GFLOPS
3 (192 bytes) per thread 69.6 GFLOPS 139.1 GFLOPS 160.1 GFLOPS 213.3 GFLOPS 250.6 GFLOPS 244.9 GFLOPS
4 (256 bytes) per thread 92.7 GFLOPS 185.4 GFLOPS 214.0 GFLOPS 277.6 GFLOPS 325.5 GFLOPS 298.0 GFLOPS
5 (320 bytes) per thread 115.8 GFLOPS 231.7 GFLOPS 241.0 GFLOPS 321.3 GFLOPS 355.1 GFLOPS 347.7 GFLOPS
6 (384 bytes) per thread 139.0 GFLOPS 277.7 GFLOPS 271.2 GFLOPS 361.7 GFLOPS 387.1 GFLOPS 386.2 GFLOPS
7 (448 bytes) per thread 162.2 GFLOPS 324.2 GFLOPS 299.9 GFLOPS 383.4 GFLOPS 394.0 GFLOPS 400.9 GFLOPS
8 (512 bytes) per thread 185.5 GFLOPS 369.9 GFLOPS 335.8 GFLOPS 392.9 GFLOPS 405.8 GFLOPS 416.0 GFLOPS
9 (576 bytes) per thread 178.0 GFLOPS 353.4 GFLOPS 325.5 GFLOPS 396.9 GFLOPS 398.0 GFLOPS 409.2 GFLOPS
10 (640 bytes) per thread 183.1 GFLOPS 360.6 GFLOPS 335.3 GFLOPS 402.4 GFLOPS 401.2 GFLOPS 417.2 GFLOPS
11 (704 bytes) per thread 183.1 GFLOPS 363.0 GFLOPS 334.2 GFLOPS 403.2 GFLOPS 400.6 GFLOPS 415.8 GFLOPS
12 (768 bytes) per thread 185.2 GFLOPS 370.6 GFLOPS 335.5 GFLOPS 378.5 GFLOPS 397.7 GFLOPS 419.0 GFLOPS
13 (832 bytes) per thread 185.2 GFLOPS 369.4 GFLOPS 336.0 GFLOPS 404.2 GFLOPS 400.9 GFLOPS 414.1 GFLOPS
14 (896 bytes) per thread 185.5 GFLOPS 370.5 GFLOPS 336.4 GFLOPS 406.0 GFLOPS 402.9 GFLOPS 416.4 GFLOPS
15 (960 bytes) per thread 185.5 GFLOPS 370.0 GFLOPS 336.8 GFLOPS 405.7 GFLOPS 402.6 GFLOPS 409.6 GFLOPS
16 (1024 bytes) per thread 185.4 GFLOPS 370.4 GFLOPS 336.3 GFLOPS 406.0 GFLOPS 399.7 GFLOPS 405.3 GFLOPS

M2:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (64 bytes) per thread 25.6 GFLOPS 41.2 GFLOPS 61.7 GFLOPS 78.7 GFLOPS 98.4 GFLOPS 117.7 GFLOPS
2 (128 bytes) per thread 51.2 GFLOPS 82.3 GFLOPS 123.5 GFLOPS 157.7 GFLOPS 174.1 GFLOPS 170.4 GFLOPS
3 (192 bytes) per thread 76.7 GFLOPS 123.4 GFLOPS 179.5 GFLOPS 191.0 GFLOPS 216.9 GFLOPS 215.1 GFLOPS
4 (256 bytes) per thread 102.2 GFLOPS 164.6 GFLOPS 237.1 GFLOPS 231.8 GFLOPS 258.3 GFLOPS 263.3 GFLOPS
5 (320 bytes) per thread 127.8 GFLOPS 205.7 GFLOPS 279.1 GFLOPS 264.8 GFLOPS 285.7 GFLOPS 289.5 GFLOPS
6 (384 bytes) per thread 153.5 GFLOPS 226.0 GFLOPS 299.5 GFLOPS 286.6 GFLOPS 300.5 GFLOPS 308.3 GFLOPS
7 (448 bytes) per thread 179.0 GFLOPS 246.6 GFLOPS 300.6 GFLOPS 291.4 GFLOPS 302.4 GFLOPS 306.2 GFLOPS
8 (512 bytes) per thread 204.4 GFLOPS 269.7 GFLOPS 301.6 GFLOPS 299.4 GFLOPS 309.2 GFLOPS 310.4 GFLOPS
9 (576 bytes) per thread 204.6 GFLOPS 270.5 GFLOPS 302.9 GFLOPS 297.9 GFLOPS 304.7 GFLOPS 307.3 GFLOPS
10 (640 bytes) per thread 204.7 GFLOPS 270.3 GFLOPS 303.0 GFLOPS 300.2 GFLOPS 306.9 GFLOPS 308.9 GFLOPS
11 (704 bytes) per thread 204.6 GFLOPS 276.5 GFLOPS 308.4 GFLOPS 302.1 GFLOPS 305.8 GFLOPS 307.5 GFLOPS
12 (768 bytes) per thread 204.5 GFLOPS 270.5 GFLOPS 302.9 GFLOPS 299.9 GFLOPS 304.2 GFLOPS 307.5 GFLOPS
13 (832 bytes) per thread 204.6 GFLOPS 275.3 GFLOPS 307.9 GFLOPS 299.8 GFLOPS 306.4 GFLOPS 307.4 GFLOPS
14 (896 bytes) per thread 204.2 GFLOPS 270.5 GFLOPS 302.9 GFLOPS 299.6 GFLOPS 306.9 GFLOPS 310.6 GFLOPS
15 (960 bytes) per thread 204.5 GFLOPS 275.7 GFLOPS 308.5 GFLOPS 299.5 GFLOPS 305.5 GFLOPS 307.4 GFLOPS
16 (1024 bytes) per thread 204.6 GFLOPS 270.5 GFLOPS 302.8 GFLOPS 299.8 GFLOPS 306.9 GFLOPS 307.4 GFLOPS

M2 two at a time:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (128 bytes) per thread 51.2 GFLOPS 82.3 GFLOPS 123.5 GFLOPS 157.7 GFLOPS 196.4 GFLOPS 235.4 GFLOPS
2 (256 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 246.7 GFLOPS 316.2 GFLOPS 346.2 GFLOPS 339.7 GFLOPS
3 (384 bytes) per thread 153.4 GFLOPS 246.9 GFLOPS 313.0 GFLOPS 338.8 GFLOPS 355.6 GFLOPS 348.5 GFLOPS
4 (512 bytes) per thread 204.6 GFLOPS 328.8 GFLOPS 351.4 GFLOPS 355.6 GFLOPS 356.1 GFLOPS 349.9 GFLOPS
5 (640 bytes) per thread 204.5 GFLOPS 329.0 GFLOPS 351.2 GFLOPS 355.9 GFLOPS 356.1 GFLOPS 349.0 GFLOPS
6 (768 bytes) per thread 204.6 GFLOPS 329.3 GFLOPS 351.2 GFLOPS 350.8 GFLOPS 353.9 GFLOPS 356.3 GFLOPS
7 (896 bytes) per thread 204.5 GFLOPS 329.0 GFLOPS 346.6 GFLOPS 350.7 GFLOPS 356.1 GFLOPS 357.7 GFLOPS
8 (1024 bytes) per thread 204.7 GFLOPS 329.3 GFLOPS 351.4 GFLOPS 353.6 GFLOPS 358.2 GFLOPS 354.9 GFLOPS
9 (1152 bytes) per thread 204.6 GFLOPS 328.8 GFLOPS 351.2 GFLOPS 346.2 GFLOPS 358.4 GFLOPS 349.1 GFLOPS
10 (1280 bytes) per thread 204.5 GFLOPS 329.3 GFLOPS 351.6 GFLOPS 351.0 GFLOPS 354.9 GFLOPS 355.4 GFLOPS
11 (1408 bytes) per thread 204.4 GFLOPS 328.9 GFLOPS 351.3 GFLOPS 350.8 GFLOPS 358.3 GFLOPS 348.7 GFLOPS
12 (1536 bytes) per thread 204.6 GFLOPS 329.2 GFLOPS 351.4 GFLOPS 350.9 GFLOPS 356.0 GFLOPS 355.6 GFLOPS
13 (1664 bytes) per thread 204.4 GFLOPS 329.1 GFLOPS 351.1 GFLOPS 350.8 GFLOPS 356.1 GFLOPS 356.4 GFLOPS
14 (1792 bytes) per thread 204.7 GFLOPS 329.0 GFLOPS 351.2 GFLOPS 350.9 GFLOPS 356.2 GFLOPS 349.7 GFLOPS
15 (1920 bytes) per thread 204.5 GFLOPS 329.3 GFLOPS 351.2 GFLOPS 355.1 GFLOPS 356.0 GFLOPS 356.7 GFLOPS
16 (2048 bytes) per thread 204.6 GFLOPS 328.7 GFLOPS 351.0 GFLOPS 350.9 GFLOPS 356.1 GFLOPS 348.8 GFLOPS

And four at a time:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (256 bytes) per thread 102.3 GFLOPS 164.9 GFLOPS 247.0 GFLOPS 187.5 GFLOPS 250.9 GFLOPS 305.7 GFLOPS
2 (512 bytes) per thread 204.5 GFLOPS 326.9 GFLOPS 351.4 GFLOPS 208.2 GFLOPS 326.6 GFLOPS 323.5 GFLOPS
3 (768 bytes) per thread 204.6 GFLOPS 326.8 GFLOPS 351.4 GFLOPS 211.6 GFLOPS 320.5 GFLOPS 324.9 GFLOPS
4 (1024 bytes) per thread 204.6 GFLOPS 329.3 GFLOPS 351.3 GFLOPS 205.7 GFLOPS 326.6 GFLOPS 325.6 GFLOPS
5 (1280 bytes) per thread 204.6 GFLOPS 329.1 GFLOPS 351.2 GFLOPS 205.4 GFLOPS 326.5 GFLOPS 322.4 GFLOPS
6 (1536 bytes) per thread 204.6 GFLOPS 328.9 GFLOPS 351.4 GFLOPS 208.7 GFLOPS 318.2 GFLOPS 322.7 GFLOPS
7 (1792 bytes) per thread 204.6 GFLOPS 329.2 GFLOPS 351.4 GFLOPS 205.9 GFLOPS 326.4 GFLOPS 324.0 GFLOPS
8 (2048 bytes) per thread 204.5 GFLOPS 329.1 GFLOPS 351.4 GFLOPS 208.1 GFLOPS 326.5 GFLOPS 321.3 GFLOPS
9 (2304 bytes) per thread 204.6 GFLOPS 329.2 GFLOPS 351.4 GFLOPS 207.3 GFLOPS 323.6 GFLOPS 326.9 GFLOPS
10 (2560 bytes) per thread 204.6 GFLOPS 329.2 GFLOPS 351.4 GFLOPS 206.2 GFLOPS 320.8 GFLOPS 326.7 GFLOPS
11 (2816 bytes) per thread 204.5 GFLOPS 326.9 GFLOPS 346.5 GFLOPS 208.2 GFLOPS 326.3 GFLOPS 321.5 GFLOPS
12 (3072 bytes) per thread 204.6 GFLOPS 329.1 GFLOPS 351.1 GFLOPS 205.9 GFLOPS 326.5 GFLOPS 326.4 GFLOPS
13 (3328 bytes) per thread 204.5 GFLOPS 329.3 GFLOPS 351.4 GFLOPS 206.6 GFLOPS 323.5 GFLOPS 323.4 GFLOPS
14 (3584 bytes) per thread 204.6 GFLOPS 329.2 GFLOPS 351.4 GFLOPS 205.5 GFLOPS 326.4 GFLOPS 323.2 GFLOPS
15 (3840 bytes) per thread 204.6 GFLOPS 329.2 GFLOPS 351.0 GFLOPS 205.8 GFLOPS 326.5 GFLOPS 322.5 GFLOPS
16 (4096 bytes) per thread 204.6 GFLOPS 327.1 GFLOPS 351.2 GFLOPS 208.1 GFLOPS 324.0 GFLOPS 321.7 GFLOPS

FP16xFP16=FP16, M1 Max:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (64 bytes) per thread 46.3 GFLOPS 92.7 GFLOPS 106.7 GFLOPS 142.6 GFLOPS 189.5 GFLOPS 207.8 GFLOPS
2 (128 bytes) per thread 92.8 GFLOPS 185.2 GFLOPS 209.9 GFLOPS 312.7 GFLOPS 346.7 GFLOPS 408.2 GFLOPS
3 (192 bytes) per thread 139.2 GFLOPS 277.7 GFLOPS 317.5 GFLOPS 424.2 GFLOPS 507.9 GFLOPS 482.5 GFLOPS
4 (256 bytes) per thread 185.5 GFLOPS 370.4 GFLOPS 408.3 GFLOPS 552.6 GFLOPS 654.9 GFLOPS 589.2 GFLOPS
5 (320 bytes) per thread 231.5 GFLOPS 463.4 GFLOPS 479.3 GFLOPS 630.2 GFLOPS 711.5 GFLOPS 710.2 GFLOPS
6 (384 bytes) per thread 277.8 GFLOPS 556.3 GFLOPS 540.5 GFLOPS 721.1 GFLOPS 813.0 GFLOPS 734.5 GFLOPS
7 (448 bytes) per thread 324.8 GFLOPS 647.3 GFLOPS 607.7 GFLOPS 769.6 GFLOPS 789.4 GFLOPS 802.9 GFLOPS
8 (512 bytes) per thread 371.0 GFLOPS 739.4 GFLOPS 672.2 GFLOPS 810.8 GFLOPS 824.8 GFLOPS 813.7 GFLOPS
9 (576 bytes) per thread 354.5 GFLOPS 712.8 GFLOPS 652.8 GFLOPS 792.6 GFLOPS 796.6 GFLOPS 787.1 GFLOPS
10 (640 bytes) per thread 365.5 GFLOPS 717.8 GFLOPS 662.3 GFLOPS 798.5 GFLOPS 788.0 GFLOPS 828.2 GFLOPS
11 (704 bytes) per thread 365.2 GFLOPS 731.5 GFLOPS 670.2 GFLOPS 799.1 GFLOPS 806.8 GFLOPS 825.8 GFLOPS
12 (768 bytes) per thread 371.3 GFLOPS 740.5 GFLOPS 670.5 GFLOPS 797.7 GFLOPS 804.7 GFLOPS 830.4 GFLOPS
13 (832 bytes) per thread 370.7 GFLOPS 741.2 GFLOPS 671.3 GFLOPS 812.3 GFLOPS 801.4 GFLOPS 826.5 GFLOPS
14 (896 bytes) per thread 371.1 GFLOPS 740.0 GFLOPS 671.7 GFLOPS 806.5 GFLOPS 804.5 GFLOPS 835.0 GFLOPS
15 (960 bytes) per thread 370.6 GFLOPS 740.6 GFLOPS 671.1 GFLOPS 805.0 GFLOPS 804.6 GFLOPS 831.8 GFLOPS
16 (1024 bytes) per thread 369.1 GFLOPS 737.8 GFLOPS 816.6 GFLOPS 808.5 GFLOPS 807.5 GFLOPS 818.8 GFLOPS

M2:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (64 bytes) per thread 51.2 GFLOPS 82.3 GFLOPS 123.4 GFLOPS 157.6 GFLOPS 197.4 GFLOPS 234.4 GFLOPS
2 (128 bytes) per thread 102.3 GFLOPS 164.5 GFLOPS 246.9 GFLOPS 314.6 GFLOPS 350.2 GFLOPS 339.0 GFLOPS
3 (192 bytes) per thread 153.5 GFLOPS 246.8 GFLOPS 360.2 GFLOPS 380.8 GFLOPS 434.3 GFLOPS 424.7 GFLOPS
4 (256 bytes) per thread 204.6 GFLOPS 329.3 GFLOPS 471.7 GFLOPS 463.3 GFLOPS 519.7 GFLOPS 522.5 GFLOPS
5 (320 bytes) per thread 255.8 GFLOPS 411.0 GFLOPS 557.3 GFLOPS 524.1 GFLOPS 568.5 GFLOPS 572.6 GFLOPS
6 (384 bytes) per thread 306.8 GFLOPS 451.8 GFLOPS 599.1 GFLOPS 571.5 GFLOPS 607.2 GFLOPS 607.2 GFLOPS
7 (448 bytes) per thread 358.2 GFLOPS 493.7 GFLOPS 601.1 GFLOPS 580.0 GFLOPS 591.6 GFLOPS 610.4 GFLOPS
8 (512 bytes) per thread 409.2 GFLOPS 538.5 GFLOPS 603.2 GFLOPS 594.4 GFLOPS 608.4 GFLOPS 620.5 GFLOPS
9 (576 bytes) per thread 408.9 GFLOPS 540.7 GFLOPS 605.6 GFLOPS 583.0 GFLOPS 604.4 GFLOPS 617.9 GFLOPS
10 (640 bytes) per thread 408.8 GFLOPS 540.9 GFLOPS 605.4 GFLOPS 594.5 GFLOPS 614.2 GFLOPS 616.3 GFLOPS
11 (704 bytes) per thread 409.1 GFLOPS 553.3 GFLOPS 614.4 GFLOPS 603.7 GFLOPS 606.4 GFLOPS 614.8 GFLOPS
12 (768 bytes) per thread 409.2 GFLOPS 540.5 GFLOPS 605.8 GFLOPS 599.9 GFLOPS 608.6 GFLOPS 620.3 GFLOPS
13 (832 bytes) per thread 409.4 GFLOPS 550.2 GFLOPS 614.5 GFLOPS 594.7 GFLOPS 606.0 GFLOPS 608.0 GFLOPS
14 (896 bytes) per thread 408.7 GFLOPS 538.7 GFLOPS 606.1 GFLOPS 594.9 GFLOPS 608.5 GFLOPS 618.3 GFLOPS
15 (960 bytes) per thread 409.1 GFLOPS 551.0 GFLOPS 614.5 GFLOPS 594.3 GFLOPS 615.0 GFLOPS 607.6 GFLOPS
16 (1024 bytes) per thread 408.8 GFLOPS 541.0 GFLOPS 605.3 GFLOPS 594.7 GFLOPS 608.9 GFLOPS 621.0 GFLOPS

M2, two at a time:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (128 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 246.8 GFLOPS 314.4 GFLOPS 396.7 GFLOPS 472.1 GFLOPS
2 (256 bytes) per thread 204.6 GFLOPS 329.3 GFLOPS 493.8 GFLOPS 632.7 GFLOPS 696.4 GFLOPS 677.2 GFLOPS
3 (384 bytes) per thread 306.5 GFLOPS 493.9 GFLOPS 626.3 GFLOPS 680.0 GFLOPS 710.9 GFLOPS 702.7 GFLOPS
4 (512 bytes) per thread 409.0 GFLOPS 657.8 GFLOPS 701.5 GFLOPS 702.0 GFLOPS 712.3 GFLOPS 712.9 GFLOPS
5 (640 bytes) per thread 409.2 GFLOPS 658.7 GFLOPS 702.3 GFLOPS 708.4 GFLOPS 712.3 GFLOPS 714.1 GFLOPS
6 (768 bytes) per thread 409.2 GFLOPS 658.1 GFLOPS 702.8 GFLOPS 701.5 GFLOPS 712.4 GFLOPS 698.1 GFLOPS
7 (896 bytes) per thread 409.3 GFLOPS 658.4 GFLOPS 702.8 GFLOPS 710.9 GFLOPS 712.2 GFLOPS 709.4 GFLOPS
8 (1024 bytes) per thread 408.9 GFLOPS 658.3 GFLOPS 702.6 GFLOPS 700.4 GFLOPS 712.4 GFLOPS 712.6 GFLOPS
9 (1152 bytes) per thread 409.1 GFLOPS 658.4 GFLOPS 702.8 GFLOPS 701.3 GFLOPS 712.2 GFLOPS 698.1 GFLOPS
10 (1280 bytes) per thread 409.1 GFLOPS 658.5 GFLOPS 702.7 GFLOPS 701.4 GFLOPS 712.3 GFLOPS 697.7 GFLOPS
11 (1408 bytes) per thread 409.4 GFLOPS 658.6 GFLOPS 702.5 GFLOPS 702.0 GFLOPS 712.1 GFLOPS 698.6 GFLOPS
12 (1536 bytes) per thread 409.2 GFLOPS 658.4 GFLOPS 702.7 GFLOPS 704.1 GFLOPS 712.1 GFLOPS 697.9 GFLOPS
13 (1664 bytes) per thread 409.2 GFLOPS 657.2 GFLOPS 702.6 GFLOPS 701.9 GFLOPS 712.3 GFLOPS 698.7 GFLOPS
14 (1792 bytes) per thread 409.1 GFLOPS 658.3 GFLOPS 702.2 GFLOPS 711.5 GFLOPS 712.0 GFLOPS 710.3 GFLOPS
15 (1920 bytes) per thread 409.0 GFLOPS 657.4 GFLOPS 702.4 GFLOPS 701.7 GFLOPS 712.3 GFLOPS 714.4 GFLOPS
16 (2048 bytes) per thread 409.0 GFLOPS 658.2 GFLOPS 702.7 GFLOPS 707.3 GFLOPS 707.0 GFLOPS 715.4 GFLOPS

M2, four at a time:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (256 bytes) per thread 204.6 GFLOPS 329.1 GFLOPS 493.7 GFLOPS 403.3 GFLOPS 502.0 GFLOPS 608.1 GFLOPS
2 (512 bytes) per thread 409.1 GFLOPS 657.9 GFLOPS 702.7 GFLOPS 516.3 GFLOPS 636.5 GFLOPS 629.4 GFLOPS
3 (768 bytes) per thread 409.2 GFLOPS 658.1 GFLOPS 702.8 GFLOPS 510.1 GFLOPS 652.8 GFLOPS 642.5 GFLOPS
4 (1024 bytes) per thread 409.3 GFLOPS 658.3 GFLOPS 702.3 GFLOPS 504.4 GFLOPS 652.8 GFLOPS 644.4 GFLOPS
5 (1280 bytes) per thread 409.1 GFLOPS 658.4 GFLOPS 702.6 GFLOPS 515.9 GFLOPS 653.2 GFLOPS 648.9 GFLOPS
6 (1536 bytes) per thread 409.0 GFLOPS 658.4 GFLOPS 702.6 GFLOPS 516.0 GFLOPS 652.1 GFLOPS 642.9 GFLOPS
7 (1792 bytes) per thread 409.2 GFLOPS 658.2 GFLOPS 702.5 GFLOPS 510.2 GFLOPS 466.7 GFLOPS 643.1 GFLOPS
8 (2048 bytes) per thread 409.1 GFLOPS 658.1 GFLOPS 702.2 GFLOPS 516.1 GFLOPS 651.8 GFLOPS 643.0 GFLOPS
9 (2304 bytes) per thread 409.3 GFLOPS 657.7 GFLOPS 702.2 GFLOPS 501.7 GFLOPS 619.4 GFLOPS 646.4 GFLOPS
10 (2560 bytes) per thread 409.2 GFLOPS 658.7 GFLOPS 702.8 GFLOPS 516.2 GFLOPS 652.1 GFLOPS 635.1 GFLOPS
11 (2816 bytes) per thread 409.3 GFLOPS 650.2 GFLOPS 702.6 GFLOPS 504.3 GFLOPS 652.9 GFLOPS 638.9 GFLOPS
12 (3072 bytes) per thread 409.0 GFLOPS 658.4 GFLOPS 701.7 GFLOPS 515.3 GFLOPS 653.2 GFLOPS 643.3 GFLOPS
13 (3328 bytes) per thread 409.2 GFLOPS 650.1 GFLOPS 702.6 GFLOPS 516.2 GFLOPS 652.5 GFLOPS 636.3 GFLOPS
14 (3584 bytes) per thread 409.3 GFLOPS 649.5 GFLOPS 703.0 GFLOPS 516.0 GFLOPS 652.6 GFLOPS 627.6 GFLOPS
15 (3840 bytes) per thread 409.4 GFLOPS 658.4 GFLOPS 702.8 GFLOPS 516.2 GFLOPS 652.6 GFLOPS 640.9 GFLOPS
16 (4096 bytes) per thread 409.1 GFLOPS 658.3 GFLOPS 702.9 GFLOPS 504.0 GFLOPS 652.5 GFLOPS 638.1 GFLOPS

BF16xBF16=BF16, M2:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (64 bytes) per thread 51.2 GFLOPS 82.3 GFLOPS 123.5 GFLOPS 157.6 GFLOPS 197.1 GFLOPS 236.2 GFLOPS
2 (128 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 246.7 GFLOPS 314.0 GFLOPS 348.5 GFLOPS 337.2 GFLOPS
3 (192 bytes) per thread 153.3 GFLOPS 246.8 GFLOPS 358.2 GFLOPS 380.4 GFLOPS 431.5 GFLOPS 426.8 GFLOPS
4 (256 bytes) per thread 204.5 GFLOPS 329.2 GFLOPS 473.3 GFLOPS 464.6 GFLOPS 516.2 GFLOPS 532.0 GFLOPS
5 (320 bytes) per thread 255.3 GFLOPS 410.9 GFLOPS 558.2 GFLOPS 528.9 GFLOPS 570.6 GFLOPS 572.0 GFLOPS
6 (384 bytes) per thread 306.8 GFLOPS 452.0 GFLOPS 599.3 GFLOPS 572.4 GFLOPS 605.5 GFLOPS 607.1 GFLOPS
7 (448 bytes) per thread 357.9 GFLOPS 494.2 GFLOPS 601.6 GFLOPS 579.4 GFLOPS 601.6 GFLOPS 613.0 GFLOPS
8 (512 bytes) per thread 409.4 GFLOPS 538.6 GFLOPS 602.6 GFLOPS 594.5 GFLOPS 617.9 GFLOPS 616.6 GFLOPS
9 (576 bytes) per thread 409.2 GFLOPS 540.3 GFLOPS 606.1 GFLOPS 600.5 GFLOPS 604.3 GFLOPS 605.5 GFLOPS
10 (640 bytes) per thread 408.9 GFLOPS 539.8 GFLOPS 605.7 GFLOPS 594.9 GFLOPS 608.8 GFLOPS 611.5 GFLOPS
11 (704 bytes) per thread 408.7 GFLOPS 553.3 GFLOPS 614.7 GFLOPS 595.3 GFLOPS 606.2 GFLOPS 618.3 GFLOPS
12 (768 bytes) per thread 409.2 GFLOPS 540.9 GFLOPS 605.6 GFLOPS 598.7 GFLOPS 611.0 GFLOPS 608.8 GFLOPS
13 (832 bytes) per thread 409.2 GFLOPS 550.6 GFLOPS 614.4 GFLOPS 599.6 GFLOPS 611.2 GFLOPS 608.7 GFLOPS
14 (896 bytes) per thread 409.4 GFLOPS 540.5 GFLOPS 606.1 GFLOPS 594.9 GFLOPS 608.4 GFLOPS 612.6 GFLOPS
15 (960 bytes) per thread 408.7 GFLOPS 551.0 GFLOPS 614.7 GFLOPS 593.0 GFLOPS 607.4 GFLOPS 607.5 GFLOPS
16 (1024 bytes) per thread 409.0 GFLOPS 540.6 GFLOPS 605.6 GFLOPS 594.6 GFLOPS 616.6 GFLOPS 608.4 GFLOPS

Two at a time:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (128 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 246.9 GFLOPS 315.3 GFLOPS 392.8 GFLOPS 468.2 GFLOPS
2 (256 bytes) per thread 204.6 GFLOPS 329.1 GFLOPS 493.9 GFLOPS 629.2 GFLOPS 691.9 GFLOPS 681.5 GFLOPS
3 (384 bytes) per thread 306.9 GFLOPS 493.3 GFLOPS 626.1 GFLOPS 677.4 GFLOPS 711.0 GFLOPS 699.5 GFLOPS
4 (512 bytes) per thread 409.5 GFLOPS 658.3 GFLOPS 702.9 GFLOPS 707.9 GFLOPS 712.2 GFLOPS 697.7 GFLOPS
5 (640 bytes) per thread 409.3 GFLOPS 657.7 GFLOPS 702.5 GFLOPS 710.4 GFLOPS 712.1 GFLOPS 708.3 GFLOPS
6 (768 bytes) per thread 409.2 GFLOPS 657.7 GFLOPS 702.5 GFLOPS 702.1 GFLOPS 712.2 GFLOPS 697.2 GFLOPS
7 (896 bytes) per thread 409.0 GFLOPS 658.3 GFLOPS 702.6 GFLOPS 705.6 GFLOPS 712.2 GFLOPS 712.9 GFLOPS
8 (1024 bytes) per thread 409.1 GFLOPS 657.3 GFLOPS 702.3 GFLOPS 701.8 GFLOPS 712.0 GFLOPS 697.7 GFLOPS
9 (1152 bytes) per thread 409.1 GFLOPS 658.4 GFLOPS 702.7 GFLOPS 701.5 GFLOPS 712.1 GFLOPS 697.5 GFLOPS
10 (1280 bytes) per thread 409.0 GFLOPS 657.5 GFLOPS 702.7 GFLOPS 711.4 GFLOPS 712.3 GFLOPS 713.0 GFLOPS
11 (1408 bytes) per thread 409.0 GFLOPS 658.5 GFLOPS 702.5 GFLOPS 701.5 GFLOPS 712.4 GFLOPS 714.4 GFLOPS
12 (1536 bytes) per thread 409.8 GFLOPS 657.8 GFLOPS 702.9 GFLOPS 702.0 GFLOPS 712.2 GFLOPS 696.9 GFLOPS
13 (1664 bytes) per thread 409.1 GFLOPS 658.5 GFLOPS 701.4 GFLOPS 702.0 GFLOPS 712.3 GFLOPS 698.0 GFLOPS
14 (1792 bytes) per thread 409.1 GFLOPS 657.6 GFLOPS 702.9 GFLOPS 701.3 GFLOPS 712.3 GFLOPS 709.1 GFLOPS
15 (1920 bytes) per thread 409.2 GFLOPS 658.5 GFLOPS 702.8 GFLOPS 707.6 GFLOPS 712.0 GFLOPS 711.8 GFLOPS
16 (2048 bytes) per thread 409.1 GFLOPS 658.5 GFLOPS 702.7 GFLOPS 701.6 GFLOPS 712.4 GFLOPS 708.5 GFLOPS

Four at a time:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (256 bytes) per thread 204.5 GFLOPS 330.6 GFLOPS 494.1 GFLOPS 403.3 GFLOPS 502.1 GFLOPS 604.4 GFLOPS
2 (512 bytes) per thread 409.2 GFLOPS 649.7 GFLOPS 702.5 GFLOPS 508.3 GFLOPS 652.9 GFLOPS 638.7 GFLOPS
3 (768 bytes) per thread 408.8 GFLOPS 658.3 GFLOPS 702.6 GFLOPS 510.1 GFLOPS 630.0 GFLOPS 655.2 GFLOPS
4 (1024 bytes) per thread 409.2 GFLOPS 658.3 GFLOPS 702.6 GFLOPS 515.7 GFLOPS 637.4 GFLOPS 634.8 GFLOPS
5 (1280 bytes) per thread 409.2 GFLOPS 658.0 GFLOPS 702.9 GFLOPS 507.9 GFLOPS 651.6 GFLOPS 643.5 GFLOPS
6 (1536 bytes) per thread 409.3 GFLOPS 657.4 GFLOPS 702.7 GFLOPS 513.9 GFLOPS 641.4 GFLOPS 631.0 GFLOPS
7 (1792 bytes) per thread 409.2 GFLOPS 649.3 GFLOPS 702.8 GFLOPS 505.2 GFLOPS 652.3 GFLOPS 634.2 GFLOPS
8 (2048 bytes) per thread 409.1 GFLOPS 657.8 GFLOPS 702.4 GFLOPS 516.0 GFLOPS 629.6 GFLOPS 655.2 GFLOPS
9 (2304 bytes) per thread 409.2 GFLOPS 658.0 GFLOPS 702.3 GFLOPS 509.5 GFLOPS 652.2 GFLOPS 639.5 GFLOPS
10 (2560 bytes) per thread 409.1 GFLOPS 658.2 GFLOPS 702.7 GFLOPS 507.3 GFLOPS 652.0 GFLOPS 646.9 GFLOPS
11 (2816 bytes) per thread 409.2 GFLOPS 657.9 GFLOPS 702.6 GFLOPS 508.8 GFLOPS 651.9 GFLOPS 637.6 GFLOPS
12 (3072 bytes) per thread 409.0 GFLOPS 650.2 GFLOPS 702.6 GFLOPS 516.0 GFLOPS 653.0 GFLOPS 623.4 GFLOPS
13 (3328 bytes) per thread 409.2 GFLOPS 658.7 GFLOPS 702.7 GFLOPS 515.3 GFLOPS 652.6 GFLOPS 637.0 GFLOPS
14 (3584 bytes) per thread 409.6 GFLOPS 657.9 GFLOPS 702.8 GFLOPS 537.2 GFLOPS 622.0 GFLOPS 648.5 GFLOPS
15 (3840 bytes) per thread 409.4 GFLOPS 657.7 GFLOPS 702.8 GFLOPS 515.8 GFLOPS 653.2 GFLOPS 632.6 GFLOPS
16 (4096 bytes) per thread 409.2 GFLOPS 657.9 GFLOPS 702.9 GFLOPS 516.1 GFLOPS 652.9 GFLOPS 634.1 GFLOPS

Conclusions from all that:

  • No major performance improvement in an AMX cluster between M1 and M2
  • No major performance improvement from BF16 compared to FP16
  • No major performance improvement from using the two-at-a-time or four-at-a-time vector instruction modes

Thanks for the data! I guess that if FP32 becomes enough of a bottleneck in your calculations that you're considering BF16, it's best to just use the GPU instead of the AMX. I also realized that GPT-4 can help me work out GPU FP64 emulation, so there's less need to use the AMX.

I am curious about performance of interleaved complex multiplication. M2 can oversubscribe the AMX without changing maximum FLOPS. Could your benchmarks test a small sequence of instructions that reads the interleaved numbers from memory and tries to achieve maximum FLOPS?* I'll still test Accelerate BLAS but this would provide a more direct theoretical benchmark. Apple has to have provided some kind of real-world improvement from this ISA change. Maybe it's fixing underutilization during complex multiplication.

*My hypothesis: M1 Max should never exceed ~37.5% theoretical FLOPS, while M2 should reach ~75% maximum FLOPS.
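For reference, here is a minimal scalar version of the operation I have in mind (the function name and layout are illustrative, not taken from the benchmark repo):

```c
#include <stddef.h>

// Interleaved (re, im) pairs in memory, accumulating c += a * b element-wise.
// Each pair costs 4 multiplies and 4 adds/subtracts, which is the FLOP count
// an AMX complex-multiply benchmark would be measured against.
void cmul_accumulate(const float *a, const float *b, float *c, size_t n_pairs) {
    for (size_t i = 0; i < n_pairs; i++) {
        float ar = a[2 * i], ai = a[2 * i + 1];
        float br = b[2 * i], bi = b[2 * i + 1];
        c[2 * i]     += ar * br - ai * bi;  // real part
        c[2 * i + 1] += ar * bi + ai * br;  // imaginary part
    }
}
```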

Also disappointing: AMX vector throughput is less than CPU NEON vector throughput. Perhaps that's why Apple's BLAS library consistently underperforms OpenBLAS by a factor of two. Instead of using the NEON units in a multithreaded setting, the CPUs all fight for the same AMX block, which has lower theoretical FLOPS. The GPU would not have this limitation; its theoretical vector FLOPS actually exceed its theoretical matrix FLOPS.

For my purposes, I have the following FP64 throughputs:

  • CPU NEON: 388.5 GFLOPS
  • CPU AMX: 209.0 GFLOPS
  • GPU eFP64: 156.1-379.2 GFLOPS (1:28-68 from FP32)

The takeaway: when using any accelerator, your vector FP64 throughput is going to decrease, by approximately a factor of 2. The AMX is not better than the GPU in this regard. It would mostly help in the rare case of multiplying two large FP64 matrices. I recall that the two-stage eigendecomposition algorithm by Dongarra is technically O(n^3) in computational complexity, but that's because it's ~n layers of O(n^2) computations. There would be little opportunity to multiply two massive matrices even with the bulge-chasing stage. This principle probably also applies to the rest of linear algebra - which may be why OpenBLAS is faster than Accelerate for LU decomposition, or anything besides GEMM.

"Apple has to have provided some kind of real-world improvement from this ISA change."

It looks like four-at-a-time gets (up to) double the throughput when any broadcast mode other than mode 0 is used (provided you're not bottlenecked on Z accumulators). This suggests another bottleneck in the equations: bandwidth out of the (seemingly combined) X/Y register file. Mode 0 requires two loads from the register file per iteration, whereas the other modes need two loads on the first iteration but can then get away with only one load per iteration for subsequent iterations.
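To put rough numbers on that (illustrative bookkeeping only; it just counts 64-byte X/Y operand reads per four-at-a-time instruction):

```c
#include <stdio.h>

int main(void) {
    const double mode0 = 2.0 * 4;        // X and Y read on each of the 4 iterations
    const double bcast_first = 2 + 3;    // both operands once, then reuse one
    const double bcast_steady = 1.0 * 4; // if the reused operand stays resident
    printf("mode 0: %.0f reads/instruction\n", mode0);
    printf("broadcast modes: %.0f-%.0f reads/instruction (%.1f-%.1fx fewer)\n",
           bcast_steady, bcast_first, mode0 / bcast_first, mode0 / bcast_steady);
    return 0;
}
```

With a fixed read bandwidth out of the X/Y register file, that ratio is enough to account for the (up to) 2x difference between the two tables below.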

As a concrete example, vecfp F32xF32=F32 four-at-a-time mode 0:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (256 bytes) per thread 102.5 GFLOPS 164.8 GFLOPS 247.1 GFLOPS 286.1 GFLOPS 264.1 GFLOPS 298.3 GFLOPS
2 (512 bytes) per thread 204.9 GFLOPS 329.5 GFLOPS 351.2 GFLOPS 206.3 GFLOPS 326.6 GFLOPS 324.6 GFLOPS
3 (768 bytes) per thread 205.0 GFLOPS 329.3 GFLOPS 351.4 GFLOPS 211.5 GFLOPS 273.0 GFLOPS 327.9 GFLOPS
4 (1024 bytes) per thread 205.0 GFLOPS 327.1 GFLOPS 351.5 GFLOPS 205.9 GFLOPS 323.3 GFLOPS 324.2 GFLOPS
5 (1280 bytes) per thread 204.9 GFLOPS 329.5 GFLOPS 351.3 GFLOPS 206.1 GFLOPS 326.7 GFLOPS 316.7 GFLOPS
6 (1536 bytes) per thread 204.9 GFLOPS 329.5 GFLOPS 351.5 GFLOPS 208.1 GFLOPS 326.5 GFLOPS 325.3 GFLOPS
7 (1792 bytes) per thread 204.9 GFLOPS 328.6 GFLOPS 351.5 GFLOPS 208.1 GFLOPS 326.5 GFLOPS 325.1 GFLOPS
8 (2048 bytes) per thread 205.0 GFLOPS 327.2 GFLOPS 351.5 GFLOPS 206.1 GFLOPS 320.9 GFLOPS 324.5 GFLOPS
9 (2304 bytes) per thread 205.0 GFLOPS 329.5 GFLOPS 351.5 GFLOPS 209.2 GFLOPS 318.4 GFLOPS 325.3 GFLOPS
10 (2560 bytes) per thread 205.0 GFLOPS 329.5 GFLOPS 351.5 GFLOPS 205.4 GFLOPS 322.5 GFLOPS 325.1 GFLOPS
11 (2816 bytes) per thread 205.0 GFLOPS 329.4 GFLOPS 351.4 GFLOPS 206.7 GFLOPS 326.6 GFLOPS 326.9 GFLOPS
12 (3072 bytes) per thread 204.9 GFLOPS 327.2 GFLOPS 351.4 GFLOPS 208.1 GFLOPS 323.8 GFLOPS 327.9 GFLOPS
13 (3328 bytes) per thread 204.9 GFLOPS 329.4 GFLOPS 351.5 GFLOPS 205.6 GFLOPS 326.6 GFLOPS 326.6 GFLOPS
14 (3584 bytes) per thread 205.0 GFLOPS 327.2 GFLOPS 351.5 GFLOPS 205.8 GFLOPS 326.6 GFLOPS 324.9 GFLOPS
15 (3840 bytes) per thread 205.0 GFLOPS 329.4 GFLOPS 351.3 GFLOPS 206.4 GFLOPS 325.6 GFLOPS 323.4 GFLOPS
16 (4096 bytes) per thread 205.0 GFLOPS 329.4 GFLOPS 351.4 GFLOPS 206.9 GFLOPS 326.5 GFLOPS 325.6 GFLOPS

Versus any other broadcast mode:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (256 bytes) per thread 102.5 GFLOPS 164.7 GFLOPS 247.1 GFLOPS 286.8 GFLOPS 357.7 GFLOPS 368.6 GFLOPS
2 (512 bytes) per thread 205.0 GFLOPS 329.5 GFLOPS 494.3 GFLOPS 464.8 GFLOPS 502.7 GFLOPS 540.1 GFLOPS
3 (768 bytes) per thread 307.4 GFLOPS 410.9 GFLOPS 528.5 GFLOPS 505.9 GFLOPS 530.3 GFLOPS 549.7 GFLOPS
4 (1024 bytes) per thread 409.9 GFLOPS 505.2 GFLOPS 548.0 GFLOPS 541.9 GFLOPS 551.1 GFLOPS 559.2 GFLOPS
5 (1280 bytes) per thread 409.7 GFLOPS 505.3 GFLOPS 547.7 GFLOPS 541.9 GFLOPS 554.1 GFLOPS 553.6 GFLOPS
6 (1536 bytes) per thread 409.8 GFLOPS 505.3 GFLOPS 547.9 GFLOPS 542.0 GFLOPS 550.2 GFLOPS 559.3 GFLOPS
7 (1792 bytes) per thread 409.9 GFLOPS 505.0 GFLOPS 547.8 GFLOPS 541.9 GFLOPS 550.4 GFLOPS 559.5 GFLOPS
8 (2048 bytes) per thread 409.8 GFLOPS 505.3 GFLOPS 547.8 GFLOPS 542.1 GFLOPS 550.5 GFLOPS 559.4 GFLOPS
9 (2304 bytes) per thread 409.9 GFLOPS 505.5 GFLOPS 547.1 GFLOPS 541.8 GFLOPS 550.5 GFLOPS 554.9 GFLOPS
10 (2560 bytes) per thread 409.9 GFLOPS 505.3 GFLOPS 548.1 GFLOPS 541.8 GFLOPS 550.4 GFLOPS 559.3 GFLOPS
11 (2816 bytes) per thread 409.9 GFLOPS 505.4 GFLOPS 547.8 GFLOPS 540.9 GFLOPS 545.5 GFLOPS 557.9 GFLOPS
12 (3072 bytes) per thread 409.9 GFLOPS 505.2 GFLOPS 548.1 GFLOPS 542.0 GFLOPS 550.4 GFLOPS 559.1 GFLOPS
13 (3328 bytes) per thread 409.9 GFLOPS 505.4 GFLOPS 547.8 GFLOPS 542.6 GFLOPS 550.4 GFLOPS 549.7 GFLOPS
14 (3584 bytes) per thread 409.8 GFLOPS 505.4 GFLOPS 547.9 GFLOPS 545.1 GFLOPS 550.4 GFLOPS 559.0 GFLOPS
15 (3840 bytes) per thread 410.0 GFLOPS 505.3 GFLOPS 547.8 GFLOPS 544.8 GFLOPS 550.4 GFLOPS 558.7 GFLOPS
16 (4096 bytes) per thread 409.8 GFLOPS 505.2 GFLOPS 547.9 GFLOPS 541.9 GFLOPS 550.5 GFLOPS 555.1 GFLOPS