corsix / amx

Apple AMX Instruction Set

A15/M2 Performance

philipturner opened this issue

After analyzing the die shots and speculating about performance, I believe I've found a major change to the AMX architecture. Would you mind reading through the README of amx-benchmarks and helping me test the hypothesis? You don't need to rent an M2 from the cloud; I can test on my A15.

permalink for the hypothesis in question

I'm afraid I don't follow exactly what your hypothesis is. I'm also not convinced by a number of things in your README.

I'll try to describe where I'm at. First, some terminology:

  • P CPU cluster: 1-4 performance CPU cores (I think my P CPU cluster is what you're calling a P block?)
  • E CPU cluster: 1-4 efficiency CPU cores
  • P AMX cluster: an AMX coprocessor associated with a P CPU cluster
  • E AMX cluster: an AMX coprocessor associated with an E CPU cluster
  • AMX cell: if you imagine the entire AMX computation grid as a 64-byte by 64-byte grid, a cell is an aligned 8-byte by 8-byte sub-grid, in which FMA/MAC operations are performed (a cell is approximately a PE in many Apple patents)

There's a 1:1 correspondence between CPU clusters and AMX clusters, and on die shots you'll see them colocated, along with a bunch of L2. Note that the clock speed of the AMX cluster needn't equal the clock speed of the associated CPU cluster.

To satisfy the needs of the ISA, each AMX cluster needs to contain:

  • X register file, which is at least 512 bytes per CPU core in the associated CPU cluster
  • Y register file, which is at least 512 bytes per CPU core in the associated CPU cluster
  • Z register file, which is at least 4096 bytes per CPU core in the associated CPU cluster
  • Some number of AMX cells, in some grid arrangement. This could be an 8x8 grid, or 8x2, or 4x4, or anything else of the form 1/2/4/8 by 1/2/4/8 (possibly what you're calling an AMX block is what I'm calling an 8x1 grid of AMX cells, meaning 8 blocks gives you an 8x8 grid, or 2 blocks gives you an 8x2 grid?)
  • Some control logic

The X and Y register files (or a combined X and Y register file) will be separate from the AMX cells, but the Z register file is likely split up and colocated in the AMX cells. The more AMX cells there are in a cluster, the smaller the amount of Z register file in each cell. If you had an 8x8 grid of cells, then you'd only need 64 bytes of Z (per CPU core) per cell. If you had an 8x2 grid of cells, then you'd need 256 bytes of Z (per CPU core) per cell. In particular, this could manifest itself as E AMX clusters having fewer cells than P AMX clusters, but each E cell being slightly larger than a P cell.
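To put concrete numbers on that split, here's a quick sketch (plain C; the grid shapes are just the illustrative examples from above):

```c
#include <stdio.h>

// Bytes of Z register file each AMX cell would need to hold, per CPU core,
// if the 4096-byte Z register file is split evenly across a rows x cols grid.
static unsigned z_bytes_per_cell(unsigned rows, unsigned cols) {
    return 4096 / (rows * cols);
}

int main(void) {
    printf("8x8 grid: %u bytes of Z per cell\n", z_bytes_per_cell(8, 8)); // 64
    printf("8x2 grid: %u bytes of Z per cell\n", z_bytes_per_cell(8, 2)); // 256
    printf("4x4 grid: %u bytes of Z per cell\n", z_bytes_per_cell(4, 4)); // 256
    return 0;
}
```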

If you had an 8x8 grid of cells, then an entire AMX matrix instruction could be dispatched at once, whereas an 8x2 or 4x4 grid would require that each matrix instruction be split up into four pieces. If considering vector instructions, a 4x4 grid would require that vector instructions be split up into two pieces, whereas no splitting would be required for an 8x2 grid.

Each cell contains an F64 FMA circuit. Said circuit can be split up and used as four separate F32 FMA circuits, or split up in other ways for F16/BF16/integer-multiply-accumulate. That circuit might be split into four pipeline stages (e.g. 4 cycle compute latency), or at least the path from addend input to FMA output is four cycles. To fully saturate this circuit, each cell needs to be tasked with an F64 FMA per cycle (or four F32 FMAs, or ...). Given that the latency is 4 cycles, four distinct ranges of Z are required. If a matrix instruction refers to all 4096 bytes of Z, then multiple CPU cores need to be in play. This is why my performance tables have threads on one axis, and Z Accumulators per thread on the other axis, as both are routes to getting more Z in play, and sometimes you're constrained by Z before you're constrained by FMA circuits. Of course, adding threads can also put more AMX clusters in play (though you're at the whims of the scheduler as to whether your threads end up on different clusters or not).
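As a worked example of that saturation argument (a sketch only; the 8x8 grid and the implied clock are assumptions and estimates, not measurements):

```c
#include <stdio.h>

int main(void) {
    // Hypothetical P AMX cluster: an 8x8 grid of cells, each retiring one FP64
    // FMA per cycle once 4 Z accumulators cover the 4-cycle FMA latency.
    const int cells = 8 * 8;
    const int flops_per_fma = 2;                           // multiply + add
    const double flops_per_cycle = cells * flops_per_fma;  // 128

    // Single-thread M1 Max measurement with >= 4 Z accumulators (table below).
    const double measured_gflops = 370.8;
    printf("Implied AMX clock: %.2f GHz\n", measured_gflops / flops_per_cycle);
    // ~2.90 GHz. Splitting each cell into four F32 FMAs would then predict
    // roughly 4 x 371 = ~1483 GFLOPS, matching the single-thread FP32 numbers
    // further down.
    return 0;
}
```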

The CPU cluster only sends instructions to the AMX cluster, not data. The ALUs and register files on the CPU cores are basically irrelevant to the AMX cluster. Data has to move via memory, and in particular via L2. The interesting question is how much bandwidth there is between the AMX cluster's register files and L2. A secondary question is how quickly CPU cores can enqueue AMX instructions - for M1 that means store ports, of which there are 2 per P core, and 1 per E core. Instruction fusion might let you enqueue two AMX instructions per port per cycle (again note that the dequeue needn't happen in the same clock domain).
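A minimal sketch of that instruction/data split, assuming the AMX_* wrapper macros from this repository's aarch64.h (operand bit layouts are simplified here; a zero operand selects register 0 / offset 0, and the buffers are purely illustrative):

```c
#include <stdint.h>
#include "aarch64.h"  // AMX_SET, AMX_LDX, AMX_LDY, AMX_FMA64, AMX_STZ, AMX_CLR

static double x_buf[8], y_buf[8], z_buf[8];  // 64-byte rows living in memory/L2

void outer_product_once(void) {
    AMX_SET();                  // enable AMX state for this thread
    AMX_LDX((uint64_t)x_buf);   // pull 64 bytes from memory into X register 0
    AMX_LDY((uint64_t)y_buf);   // pull 64 bytes from memory into Y register 0
    AMX_FMA64(0);               // Z += outer product of X and Y (all offsets 0)
    AMX_STZ((uint64_t)z_buf);   // push Z row 0 back out to memory
    AMX_CLR();                  // release AMX state
}
```

The CPU core only enqueues these instructions (through its store ports); the data itself moves between L2 and the AMX cluster's X/Y/Z register files.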

If there's a major change in AMX between M1 and M2, I'd expect it to be in the layout of the E cells in the E AMX cluster, potentially switching from a 4x4 layout to an 8x2 layout (note that this is a logical layout, and needn't directly or exactly correspond to the physical layout). This wouldn't change the number of E cells in the cluster, but would improve performance for vector operations. P AMX may well be 8x8 cells in both M1 and M2, with no major change there (except that said 8x8 might now be 4 copies of 8x2 rather than 4 copies of 4x4). Performance in most other ways would be mostly unchanged.

Thanks for the explanation! As stated at the top of my README, I think my current explanation is misleading - I just haven't had the time to fix it. I still have some questions.

Have you seen the M2 Pro die shot? The P-AMX looks physically very different from M1's, almost double the area. That made me suspect Apple doubled performance with that generation (3.3 TFLOPS FP32 -> 7.3 TFLOPS FP32), and that my M1 Max was seriously behind the M2 Max. Hopefully this is not true.

Screenshot 2023-03-15 at 10 17 16 PM

162 x 75 pixels = 12150 pixels^2, 4 rectangles

Screenshot 2023-03-15 at 10 17 26 PM

120 x 185 pixels x 8/9 = 19733 pixels^2, 8 rectangles

Second, the vector throughput. M2 supports performing a vector instruction on 4 registers at once, while M1 only supports 1 register. Does the M2 have quadruple the vector throughput for non-GEMM-like operations? Or is it just an ISA optimization with no physical performance implications?

Third, clock speed. I know that if you activate more CPU cores, the entire block's max clock speed throttles. Does this throttling affect the AMX too, as if the AMX coprocessor were a fifth CPU core?

Finally, BF16 performance. Is there any inherent reason why FP16 FLOPS cannot exceed 2x FP32 FLOPS? Perhaps the Z registers exceed capacity because too many products are formed. Since FP32 data consumes 2x the space of FP16 data, that would explain the following pattern:

  • M1 Max FP16xFP16=FP16 (13.2 TFLOPS) would be without register bottleneck
  • M1 Max FP16xFP16=FP16 (6.6 TFLOPS) actual
  • M1 Max FP16xFP16=FP32 (13.2 TFLOPS) would be without register bottleneck
  • M1 Max FP16xFP16=FP32 (3.3 TFLOPS) actual = 1/2 of the 16x16=16 rate <- special emphasis on this
  • M1 Max FP32xFP32=FP32 (3.3 TFLOPS) actual

Real-life code usually cannot reach 100% ALU utilization; 50-80% is common. Designing the AMX2 to hard-code 50% FP16 utilization would make sense if effective TFLOPS were:

  • M1 Max FP16xFP16=FP16 (6.6-10.6 TFLOPS) would be without register bottleneck
  • M1 Max FP16xFP16=FP32 (6.6-10.6 TFLOPS) would be without register bottleneck
  • M1 Max FP32xFP32=FP32 (1.7-2.6 TFLOPS) would be without register bottleneck

And as a final touch, simply remove half of the FP16xFP16=FP16 multipliers (M1 design). Redirect FP16xFP16=FP32 to the FP32xFP32=FP32 path and you use fewer transistors (M1 design). I hypothesize that Apple redesigned the A15/M2/M2 Pro (again, please look at the die shot) to (a) fix some Z-register bottleneck and (b) improve effective TFLOPS through ISA improvements. Then it would make practical sense to architect BF16 as 4x FP32, not 2x.
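The Z-capacity side of this can be made concrete with a little arithmetic, consistent with the per-accumulator byte counts in the tables below (this is just bookkeeping, not a claim about the hardware):

```c
#include <stdio.h>

// Bytes of Z consumed by one matfp accumulator: a full outer product of two
// 64-byte operands, with 'in_bytes' per input element and 'out_bytes' per
// accumulator element. The results match the "(N bytes)" labels in the tables.
static void accum(const char *name, int in_bytes, int out_bytes) {
    int lanes = 64 / in_bytes;              // elements per 64-byte operand
    int bytes = lanes * lanes * out_bytes;  // one square accumulator
    printf("%-16s %4d bytes/accumulator, %d fit in 4096 bytes of Z\n",
           name, bytes, 4096 / bytes);
}

int main(void) {
    accum("FP64xFP64=FP64", 8, 8);  //  512 bytes -> 8 accumulators
    accum("FP32xFP32=FP32", 4, 4);  // 1024 bytes -> 4 accumulators
    accum("FP16xFP16=FP16", 2, 2);  // 2048 bytes -> 2 accumulators
    accum("FP16xFP16=FP32", 2, 4);  // 4096 bytes -> 1 accumulator
    return 0;
}
```

Under this hypothesis, the formats that fit fewer independent accumulators in Z would be the ones most exposed to a Z-register bottleneck.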

For matfp FP64xFP64=FP64, I'm seeing the following on M1 Max:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (512 bytes) per thread 92.8 GFLOPS 185.5 GFLOPS 214.0 GFLOPS 285.0 GFLOPS 381.3 GFLOPS 394.0 GFLOPS
2 (1024 bytes) per thread 185.4 GFLOPS 370.7 GFLOPS 337.4 GFLOPS 474.0 GFLOPS 658.9 GFLOPS 610.7 GFLOPS
3 (1536 bytes) per thread 278.0 GFLOPS 556.5 GFLOPS 472.8 GFLOPS 572.9 GFLOPS 646.6 GFLOPS 718.9 GFLOPS
4 (2048 bytes) per thread 370.8 GFLOPS 742.0 GFLOPS 610.4 GFLOPS 730.0 GFLOPS 747.2 GFLOPS 772.0 GFLOPS
5 (2560 bytes) per thread 370.9 GFLOPS 742.1 GFLOPS 610.6 GFLOPS 745.9 GFLOPS 700.9 GFLOPS 731.3 GFLOPS
6 (3072 bytes) per thread 371.0 GFLOPS 741.3 GFLOPS 608.8 GFLOPS 730.9 GFLOPS 727.4 GFLOPS 735.6 GFLOPS
7 (3584 bytes) per thread 370.7 GFLOPS 740.9 GFLOPS 610.4 GFLOPS 752.7 GFLOPS 700.4 GFLOPS 769.8 GFLOPS
8 (4096 bytes) per thread 370.9 GFLOPS 741.2 GFLOPS 651.1 GFLOPS 745.1 GFLOPS 796.2 GFLOPS 780.4 GFLOPS

On M2, the same thing is:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (512 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 246.9 GFLOPS 231.2 GFLOPS 288.9 GFLOPS 318.3 GFLOPS
2 (1024 bytes) per thread 204.6 GFLOPS 252.5 GFLOPS 378.7 GFLOPS 354.2 GFLOPS 384.7 GFLOPS 436.2 GFLOPS
3 (1536 bytes) per thread 306.9 GFLOPS 351.4 GFLOPS 434.3 GFLOPS 421.4 GFLOPS 438.6 GFLOPS 464.0 GFLOPS
4 (2048 bytes) per thread 409.2 GFLOPS 452.2 GFLOPS 468.8 GFLOPS 472.5 GFLOPS 476.5 GFLOPS 479.4 GFLOPS
5 (2560 bytes) per thread 409.2 GFLOPS 452.2 GFLOPS 468.8 GFLOPS 472.6 GFLOPS 476.6 GFLOPS 479.4 GFLOPS
6 (3072 bytes) per thread 409.2 GFLOPS 452.2 GFLOPS 468.8 GFLOPS 472.6 GFLOPS 476.5 GFLOPS 479.4 GFLOPS
7 (3584 bytes) per thread 409.2 GFLOPS 452.1 GFLOPS 468.8 GFLOPS 472.6 GFLOPS 476.6 GFLOPS 479.4 GFLOPS
8 (4096 bytes) per thread 409.2 GFLOPS 452.2 GFLOPS 468.8 GFLOPS 472.5 GFLOPS 476.4 GFLOPS 479.3 GFLOPS

The "1 Thread" column sees a ~10% uplift in performance, consistent with M2 clocks being 10% higher than M1. M1 Max gets ~100% improvement going from 1 thread to 2, which is consistent with M1 Max having two P clusters. The dip when going to 3 threads is likely a consequence of the same thing, with there being no good scheduling of 3 threads onto two AMX clusters. M2 sees only small gains from increasing thread count, as a single thread is able to almost saturate the entire AMX cluster.


Going down to matfp FP32xFP32=FP32, I'm seeing this on M1 Max:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (1024 bytes) per thread 370.9 GFLOPS 741.8 GFLOPS 857.0 GFLOPS 1264.1 GFLOPS 1425.0 GFLOPS 1354.2 GFLOPS
2 (2048 bytes) per thread 742.5 GFLOPS 1484.2 GFLOPS 1349.1 GFLOPS 1800.1 GFLOPS 2249.9 GFLOPS 2389.8 GFLOPS
3 (3072 bytes) per thread 1112.9 GFLOPS 2224.5 GFLOPS 1891.6 GFLOPS 2521.1 GFLOPS 2591.9 GFLOPS 2879.2 GFLOPS
4 (4096 bytes) per thread 1482.9 GFLOPS 2967.4 GFLOPS 2442.5 GFLOPS 3102.8 GFLOPS 2806.2 GFLOPS 3007.9 GFLOPS

And on M2:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (1024 bytes) per thread 409.2 GFLOPS 658.5 GFLOPS 987.5 GFLOPS 925.6 GFLOPS 1155.2 GFLOPS 1272.9 GFLOPS
2 (2048 bytes) per thread 818.4 GFLOPS 1009.8 GFLOPS 1514.6 GFLOPS 1419.8 GFLOPS 1542.8 GFLOPS 1743.8 GFLOPS
3 (3072 bytes) per thread 1227.5 GFLOPS 1405.5 GFLOPS 1737.0 GFLOPS 1687.2 GFLOPS 1755.1 GFLOPS 1856.0 GFLOPS
4 (4096 bytes) per thread 1636.9 GFLOPS 1808.5 GFLOPS 1874.9 GFLOPS 1889.7 GFLOPS 1907.3 GFLOPS 1916.7 GFLOPS

FP32 is getting 4x the performance of FP64. Other than that, all the previous remarks apply here basically verbatim.


Going down to FP16xFP16=FP32, M1 Max:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (4096 bytes) per thread 1483.7 GFLOPS 2967.8 GFLOPS 2272.4 GFLOPS 2457.5 GFLOPS 2808.7 GFLOPS 2586.7 GFLOPS

M2:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (4096 bytes) per thread 1636.8 GFLOPS 1639.4 GFLOPS 1638.6 GFLOPS 1397.1 GFLOPS 1703.9 GFLOPS 1704.0 GFLOPS

Similar performance to FP32xFP32=FP32.

On M2, we can also do BF16xBF16=FP32:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (4096 bytes) per thread 1636.7 GFLOPS 1639.5 GFLOPS 1638.5 GFLOPS 1397.2 GFLOPS 1703.8 GFLOPS 1704.0 GFLOPS

Same performance as FP16.


Going down to FP16xFP16=FP16, M1 Max:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (2048 bytes) per thread 1484.0 GFLOPS 2965.0 GFLOPS 2693.9 GFLOPS 3599.1 GFLOPS 4543.6 GFLOPS 5124.9 GFLOPS
2 (4096 bytes) per thread 2967.1 GFLOPS 5926.1 GFLOPS 4881.3 GFLOPS 6189.6 GFLOPS 6072.7 GFLOPS 5337.5 GFLOPS

M2:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (2048 bytes) per thread 1637.0 GFLOPS 2026.2 GFLOPS 3029.0 GFLOPS 2867.8 GFLOPS 3384.8 GFLOPS 3408.3 GFLOPS
2 (4096 bytes) per thread 3272.8 GFLOPS 3614.3 GFLOPS 3751.6 GFLOPS 2905.1 GFLOPS 3394.9 GFLOPS 3407.8 GFLOPS

Twice the performance of FP32.

On M2, we can also do BF16xBF16=BF16:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (2048 bytes) per thread 1636.8 GFLOPS 2026.1 GFLOPS 3028.9 GFLOPS 2865.6 GFLOPS 3385.4 GFLOPS 3407.3 GFLOPS
2 (4096 bytes) per thread 3273.6 GFLOPS 3613.7 GFLOPS 3750.4 GFLOPS 2915.7 GFLOPS 3395.3 GFLOPS 3406.4 GFLOPS

Same performance as FP16.

For vecfp FP64xFP64=FP64, I'm seeing the following on M1 Max:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (64 bytes) per thread 11.6 GFLOPS 23.2 GFLOPS 26.7 GFLOPS 39.4 GFLOPS 44.3 GFLOPS 52.0 GFLOPS
2 (128 bytes) per thread 23.2 GFLOPS 46.4 GFLOPS 53.5 GFLOPS 71.2 GFLOPS 88.9 GFLOPS 102.9 GFLOPS
3 (192 bytes) per thread 34.7 GFLOPS 69.5 GFLOPS 80.1 GFLOPS 107.0 GFLOPS 125.4 GFLOPS 122.5 GFLOPS
4 (256 bytes) per thread 46.3 GFLOPS 92.7 GFLOPS 106.6 GFLOPS 138.8 GFLOPS 165.0 GFLOPS 147.8 GFLOPS
5 (320 bytes) per thread 58.0 GFLOPS 115.9 GFLOPS 120.3 GFLOPS 147.7 GFLOPS 180.3 GFLOPS 175.4 GFLOPS
6 (384 bytes) per thread 69.5 GFLOPS 138.8 GFLOPS 135.9 GFLOPS 182.6 GFLOPS 192.8 GFLOPS 193.8 GFLOPS
7 (448 bytes) per thread 81.1 GFLOPS 162.2 GFLOPS 152.1 GFLOPS 192.3 GFLOPS 198.8 GFLOPS 202.2 GFLOPS
8 (512 bytes) per thread 92.8 GFLOPS 185.1 GFLOPS 168.4 GFLOPS 201.3 GFLOPS 201.1 GFLOPS 211.2 GFLOPS
9 (576 bytes) per thread 89.0 GFLOPS 177.3 GFLOPS 162.3 GFLOPS 198.6 GFLOPS 199.9 GFLOPS 206.6 GFLOPS
10 (640 bytes) per thread 91.9 GFLOPS 181.3 GFLOPS 165.6 GFLOPS 199.3 GFLOPS 200.7 GFLOPS 205.9 GFLOPS
11 (704 bytes) per thread 91.4 GFLOPS 181.6 GFLOPS 166.4 GFLOPS 202.1 GFLOPS 199.2 GFLOPS 198.5 GFLOPS
12 (768 bytes) per thread 92.7 GFLOPS 185.0 GFLOPS 167.8 GFLOPS 203.2 GFLOPS 201.7 GFLOPS 208.8 GFLOPS
13 (832 bytes) per thread 92.8 GFLOPS 185.3 GFLOPS 168.3 GFLOPS 187.4 GFLOPS 200.6 GFLOPS 208.3 GFLOPS
14 (896 bytes) per thread 92.8 GFLOPS 184.1 GFLOPS 168.6 GFLOPS 203.1 GFLOPS 201.4 GFLOPS 209.6 GFLOPS
15 (960 bytes) per thread 92.7 GFLOPS 185.4 GFLOPS 167.1 GFLOPS 202.7 GFLOPS 200.6 GFLOPS 208.1 GFLOPS
16 (1024 bytes) per thread 92.7 GFLOPS 185.0 GFLOPS 168.6 GFLOPS 202.4 GFLOPS 204.0 GFLOPS 209.0 GFLOPS

And on M2:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (64 bytes) per thread 12.8 GFLOPS 20.6 GFLOPS 30.9 GFLOPS 39.3 GFLOPS 49.1 GFLOPS 58.9 GFLOPS
2 (128 bytes) per thread 25.6 GFLOPS 41.1 GFLOPS 61.7 GFLOPS 78.6 GFLOPS 87.1 GFLOPS 85.2 GFLOPS
3 (192 bytes) per thread 38.4 GFLOPS 61.7 GFLOPS 89.6 GFLOPS 95.2 GFLOPS 108.1 GFLOPS 107.5 GFLOPS
4 (256 bytes) per thread 51.2 GFLOPS 82.3 GFLOPS 118.1 GFLOPS 115.8 GFLOPS 129.3 GFLOPS 132.0 GFLOPS
5 (320 bytes) per thread 63.9 GFLOPS 102.9 GFLOPS 139.6 GFLOPS 132.4 GFLOPS 143.0 GFLOPS 145.3 GFLOPS
6 (384 bytes) per thread 76.7 GFLOPS 113.1 GFLOPS 149.8 GFLOPS 143.2 GFLOPS 151.6 GFLOPS 154.9 GFLOPS
7 (448 bytes) per thread 89.5 GFLOPS 123.5 GFLOPS 150.3 GFLOPS 146.0 GFLOPS 152.0 GFLOPS 154.2 GFLOPS
8 (512 bytes) per thread 102.3 GFLOPS 134.8 GFLOPS 150.8 GFLOPS 150.6 GFLOPS 154.0 GFLOPS 155.1 GFLOPS
9 (576 bytes) per thread 102.3 GFLOPS 135.2 GFLOPS 151.5 GFLOPS 149.6 GFLOPS 153.0 GFLOPS 153.9 GFLOPS
10 (640 bytes) per thread 102.3 GFLOPS 135.3 GFLOPS 151.6 GFLOPS 150.6 GFLOPS 154.0 GFLOPS 154.5 GFLOPS
11 (704 bytes) per thread 102.2 GFLOPS 138.3 GFLOPS 154.9 GFLOPS 150.6 GFLOPS 153.6 GFLOPS 154.8 GFLOPS
12 (768 bytes) per thread 102.3 GFLOPS 135.2 GFLOPS 151.5 GFLOPS 150.7 GFLOPS 154.4 GFLOPS 154.9 GFLOPS
13 (832 bytes) per thread 102.3 GFLOPS 137.8 GFLOPS 154.3 GFLOPS 150.5 GFLOPS 153.6 GFLOPS 155.4 GFLOPS
14 (896 bytes) per thread 102.3 GFLOPS 135.1 GFLOPS 151.5 GFLOPS 150.5 GFLOPS 154.5 GFLOPS 155.4 GFLOPS
15 (960 bytes) per thread 102.3 GFLOPS 137.8 GFLOPS 154.2 GFLOPS 150.6 GFLOPS 153.7 GFLOPS 154.8 GFLOPS
16 (1024 bytes) per thread 102.3 GFLOPS 135.2 GFLOPS 151.5 GFLOPS 150.8 GFLOPS 154.1 GFLOPS 155.4 GFLOPS

No surprises hiding here.

M2 can dispatch 2 iterations at once:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (128 bytes) per thread 25.6 GFLOPS 41.1 GFLOPS 61.7 GFLOPS 78.8 GFLOPS 98.3 GFLOPS 117.6 GFLOPS
2 (256 bytes) per thread 51.1 GFLOPS 82.3 GFLOPS 123.4 GFLOPS 157.5 GFLOPS 174.2 GFLOPS 169.9 GFLOPS
3 (384 bytes) per thread 76.7 GFLOPS 123.4 GFLOPS 156.6 GFLOPS 169.2 GFLOPS 177.8 GFLOPS 176.7 GFLOPS
4 (512 bytes) per thread 102.3 GFLOPS 164.5 GFLOPS 175.6 GFLOPS 176.7 GFLOPS 178.0 GFLOPS 177.6 GFLOPS
5 (640 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.7 GFLOPS 176.7 GFLOPS 179.4 GFLOPS 177.5 GFLOPS
6 (768 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.7 GFLOPS 176.7 GFLOPS 179.3 GFLOPS 177.6 GFLOPS
7 (896 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.7 GFLOPS 176.7 GFLOPS 179.4 GFLOPS 178.1 GFLOPS
8 (1024 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.7 GFLOPS 178.0 GFLOPS 178.6 GFLOPS 177.7 GFLOPS
9 (1152 bytes) per thread 102.3 GFLOPS 164.5 GFLOPS 175.6 GFLOPS 177.9 GFLOPS 179.4 GFLOPS 178.4 GFLOPS
10 (1280 bytes) per thread 102.3 GFLOPS 164.5 GFLOPS 175.8 GFLOPS 177.8 GFLOPS 179.4 GFLOPS 179.2 GFLOPS
11 (1408 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.6 GFLOPS 176.7 GFLOPS 179.3 GFLOPS 178.3 GFLOPS
12 (1536 bytes) per thread 102.3 GFLOPS 164.5 GFLOPS 175.7 GFLOPS 176.7 GFLOPS 179.3 GFLOPS 177.9 GFLOPS
13 (1664 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.7 GFLOPS 176.7 GFLOPS 179.3 GFLOPS 177.6 GFLOPS
14 (1792 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.6 GFLOPS 176.8 GFLOPS 179.3 GFLOPS 177.6 GFLOPS
15 (1920 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.7 GFLOPS 176.6 GFLOPS 179.3 GFLOPS 178.9 GFLOPS
16 (2048 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.7 GFLOPS 176.7 GFLOPS 179.3 GFLOPS 178.7 GFLOPS

No gains to peak FLOPS to be found here, nor for 4 iterations at once:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (256 bytes) per thread 51.1 GFLOPS 82.5 GFLOPS 123.5 GFLOPS 93.2 GFLOPS 126.2 GFLOPS 152.9 GFLOPS
2 (512 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.7 GFLOPS 102.7 GFLOPS 163.2 GFLOPS 159.9 GFLOPS
3 (768 bytes) per thread 102.3 GFLOPS 163.5 GFLOPS 175.8 GFLOPS 102.9 GFLOPS 163.2 GFLOPS 162.3 GFLOPS
4 (1024 bytes) per thread 102.3 GFLOPS 164.5 GFLOPS 175.7 GFLOPS 104.9 GFLOPS 136.8 GFLOPS 163.1 GFLOPS
5 (1280 bytes) per thread 102.3 GFLOPS 163.5 GFLOPS 175.7 GFLOPS 103.9 GFLOPS 161.8 GFLOPS 163.9 GFLOPS
6 (1536 bytes) per thread 102.3 GFLOPS 163.4 GFLOPS 175.7 GFLOPS 102.9 GFLOPS 163.3 GFLOPS 163.4 GFLOPS
7 (1792 bytes) per thread 102.3 GFLOPS 163.5 GFLOPS 175.7 GFLOPS 104.9 GFLOPS 137.1 GFLOPS 158.5 GFLOPS
8 (2048 bytes) per thread 102.3 GFLOPS 164.5 GFLOPS 175.8 GFLOPS 102.8 GFLOPS 163.3 GFLOPS 162.1 GFLOPS
9 (2304 bytes) per thread 102.3 GFLOPS 164.4 GFLOPS 175.7 GFLOPS 104.0 GFLOPS 163.3 GFLOPS 162.9 GFLOPS
10 (2560 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.7 GFLOPS 102.5 GFLOPS 163.3 GFLOPS 163.1 GFLOPS
11 (2816 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.6 GFLOPS 104.0 GFLOPS 163.3 GFLOPS 162.3 GFLOPS
12 (3072 bytes) per thread 102.3 GFLOPS 164.5 GFLOPS 175.7 GFLOPS 103.7 GFLOPS 163.3 GFLOPS 160.8 GFLOPS
13 (3328 bytes) per thread 102.3 GFLOPS 164.5 GFLOPS 175.6 GFLOPS 102.8 GFLOPS 163.2 GFLOPS 162.7 GFLOPS
14 (3584 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 176.1 GFLOPS 103.9 GFLOPS 161.6 GFLOPS 162.9 GFLOPS
15 (3840 bytes) per thread 102.3 GFLOPS 164.4 GFLOPS 175.7 GFLOPS 103.3 GFLOPS 162.8 GFLOPS 162.7 GFLOPS
16 (4096 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 175.5 GFLOPS 102.6 GFLOPS 163.1 GFLOPS 162.5 GFLOPS

Going down to FP32xFP32=FP32, M1 Max:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (64 bytes) per thread 23.2 GFLOPS 46.4 GFLOPS 53.3 GFLOPS 81.1 GFLOPS 89.0 GFLOPS 104.1 GFLOPS
2 (128 bytes) per thread 46.4 GFLOPS 92.7 GFLOPS 106.5 GFLOPS 141.3 GFLOPS 176.8 GFLOPS 206.5 GFLOPS
3 (192 bytes) per thread 69.6 GFLOPS 139.1 GFLOPS 160.1 GFLOPS 213.3 GFLOPS 250.6 GFLOPS 244.9 GFLOPS
4 (256 bytes) per thread 92.7 GFLOPS 185.4 GFLOPS 214.0 GFLOPS 277.6 GFLOPS 325.5 GFLOPS 298.0 GFLOPS
5 (320 bytes) per thread 115.8 GFLOPS 231.7 GFLOPS 241.0 GFLOPS 321.3 GFLOPS 355.1 GFLOPS 347.7 GFLOPS
6 (384 bytes) per thread 139.0 GFLOPS 277.7 GFLOPS 271.2 GFLOPS 361.7 GFLOPS 387.1 GFLOPS 386.2 GFLOPS
7 (448 bytes) per thread 162.2 GFLOPS 324.2 GFLOPS 299.9 GFLOPS 383.4 GFLOPS 394.0 GFLOPS 400.9 GFLOPS
8 (512 bytes) per thread 185.5 GFLOPS 369.9 GFLOPS 335.8 GFLOPS 392.9 GFLOPS 405.8 GFLOPS 416.0 GFLOPS
9 (576 bytes) per thread 178.0 GFLOPS 353.4 GFLOPS 325.5 GFLOPS 396.9 GFLOPS 398.0 GFLOPS 409.2 GFLOPS
10 (640 bytes) per thread 183.1 GFLOPS 360.6 GFLOPS 335.3 GFLOPS 402.4 GFLOPS 401.2 GFLOPS 417.2 GFLOPS
11 (704 bytes) per thread 183.1 GFLOPS 363.0 GFLOPS 334.2 GFLOPS 403.2 GFLOPS 400.6 GFLOPS 415.8 GFLOPS
12 (768 bytes) per thread 185.2 GFLOPS 370.6 GFLOPS 335.5 GFLOPS 378.5 GFLOPS 397.7 GFLOPS 419.0 GFLOPS
13 (832 bytes) per thread 185.2 GFLOPS 369.4 GFLOPS 336.0 GFLOPS 404.2 GFLOPS 400.9 GFLOPS 414.1 GFLOPS
14 (896 bytes) per thread 185.5 GFLOPS 370.5 GFLOPS 336.4 GFLOPS 406.0 GFLOPS 402.9 GFLOPS 416.4 GFLOPS
15 (960 bytes) per thread 185.5 GFLOPS 370.0 GFLOPS 336.8 GFLOPS 405.7 GFLOPS 402.6 GFLOPS 409.6 GFLOPS
16 (1024 bytes) per thread 185.4 GFLOPS 370.4 GFLOPS 336.3 GFLOPS 406.0 GFLOPS 399.7 GFLOPS 405.3 GFLOPS

M2:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (64 bytes) per thread 25.6 GFLOPS 41.2 GFLOPS 61.7 GFLOPS 78.7 GFLOPS 98.4 GFLOPS 117.7 GFLOPS
2 (128 bytes) per thread 51.2 GFLOPS 82.3 GFLOPS 123.5 GFLOPS 157.7 GFLOPS 174.1 GFLOPS 170.4 GFLOPS
3 (192 bytes) per thread 76.7 GFLOPS 123.4 GFLOPS 179.5 GFLOPS 191.0 GFLOPS 216.9 GFLOPS 215.1 GFLOPS
4 (256 bytes) per thread 102.2 GFLOPS 164.6 GFLOPS 237.1 GFLOPS 231.8 GFLOPS 258.3 GFLOPS 263.3 GFLOPS
5 (320 bytes) per thread 127.8 GFLOPS 205.7 GFLOPS 279.1 GFLOPS 264.8 GFLOPS 285.7 GFLOPS 289.5 GFLOPS
6 (384 bytes) per thread 153.5 GFLOPS 226.0 GFLOPS 299.5 GFLOPS 286.6 GFLOPS 300.5 GFLOPS 308.3 GFLOPS
7 (448 bytes) per thread 179.0 GFLOPS 246.6 GFLOPS 300.6 GFLOPS 291.4 GFLOPS 302.4 GFLOPS 306.2 GFLOPS
8 (512 bytes) per thread 204.4 GFLOPS 269.7 GFLOPS 301.6 GFLOPS 299.4 GFLOPS 309.2 GFLOPS 310.4 GFLOPS
9 (576 bytes) per thread 204.6 GFLOPS 270.5 GFLOPS 302.9 GFLOPS 297.9 GFLOPS 304.7 GFLOPS 307.3 GFLOPS
10 (640 bytes) per thread 204.7 GFLOPS 270.3 GFLOPS 303.0 GFLOPS 300.2 GFLOPS 306.9 GFLOPS 308.9 GFLOPS
11 (704 bytes) per thread 204.6 GFLOPS 276.5 GFLOPS 308.4 GFLOPS 302.1 GFLOPS 305.8 GFLOPS 307.5 GFLOPS
12 (768 bytes) per thread 204.5 GFLOPS 270.5 GFLOPS 302.9 GFLOPS 299.9 GFLOPS 304.2 GFLOPS 307.5 GFLOPS
13 (832 bytes) per thread 204.6 GFLOPS 275.3 GFLOPS 307.9 GFLOPS 299.8 GFLOPS 306.4 GFLOPS 307.4 GFLOPS
14 (896 bytes) per thread 204.2 GFLOPS 270.5 GFLOPS 302.9 GFLOPS 299.6 GFLOPS 306.9 GFLOPS 310.6 GFLOPS
15 (960 bytes) per thread 204.5 GFLOPS 275.7 GFLOPS 308.5 GFLOPS 299.5 GFLOPS 305.5 GFLOPS 307.4 GFLOPS
16 (1024 bytes) per thread 204.6 GFLOPS 270.5 GFLOPS 302.8 GFLOPS 299.8 GFLOPS 306.9 GFLOPS 307.4 GFLOPS

M2 two at a time:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (128 bytes) per thread 51.2 GFLOPS 82.3 GFLOPS 123.5 GFLOPS 157.7 GFLOPS 196.4 GFLOPS 235.4 GFLOPS
2 (256 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 246.7 GFLOPS 316.2 GFLOPS 346.2 GFLOPS 339.7 GFLOPS
3 (384 bytes) per thread 153.4 GFLOPS 246.9 GFLOPS 313.0 GFLOPS 338.8 GFLOPS 355.6 GFLOPS 348.5 GFLOPS
4 (512 bytes) per thread 204.6 GFLOPS 328.8 GFLOPS 351.4 GFLOPS 355.6 GFLOPS 356.1 GFLOPS 349.9 GFLOPS
5 (640 bytes) per thread 204.5 GFLOPS 329.0 GFLOPS 351.2 GFLOPS 355.9 GFLOPS 356.1 GFLOPS 349.0 GFLOPS
6 (768 bytes) per thread 204.6 GFLOPS 329.3 GFLOPS 351.2 GFLOPS 350.8 GFLOPS 353.9 GFLOPS 356.3 GFLOPS
7 (896 bytes) per thread 204.5 GFLOPS 329.0 GFLOPS 346.6 GFLOPS 350.7 GFLOPS 356.1 GFLOPS 357.7 GFLOPS
8 (1024 bytes) per thread 204.7 GFLOPS 329.3 GFLOPS 351.4 GFLOPS 353.6 GFLOPS 358.2 GFLOPS 354.9 GFLOPS
9 (1152 bytes) per thread 204.6 GFLOPS 328.8 GFLOPS 351.2 GFLOPS 346.2 GFLOPS 358.4 GFLOPS 349.1 GFLOPS
10 (1280 bytes) per thread 204.5 GFLOPS 329.3 GFLOPS 351.6 GFLOPS 351.0 GFLOPS 354.9 GFLOPS 355.4 GFLOPS
11 (1408 bytes) per thread 204.4 GFLOPS 328.9 GFLOPS 351.3 GFLOPS 350.8 GFLOPS 358.3 GFLOPS 348.7 GFLOPS
12 (1536 bytes) per thread 204.6 GFLOPS 329.2 GFLOPS 351.4 GFLOPS 350.9 GFLOPS 356.0 GFLOPS 355.6 GFLOPS
13 (1664 bytes) per thread 204.4 GFLOPS 329.1 GFLOPS 351.1 GFLOPS 350.8 GFLOPS 356.1 GFLOPS 356.4 GFLOPS
14 (1792 bytes) per thread 204.7 GFLOPS 329.0 GFLOPS 351.2 GFLOPS 350.9 GFLOPS 356.2 GFLOPS 349.7 GFLOPS
15 (1920 bytes) per thread 204.5 GFLOPS 329.3 GFLOPS 351.2 GFLOPS 355.1 GFLOPS 356.0 GFLOPS 356.7 GFLOPS
16 (2048 bytes) per thread 204.6 GFLOPS 328.7 GFLOPS 351.0 GFLOPS 350.9 GFLOPS 356.1 GFLOPS 348.8 GFLOPS

And four at a time:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (256 bytes) per thread 102.3 GFLOPS 164.9 GFLOPS 247.0 GFLOPS 187.5 GFLOPS 250.9 GFLOPS 305.7 GFLOPS
2 (512 bytes) per thread 204.5 GFLOPS 326.9 GFLOPS 351.4 GFLOPS 208.2 GFLOPS 326.6 GFLOPS 323.5 GFLOPS
3 (768 bytes) per thread 204.6 GFLOPS 326.8 GFLOPS 351.4 GFLOPS 211.6 GFLOPS 320.5 GFLOPS 324.9 GFLOPS
4 (1024 bytes) per thread 204.6 GFLOPS 329.3 GFLOPS 351.3 GFLOPS 205.7 GFLOPS 326.6 GFLOPS 325.6 GFLOPS
5 (1280 bytes) per thread 204.6 GFLOPS 329.1 GFLOPS 351.2 GFLOPS 205.4 GFLOPS 326.5 GFLOPS 322.4 GFLOPS
6 (1536 bytes) per thread 204.6 GFLOPS 328.9 GFLOPS 351.4 GFLOPS 208.7 GFLOPS 318.2 GFLOPS 322.7 GFLOPS
7 (1792 bytes) per thread 204.6 GFLOPS 329.2 GFLOPS 351.4 GFLOPS 205.9 GFLOPS 326.4 GFLOPS 324.0 GFLOPS
8 (2048 bytes) per thread 204.5 GFLOPS 329.1 GFLOPS 351.4 GFLOPS 208.1 GFLOPS 326.5 GFLOPS 321.3 GFLOPS
9 (2304 bytes) per thread 204.6 GFLOPS 329.2 GFLOPS 351.4 GFLOPS 207.3 GFLOPS 323.6 GFLOPS 326.9 GFLOPS
10 (2560 bytes) per thread 204.6 GFLOPS 329.2 GFLOPS 351.4 GFLOPS 206.2 GFLOPS 320.8 GFLOPS 326.7 GFLOPS
11 (2816 bytes) per thread 204.5 GFLOPS 326.9 GFLOPS 346.5 GFLOPS 208.2 GFLOPS 326.3 GFLOPS 321.5 GFLOPS
12 (3072 bytes) per thread 204.6 GFLOPS 329.1 GFLOPS 351.1 GFLOPS 205.9 GFLOPS 326.5 GFLOPS 326.4 GFLOPS
13 (3328 bytes) per thread 204.5 GFLOPS 329.3 GFLOPS 351.4 GFLOPS 206.6 GFLOPS 323.5 GFLOPS 323.4 GFLOPS
14 (3584 bytes) per thread 204.6 GFLOPS 329.2 GFLOPS 351.4 GFLOPS 205.5 GFLOPS 326.4 GFLOPS 323.2 GFLOPS
15 (3840 bytes) per thread 204.6 GFLOPS 329.2 GFLOPS 351.0 GFLOPS 205.8 GFLOPS 326.5 GFLOPS 322.5 GFLOPS
16 (4096 bytes) per thread 204.6 GFLOPS 327.1 GFLOPS 351.2 GFLOPS 208.1 GFLOPS 324.0 GFLOPS 321.7 GFLOPS

FP16xFP16=FP16, M1 Max:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (64 bytes) per thread 46.3 GFLOPS 92.7 GFLOPS 106.7 GFLOPS 142.6 GFLOPS 189.5 GFLOPS 207.8 GFLOPS
2 (128 bytes) per thread 92.8 GFLOPS 185.2 GFLOPS 209.9 GFLOPS 312.7 GFLOPS 346.7 GFLOPS 408.2 GFLOPS
3 (192 bytes) per thread 139.2 GFLOPS 277.7 GFLOPS 317.5 GFLOPS 424.2 GFLOPS 507.9 GFLOPS 482.5 GFLOPS
4 (256 bytes) per thread 185.5 GFLOPS 370.4 GFLOPS 408.3 GFLOPS 552.6 GFLOPS 654.9 GFLOPS 589.2 GFLOPS
5 (320 bytes) per thread 231.5 GFLOPS 463.4 GFLOPS 479.3 GFLOPS 630.2 GFLOPS 711.5 GFLOPS 710.2 GFLOPS
6 (384 bytes) per thread 277.8 GFLOPS 556.3 GFLOPS 540.5 GFLOPS 721.1 GFLOPS 813.0 GFLOPS 734.5 GFLOPS
7 (448 bytes) per thread 324.8 GFLOPS 647.3 GFLOPS 607.7 GFLOPS 769.6 GFLOPS 789.4 GFLOPS 802.9 GFLOPS
8 (512 bytes) per thread 371.0 GFLOPS 739.4 GFLOPS 672.2 GFLOPS 810.8 GFLOPS 824.8 GFLOPS 813.7 GFLOPS
9 (576 bytes) per thread 354.5 GFLOPS 712.8 GFLOPS 652.8 GFLOPS 792.6 GFLOPS 796.6 GFLOPS 787.1 GFLOPS
10 (640 bytes) per thread 365.5 GFLOPS 717.8 GFLOPS 662.3 GFLOPS 798.5 GFLOPS 788.0 GFLOPS 828.2 GFLOPS
11 (704 bytes) per thread 365.2 GFLOPS 731.5 GFLOPS 670.2 GFLOPS 799.1 GFLOPS 806.8 GFLOPS 825.8 GFLOPS
12 (768 bytes) per thread 371.3 GFLOPS 740.5 GFLOPS 670.5 GFLOPS 797.7 GFLOPS 804.7 GFLOPS 830.4 GFLOPS
13 (832 bytes) per thread 370.7 GFLOPS 741.2 GFLOPS 671.3 GFLOPS 812.3 GFLOPS 801.4 GFLOPS 826.5 GFLOPS
14 (896 bytes) per thread 371.1 GFLOPS 740.0 GFLOPS 671.7 GFLOPS 806.5 GFLOPS 804.5 GFLOPS 835.0 GFLOPS
15 (960 bytes) per thread 370.6 GFLOPS 740.6 GFLOPS 671.1 GFLOPS 805.0 GFLOPS 804.6 GFLOPS 831.8 GFLOPS
16 (1024 bytes) per thread 369.1 GFLOPS 737.8 GFLOPS 816.6 GFLOPS 808.5 GFLOPS 807.5 GFLOPS 818.8 GFLOPS

M2:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (64 bytes) per thread 51.2 GFLOPS 82.3 GFLOPS 123.4 GFLOPS 157.6 GFLOPS 197.4 GFLOPS 234.4 GFLOPS
2 (128 bytes) per thread 102.3 GFLOPS 164.5 GFLOPS 246.9 GFLOPS 314.6 GFLOPS 350.2 GFLOPS 339.0 GFLOPS
3 (192 bytes) per thread 153.5 GFLOPS 246.8 GFLOPS 360.2 GFLOPS 380.8 GFLOPS 434.3 GFLOPS 424.7 GFLOPS
4 (256 bytes) per thread 204.6 GFLOPS 329.3 GFLOPS 471.7 GFLOPS 463.3 GFLOPS 519.7 GFLOPS 522.5 GFLOPS
5 (320 bytes) per thread 255.8 GFLOPS 411.0 GFLOPS 557.3 GFLOPS 524.1 GFLOPS 568.5 GFLOPS 572.6 GFLOPS
6 (384 bytes) per thread 306.8 GFLOPS 451.8 GFLOPS 599.1 GFLOPS 571.5 GFLOPS 607.2 GFLOPS 607.2 GFLOPS
7 (448 bytes) per thread 358.2 GFLOPS 493.7 GFLOPS 601.1 GFLOPS 580.0 GFLOPS 591.6 GFLOPS 610.4 GFLOPS
8 (512 bytes) per thread 409.2 GFLOPS 538.5 GFLOPS 603.2 GFLOPS 594.4 GFLOPS 608.4 GFLOPS 620.5 GFLOPS
9 (576 bytes) per thread 408.9 GFLOPS 540.7 GFLOPS 605.6 GFLOPS 583.0 GFLOPS 604.4 GFLOPS 617.9 GFLOPS
10 (640 bytes) per thread 408.8 GFLOPS 540.9 GFLOPS 605.4 GFLOPS 594.5 GFLOPS 614.2 GFLOPS 616.3 GFLOPS
11 (704 bytes) per thread 409.1 GFLOPS 553.3 GFLOPS 614.4 GFLOPS 603.7 GFLOPS 606.4 GFLOPS 614.8 GFLOPS
12 (768 bytes) per thread 409.2 GFLOPS 540.5 GFLOPS 605.8 GFLOPS 599.9 GFLOPS 608.6 GFLOPS 620.3 GFLOPS
13 (832 bytes) per thread 409.4 GFLOPS 550.2 GFLOPS 614.5 GFLOPS 594.7 GFLOPS 606.0 GFLOPS 608.0 GFLOPS
14 (896 bytes) per thread 408.7 GFLOPS 538.7 GFLOPS 606.1 GFLOPS 594.9 GFLOPS 608.5 GFLOPS 618.3 GFLOPS
15 (960 bytes) per thread 409.1 GFLOPS 551.0 GFLOPS 614.5 GFLOPS 594.3 GFLOPS 615.0 GFLOPS 607.6 GFLOPS
16 (1024 bytes) per thread 408.8 GFLOPS 541.0 GFLOPS 605.3 GFLOPS 594.7 GFLOPS 608.9 GFLOPS 621.0 GFLOPS

M2, two at a time:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (128 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 246.8 GFLOPS 314.4 GFLOPS 396.7 GFLOPS 472.1 GFLOPS
2 (256 bytes) per thread 204.6 GFLOPS 329.3 GFLOPS 493.8 GFLOPS 632.7 GFLOPS 696.4 GFLOPS 677.2 GFLOPS
3 (384 bytes) per thread 306.5 GFLOPS 493.9 GFLOPS 626.3 GFLOPS 680.0 GFLOPS 710.9 GFLOPS 702.7 GFLOPS
4 (512 bytes) per thread 409.0 GFLOPS 657.8 GFLOPS 701.5 GFLOPS 702.0 GFLOPS 712.3 GFLOPS 712.9 GFLOPS
5 (640 bytes) per thread 409.2 GFLOPS 658.7 GFLOPS 702.3 GFLOPS 708.4 GFLOPS 712.3 GFLOPS 714.1 GFLOPS
6 (768 bytes) per thread 409.2 GFLOPS 658.1 GFLOPS 702.8 GFLOPS 701.5 GFLOPS 712.4 GFLOPS 698.1 GFLOPS
7 (896 bytes) per thread 409.3 GFLOPS 658.4 GFLOPS 702.8 GFLOPS 710.9 GFLOPS 712.2 GFLOPS 709.4 GFLOPS
8 (1024 bytes) per thread 408.9 GFLOPS 658.3 GFLOPS 702.6 GFLOPS 700.4 GFLOPS 712.4 GFLOPS 712.6 GFLOPS
9 (1152 bytes) per thread 409.1 GFLOPS 658.4 GFLOPS 702.8 GFLOPS 701.3 GFLOPS 712.2 GFLOPS 698.1 GFLOPS
10 (1280 bytes) per thread 409.1 GFLOPS 658.5 GFLOPS 702.7 GFLOPS 701.4 GFLOPS 712.3 GFLOPS 697.7 GFLOPS
11 (1408 bytes) per thread 409.4 GFLOPS 658.6 GFLOPS 702.5 GFLOPS 702.0 GFLOPS 712.1 GFLOPS 698.6 GFLOPS
12 (1536 bytes) per thread 409.2 GFLOPS 658.4 GFLOPS 702.7 GFLOPS 704.1 GFLOPS 712.1 GFLOPS 697.9 GFLOPS
13 (1664 bytes) per thread 409.2 GFLOPS 657.2 GFLOPS 702.6 GFLOPS 701.9 GFLOPS 712.3 GFLOPS 698.7 GFLOPS
14 (1792 bytes) per thread 409.1 GFLOPS 658.3 GFLOPS 702.2 GFLOPS 711.5 GFLOPS 712.0 GFLOPS 710.3 GFLOPS
15 (1920 bytes) per thread 409.0 GFLOPS 657.4 GFLOPS 702.4 GFLOPS 701.7 GFLOPS 712.3 GFLOPS 714.4 GFLOPS
16 (2048 bytes) per thread 409.0 GFLOPS 658.2 GFLOPS 702.7 GFLOPS 707.3 GFLOPS 707.0 GFLOPS 715.4 GFLOPS

M2, four at a time:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (256 bytes) per thread 204.6 GFLOPS 329.1 GFLOPS 493.7 GFLOPS 403.3 GFLOPS 502.0 GFLOPS 608.1 GFLOPS
2 (512 bytes) per thread 409.1 GFLOPS 657.9 GFLOPS 702.7 GFLOPS 516.3 GFLOPS 636.5 GFLOPS 629.4 GFLOPS
3 (768 bytes) per thread 409.2 GFLOPS 658.1 GFLOPS 702.8 GFLOPS 510.1 GFLOPS 652.8 GFLOPS 642.5 GFLOPS
4 (1024 bytes) per thread 409.3 GFLOPS 658.3 GFLOPS 702.3 GFLOPS 504.4 GFLOPS 652.8 GFLOPS 644.4 GFLOPS
5 (1280 bytes) per thread 409.1 GFLOPS 658.4 GFLOPS 702.6 GFLOPS 515.9 GFLOPS 653.2 GFLOPS 648.9 GFLOPS
6 (1536 bytes) per thread 409.0 GFLOPS 658.4 GFLOPS 702.6 GFLOPS 516.0 GFLOPS 652.1 GFLOPS 642.9 GFLOPS
7 (1792 bytes) per thread 409.2 GFLOPS 658.2 GFLOPS 702.5 GFLOPS 510.2 GFLOPS 466.7 GFLOPS 643.1 GFLOPS
8 (2048 bytes) per thread 409.1 GFLOPS 658.1 GFLOPS 702.2 GFLOPS 516.1 GFLOPS 651.8 GFLOPS 643.0 GFLOPS
9 (2304 bytes) per thread 409.3 GFLOPS 657.7 GFLOPS 702.2 GFLOPS 501.7 GFLOPS 619.4 GFLOPS 646.4 GFLOPS
10 (2560 bytes) per thread 409.2 GFLOPS 658.7 GFLOPS 702.8 GFLOPS 516.2 GFLOPS 652.1 GFLOPS 635.1 GFLOPS
11 (2816 bytes) per thread 409.3 GFLOPS 650.2 GFLOPS 702.6 GFLOPS 504.3 GFLOPS 652.9 GFLOPS 638.9 GFLOPS
12 (3072 bytes) per thread 409.0 GFLOPS 658.4 GFLOPS 701.7 GFLOPS 515.3 GFLOPS 653.2 GFLOPS 643.3 GFLOPS
13 (3328 bytes) per thread 409.2 GFLOPS 650.1 GFLOPS 702.6 GFLOPS 516.2 GFLOPS 652.5 GFLOPS 636.3 GFLOPS
14 (3584 bytes) per thread 409.3 GFLOPS 649.5 GFLOPS 703.0 GFLOPS 516.0 GFLOPS 652.6 GFLOPS 627.6 GFLOPS
15 (3840 bytes) per thread 409.4 GFLOPS 658.4 GFLOPS 702.8 GFLOPS 516.2 GFLOPS 652.6 GFLOPS 640.9 GFLOPS
16 (4096 bytes) per thread 409.1 GFLOPS 658.3 GFLOPS 702.9 GFLOPS 504.0 GFLOPS 652.5 GFLOPS 638.1 GFLOPS

BF16xBF16=BF16, M2:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (64 bytes) per thread 51.2 GFLOPS 82.3 GFLOPS 123.5 GFLOPS 157.6 GFLOPS 197.1 GFLOPS 236.2 GFLOPS
2 (128 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 246.7 GFLOPS 314.0 GFLOPS 348.5 GFLOPS 337.2 GFLOPS
3 (192 bytes) per thread 153.3 GFLOPS 246.8 GFLOPS 358.2 GFLOPS 380.4 GFLOPS 431.5 GFLOPS 426.8 GFLOPS
4 (256 bytes) per thread 204.5 GFLOPS 329.2 GFLOPS 473.3 GFLOPS 464.6 GFLOPS 516.2 GFLOPS 532.0 GFLOPS
5 (320 bytes) per thread 255.3 GFLOPS 410.9 GFLOPS 558.2 GFLOPS 528.9 GFLOPS 570.6 GFLOPS 572.0 GFLOPS
6 (384 bytes) per thread 306.8 GFLOPS 452.0 GFLOPS 599.3 GFLOPS 572.4 GFLOPS 605.5 GFLOPS 607.1 GFLOPS
7 (448 bytes) per thread 357.9 GFLOPS 494.2 GFLOPS 601.6 GFLOPS 579.4 GFLOPS 601.6 GFLOPS 613.0 GFLOPS
8 (512 bytes) per thread 409.4 GFLOPS 538.6 GFLOPS 602.6 GFLOPS 594.5 GFLOPS 617.9 GFLOPS 616.6 GFLOPS
9 (576 bytes) per thread 409.2 GFLOPS 540.3 GFLOPS 606.1 GFLOPS 600.5 GFLOPS 604.3 GFLOPS 605.5 GFLOPS
10 (640 bytes) per thread 408.9 GFLOPS 539.8 GFLOPS 605.7 GFLOPS 594.9 GFLOPS 608.8 GFLOPS 611.5 GFLOPS
11 (704 bytes) per thread 408.7 GFLOPS 553.3 GFLOPS 614.7 GFLOPS 595.3 GFLOPS 606.2 GFLOPS 618.3 GFLOPS
12 (768 bytes) per thread 409.2 GFLOPS 540.9 GFLOPS 605.6 GFLOPS 598.7 GFLOPS 611.0 GFLOPS 608.8 GFLOPS
13 (832 bytes) per thread 409.2 GFLOPS 550.6 GFLOPS 614.4 GFLOPS 599.6 GFLOPS 611.2 GFLOPS 608.7 GFLOPS
14 (896 bytes) per thread 409.4 GFLOPS 540.5 GFLOPS 606.1 GFLOPS 594.9 GFLOPS 608.4 GFLOPS 612.6 GFLOPS
15 (960 bytes) per thread 408.7 GFLOPS 551.0 GFLOPS 614.7 GFLOPS 593.0 GFLOPS 607.4 GFLOPS 607.5 GFLOPS
16 (1024 bytes) per thread 409.0 GFLOPS 540.6 GFLOPS 605.6 GFLOPS 594.6 GFLOPS 616.6 GFLOPS 608.4 GFLOPS

Two at a time:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (128 bytes) per thread 102.3 GFLOPS 164.6 GFLOPS 246.9 GFLOPS 315.3 GFLOPS 392.8 GFLOPS 468.2 GFLOPS
2 (256 bytes) per thread 204.6 GFLOPS 329.1 GFLOPS 493.9 GFLOPS 629.2 GFLOPS 691.9 GFLOPS 681.5 GFLOPS
3 (384 bytes) per thread 306.9 GFLOPS 493.3 GFLOPS 626.1 GFLOPS 677.4 GFLOPS 711.0 GFLOPS 699.5 GFLOPS
4 (512 bytes) per thread 409.5 GFLOPS 658.3 GFLOPS 702.9 GFLOPS 707.9 GFLOPS 712.2 GFLOPS 697.7 GFLOPS
5 (640 bytes) per thread 409.3 GFLOPS 657.7 GFLOPS 702.5 GFLOPS 710.4 GFLOPS 712.1 GFLOPS 708.3 GFLOPS
6 (768 bytes) per thread 409.2 GFLOPS 657.7 GFLOPS 702.5 GFLOPS 702.1 GFLOPS 712.2 GFLOPS 697.2 GFLOPS
7 (896 bytes) per thread 409.0 GFLOPS 658.3 GFLOPS 702.6 GFLOPS 705.6 GFLOPS 712.2 GFLOPS 712.9 GFLOPS
8 (1024 bytes) per thread 409.1 GFLOPS 657.3 GFLOPS 702.3 GFLOPS 701.8 GFLOPS 712.0 GFLOPS 697.7 GFLOPS
9 (1152 bytes) per thread 409.1 GFLOPS 658.4 GFLOPS 702.7 GFLOPS 701.5 GFLOPS 712.1 GFLOPS 697.5 GFLOPS
10 (1280 bytes) per thread 409.0 GFLOPS 657.5 GFLOPS 702.7 GFLOPS 711.4 GFLOPS 712.3 GFLOPS 713.0 GFLOPS
11 (1408 bytes) per thread 409.0 GFLOPS 658.5 GFLOPS 702.5 GFLOPS 701.5 GFLOPS 712.4 GFLOPS 714.4 GFLOPS
12 (1536 bytes) per thread 409.8 GFLOPS 657.8 GFLOPS 702.9 GFLOPS 702.0 GFLOPS 712.2 GFLOPS 696.9 GFLOPS
13 (1664 bytes) per thread 409.1 GFLOPS 658.5 GFLOPS 701.4 GFLOPS 702.0 GFLOPS 712.3 GFLOPS 698.0 GFLOPS
14 (1792 bytes) per thread 409.1 GFLOPS 657.6 GFLOPS 702.9 GFLOPS 701.3 GFLOPS 712.3 GFLOPS 709.1 GFLOPS
15 (1920 bytes) per thread 409.2 GFLOPS 658.5 GFLOPS 702.8 GFLOPS 707.6 GFLOPS 712.0 GFLOPS 711.8 GFLOPS
16 (2048 bytes) per thread 409.1 GFLOPS 658.5 GFLOPS 702.7 GFLOPS 701.6 GFLOPS 712.4 GFLOPS 708.5 GFLOPS

Four at a time:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (256 bytes) per thread 204.5 GFLOPS 330.6 GFLOPS 494.1 GFLOPS 403.3 GFLOPS 502.1 GFLOPS 604.4 GFLOPS
2 (512 bytes) per thread 409.2 GFLOPS 649.7 GFLOPS 702.5 GFLOPS 508.3 GFLOPS 652.9 GFLOPS 638.7 GFLOPS
3 (768 bytes) per thread 408.8 GFLOPS 658.3 GFLOPS 702.6 GFLOPS 510.1 GFLOPS 630.0 GFLOPS 655.2 GFLOPS
4 (1024 bytes) per thread 409.2 GFLOPS 658.3 GFLOPS 702.6 GFLOPS 515.7 GFLOPS 637.4 GFLOPS 634.8 GFLOPS
5 (1280 bytes) per thread 409.2 GFLOPS 658.0 GFLOPS 702.9 GFLOPS 507.9 GFLOPS 651.6 GFLOPS 643.5 GFLOPS
6 (1536 bytes) per thread 409.3 GFLOPS 657.4 GFLOPS 702.7 GFLOPS 513.9 GFLOPS 641.4 GFLOPS 631.0 GFLOPS
7 (1792 bytes) per thread 409.2 GFLOPS 649.3 GFLOPS 702.8 GFLOPS 505.2 GFLOPS 652.3 GFLOPS 634.2 GFLOPS
8 (2048 bytes) per thread 409.1 GFLOPS 657.8 GFLOPS 702.4 GFLOPS 516.0 GFLOPS 629.6 GFLOPS 655.2 GFLOPS
9 (2304 bytes) per thread 409.2 GFLOPS 658.0 GFLOPS 702.3 GFLOPS 509.5 GFLOPS 652.2 GFLOPS 639.5 GFLOPS
10 (2560 bytes) per thread 409.1 GFLOPS 658.2 GFLOPS 702.7 GFLOPS 507.3 GFLOPS 652.0 GFLOPS 646.9 GFLOPS
11 (2816 bytes) per thread 409.2 GFLOPS 657.9 GFLOPS 702.6 GFLOPS 508.8 GFLOPS 651.9 GFLOPS 637.6 GFLOPS
12 (3072 bytes) per thread 409.0 GFLOPS 650.2 GFLOPS 702.6 GFLOPS 516.0 GFLOPS 653.0 GFLOPS 623.4 GFLOPS
13 (3328 bytes) per thread 409.2 GFLOPS 658.7 GFLOPS 702.7 GFLOPS 515.3 GFLOPS 652.6 GFLOPS 637.0 GFLOPS
14 (3584 bytes) per thread 409.6 GFLOPS 657.9 GFLOPS 702.8 GFLOPS 537.2 GFLOPS 622.0 GFLOPS 648.5 GFLOPS
15 (3840 bytes) per thread 409.4 GFLOPS 657.7 GFLOPS 702.8 GFLOPS 515.8 GFLOPS 653.2 GFLOPS 632.6 GFLOPS
16 (4096 bytes) per thread 409.2 GFLOPS 657.9 GFLOPS 702.9 GFLOPS 516.1 GFLOPS 652.9 GFLOPS 634.1 GFLOPS

Conclusions from all that:

  • No major performance improvement in an AMX cluster between M1 and M2
  • No major performance improvement from BF16 compared to FP16
  • No major performance improvement from using the two-at-a-time or four-at-a-time vector instruction modes

Thanks for the data! I guess that if FP32 becomes enough of a bottleneck in your calculations that you're considering BF16, it's best to just use the GPU instead of the AMX. I also realized that GPT-4 can help me work out GPU FP64 emulation, so there's less need to use the AMX.

I am curious about performance of interleaved complex multiplication. M2 can oversubscribe the AMX without changing maximum FLOPS. Could your benchmarks test a small sequence of instructions that reads the interleaved numbers from memory and tries to achieve maximum FLOPS?* I'll still test Accelerate BLAS but this would provide a more direct theoretical benchmark. Apple has to have provided some kind of real-world improvement from this ISA change. Maybe it's fixing underutilization during complex multiplication.

*My hypothesis: M1 Max should never exceed ~37.5% theoretical FLOPS, while M2 should reach ~75% maximum FLOPS.
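For reference, here is a minimal scalar version of the operation I have in mind (the function name and layout are illustrative, not taken from the benchmark repo):

```c
#include <stddef.h>

// Interleaved (re, im) pairs in memory, accumulating c += a * b element-wise.
// Each pair costs 4 multiplies and 4 adds/subtracts, which is the FLOP count
// an AMX complex-multiply benchmark would be measured against.
void cmul_accumulate(const float *a, const float *b, float *c, size_t n_pairs) {
    for (size_t i = 0; i < n_pairs; i++) {
        float ar = a[2 * i], ai = a[2 * i + 1];
        float br = b[2 * i], bi = b[2 * i + 1];
        c[2 * i]     += ar * br - ai * bi;  // real part
        c[2 * i + 1] += ar * bi + ai * br;  // imaginary part
    }
}
```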

Also disappointing: AMX vector throughput is less than CPU NEON vector throughput. Perhaps that's why Apple's BLAS library consistently underperforms OpenBLAS by a factor of two. Instead of using the NEON units in a multithreaded setting, the CPUs all fight for the same AMX block, which has lower theoretical FLOPS. The GPU would not have this limitation; its theoretical vector FLOPS actually exceed its theoretical matrix FLOPS.

For my purposes, I have the following FP64 throughputs:

  • CPU NEON: 388.5 GFLOPS
  • CPU AMX: 209.0 GFLOPS
  • GPU eFP64: 156.1-379.2 GFLOPS (1:28-68 from FP32)

The takeaway: when using any accelerator, your vector FP64 throughput is going to decrease, by approximately a factor of 2. The AMX is not better than the GPU in this regard. It would mostly help in the rare case of multiplying two large FP64 matrices. I recall that the two-stage eigendecomposition algorithm by Dongarra is technically O(n^3) in computational complexity, but that's because it's ~n layers of O(n^2) computations. There would be little opportunity to multiply two massive matrices even with the bulge-chasing stage. This principle probably also applies to the rest of linear algebra - which may be why OpenBLAS is faster than Accelerate for LU decomposition, or anything besides GEMM.

"Apple has to have provided some kind of real-world improvement from this ISA change."

It looks like four-at-a-time gets (up to) double the throughput when any broadcast mode other than mode 0 is used (provided you're not bottlenecked on Z accumulators). This suggests another bottleneck in the equations: bandwidth out of the (seemingly combined) X/Y register file. Mode 0 requires two loads from the register file per iteration, whereas the other modes need two loads on the first iteration but can then get away with only one load per iteration for subsequent iterations.
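To put rough numbers on that (illustrative bookkeeping only; it just counts 64-byte X/Y operand reads per four-at-a-time instruction):

```c
#include <stdio.h>

int main(void) {
    const double mode0 = 2.0 * 4;        // X and Y read on each of the 4 iterations
    const double bcast_first = 2 + 3;    // both operands once, then reuse one
    const double bcast_steady = 1.0 * 4; // if the reused operand stays resident
    printf("mode 0: %.0f reads/instruction\n", mode0);
    printf("broadcast modes: %.0f-%.0f reads/instruction (%.1f-%.1fx fewer)\n",
           bcast_steady, bcast_first, mode0 / bcast_first, mode0 / bcast_steady);
    return 0;
}
```

With a fixed read bandwidth out of the X/Y register file, that ratio is enough to account for the (up to) 2x difference between the two tables below.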

As a concrete example, vecfp F32xF32=F32 four-at-a-time mode 0:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (256 bytes) per thread 102.5 GFLOPS 164.8 GFLOPS 247.1 GFLOPS 286.1 GFLOPS 264.1 GFLOPS 298.3 GFLOPS
2 (512 bytes) per thread 204.9 GFLOPS 329.5 GFLOPS 351.2 GFLOPS 206.3 GFLOPS 326.6 GFLOPS 324.6 GFLOPS
3 (768 bytes) per thread 205.0 GFLOPS 329.3 GFLOPS 351.4 GFLOPS 211.5 GFLOPS 273.0 GFLOPS 327.9 GFLOPS
4 (1024 bytes) per thread 205.0 GFLOPS 327.1 GFLOPS 351.5 GFLOPS 205.9 GFLOPS 323.3 GFLOPS 324.2 GFLOPS
5 (1280 bytes) per thread 204.9 GFLOPS 329.5 GFLOPS 351.3 GFLOPS 206.1 GFLOPS 326.7 GFLOPS 316.7 GFLOPS
6 (1536 bytes) per thread 204.9 GFLOPS 329.5 GFLOPS 351.5 GFLOPS 208.1 GFLOPS 326.5 GFLOPS 325.3 GFLOPS
7 (1792 bytes) per thread 204.9 GFLOPS 328.6 GFLOPS 351.5 GFLOPS 208.1 GFLOPS 326.5 GFLOPS 325.1 GFLOPS
8 (2048 bytes) per thread 205.0 GFLOPS 327.2 GFLOPS 351.5 GFLOPS 206.1 GFLOPS 320.9 GFLOPS 324.5 GFLOPS
9 (2304 bytes) per thread 205.0 GFLOPS 329.5 GFLOPS 351.5 GFLOPS 209.2 GFLOPS 318.4 GFLOPS 325.3 GFLOPS
10 (2560 bytes) per thread 205.0 GFLOPS 329.5 GFLOPS 351.5 GFLOPS 205.4 GFLOPS 322.5 GFLOPS 325.1 GFLOPS
11 (2816 bytes) per thread 205.0 GFLOPS 329.4 GFLOPS 351.4 GFLOPS 206.7 GFLOPS 326.6 GFLOPS 326.9 GFLOPS
12 (3072 bytes) per thread 204.9 GFLOPS 327.2 GFLOPS 351.4 GFLOPS 208.1 GFLOPS 323.8 GFLOPS 327.9 GFLOPS
13 (3328 bytes) per thread 204.9 GFLOPS 329.4 GFLOPS 351.5 GFLOPS 205.6 GFLOPS 326.6 GFLOPS 326.6 GFLOPS
14 (3584 bytes) per thread 205.0 GFLOPS 327.2 GFLOPS 351.5 GFLOPS 205.8 GFLOPS 326.6 GFLOPS 324.9 GFLOPS
15 (3840 bytes) per thread 205.0 GFLOPS 329.4 GFLOPS 351.3 GFLOPS 206.4 GFLOPS 325.6 GFLOPS 323.4 GFLOPS
16 (4096 bytes) per thread 205.0 GFLOPS 329.4 GFLOPS 351.4 GFLOPS 206.9 GFLOPS 326.5 GFLOPS 325.6 GFLOPS

Versus any other broadcast mode:

Z Accumulators 1 Thread 2 Threads 3 Threads 4 Threads 5 Threads 6 Threads
1 (256 bytes) per thread 102.5 GFLOPS 164.7 GFLOPS 247.1 GFLOPS 286.8 GFLOPS 357.7 GFLOPS 368.6 GFLOPS
2 (512 bytes) per thread 205.0 GFLOPS 329.5 GFLOPS 494.3 GFLOPS 464.8 GFLOPS 502.7 GFLOPS 540.1 GFLOPS
3 (768 bytes) per thread 307.4 GFLOPS 410.9 GFLOPS 528.5 GFLOPS 505.9 GFLOPS 530.3 GFLOPS 549.7 GFLOPS
4 (1024 bytes) per thread 409.9 GFLOPS 505.2 GFLOPS 548.0 GFLOPS 541.9 GFLOPS 551.1 GFLOPS 559.2 GFLOPS
5 (1280 bytes) per thread 409.7 GFLOPS 505.3 GFLOPS 547.7 GFLOPS 541.9 GFLOPS 554.1 GFLOPS 553.6 GFLOPS
6 (1536 bytes) per thread 409.8 GFLOPS 505.3 GFLOPS 547.9 GFLOPS 542.0 GFLOPS 550.2 GFLOPS 559.3 GFLOPS
7 (1792 bytes) per thread 409.9 GFLOPS 505.0 GFLOPS 547.8 GFLOPS 541.9 GFLOPS 550.4 GFLOPS 559.5 GFLOPS
8 (2048 bytes) per thread 409.8 GFLOPS 505.3 GFLOPS 547.8 GFLOPS 542.1 GFLOPS 550.5 GFLOPS 559.4 GFLOPS
9 (2304 bytes) per thread 409.9 GFLOPS 505.5 GFLOPS 547.1 GFLOPS 541.8 GFLOPS 550.5 GFLOPS 554.9 GFLOPS
10 (2560 bytes) per thread 409.9 GFLOPS 505.3 GFLOPS 548.1 GFLOPS 541.8 GFLOPS 550.4 GFLOPS 559.3 GFLOPS
11 (2816 bytes) per thread 409.9 GFLOPS 505.4 GFLOPS 547.8 GFLOPS 540.9 GFLOPS 545.5 GFLOPS 557.9 GFLOPS
12 (3072 bytes) per thread 409.9 GFLOPS 505.2 GFLOPS 548.1 GFLOPS 542.0 GFLOPS 550.4 GFLOPS 559.1 GFLOPS
13 (3328 bytes) per thread 409.9 GFLOPS 505.4 GFLOPS 547.8 GFLOPS 542.6 GFLOPS 550.4 GFLOPS 549.7 GFLOPS
14 (3584 bytes) per thread 409.8 GFLOPS 505.4 GFLOPS 547.9 GFLOPS 545.1 GFLOPS 550.4 GFLOPS 559.0 GFLOPS
15 (3840 bytes) per thread 410.0 GFLOPS 505.3 GFLOPS 547.8 GFLOPS 544.8 GFLOPS 550.4 GFLOPS 558.7 GFLOPS
16 (4096 bytes) per thread 409.8 GFLOPS 505.2 GFLOPS 547.9 GFLOPS 541.9 GFLOPS 550.5 GFLOPS 555.1 GFLOPS