twest820 / AVX-512

AVX-512 documentation beyond what Intel provides

AVX-512 Instruction Groups and Programming Intrinsics

While AVX-512 is most visibly an extension of AVX and AVX2 to a 512 bit width, AVX-512VL instructions are 128 or 256 bits wide. The VL subset comprises 27% of AVX-512 intrinsics and is often of greater interest than 512 bit operation. AMD Zen 4 processors implement AVX-512 at 256 bit width and Intel processors may not be faster at 512 bits than they are at 256 bits. AVX-512 and AVX-512VL's primary advantage over previous instruction sets (AVX, AVX2, and FMA) is arguably a reduction in register spilling due to expansion from 16 ymm to 32 zmm registers and the addition of eight mask registers. The number of intrinsics triples to provide mask and maskz versions which make use of the mask registers. Of the 13 current instruction groups, four are general (F, VL, DQ, BW) and nine accelerate more specific workloads (BITALG, BF16, CD, FP16, IFMA52, VBMI, VBMI2, VNNI, and VPOPCNTDQ).
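
As a minimal sketch of the mask and maskz forms (illustrative code, not from this repo's spreadsheet; function names are hypothetical), here are the three variants of a 256 bit VL addition:

```c
#include <immintrin.h>

// Requires AVX-512F and AVX-512VL, e.g. compile with -mavx512f -mavx512vl.
// The same 256 bit addition in plain, mask, and maskz forms; each bit of k
// selects whether the corresponding float lane is written.
__m256 add_plain(__m256 a, __m256 b)
{
    return _mm256_add_ps(a, b); // all eight lanes written
}

__m256 add_mask(__m256 src, __mmask8 k, __m256 a, __m256 b)
{
    return _mm256_mask_add_ps(src, k, a, b); // unselected lanes copied from src
}

__m256 add_maskz(__mmask8 k, __m256 a, __m256 b)
{
    return _mm256_maskz_add_ps(k, a, b); // unselected lanes zeroed
}
```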

| group | what | AVX10 | instructions | intrinsics (VL) | Zen | Raptor, Golden Cove | Sunny Cove | Skylake | Knights |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| F | foundation | yes | 389 | 1435 | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | X, Xeon 2017+ | all |
| VL | 128 and 256 bit widths | yes | 223 | 1208 (1028) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | X, Xeon 2017+ | |
| DQ | doubleword and quadword | yes | 87 | 399 (176) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | X, Xeon 2017+ | |
| BW | byte and word | yes | 150 | 764 (446) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | X, Xeon 2017+ | |
| CD | conflict detection | yes | 8 | 42 (28) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | X, Xeon 2017+ | all |
| BITALG | population count expansion | yes | 5 | 24 (12) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | | |
| IFMA52 | big integer FMA | yes | 3 | 18 (12) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | (Cannon Lake) | |
| VBMI | vector byte manipulation | yes | 8 | 30 (20) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | (Cannon Lake) | |
| VBMI2 | vector byte manipulation | yes | 21 | 150 (100) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | | |
| VNNI | vector neural network | yes | 5 | 36 (24) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | Cascade Lake | |
| VPOPCNTDQ | population count | yes | 3 | 18 (12) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | | Mill |
| BF16 | bfloat16 | yes | 5 | 27 (18) | 4 | Emerald, Sapphire | | | |
| FP16 | half precision | yes | 96 | 938 (184) | | Emerald, Sapphire | | | |
| VP2INTERSECT | vector pair to mask pair | no | 2 | 6 | | | Tiger | | |
| ER | exponential and reciprocal | no | 12 | 60 | | | | | all |
| PF | prefetch | no | 9 | 20 | | | | | all |
| 4FMAPS | single precision 4x1 FMA | no | 4 | 12 | | | | | Mill |
| 4NNIW | vector neural network | no | 2 | 6 | | | | | Mill |
| total | | | 1031 | 5193 (2060) | | | | | |

AVX-512 was introduced by Intel in 2016 on Xeon Phi processors (Knights Landing and, later, Knights Mill). Beginning in Q3 2017, Intel Skylake X-series parts (i7 and i9) and Xeon processors enabled support for 3959 of the 5193 AVX-512 intrinsics now defined by Intel. In Q3 2019, Ice Lake (Sunny Cove microarchitecture) expanded the set to 4130 intrinsics and the Golden Cove microarchitecture expanded it to 5095 (announced Q4 2021, though parts shipping with AVX-512 enabled did not arrive until Sapphire Rapids in 2023). Xeon Phi's 4FMAPS, 4NNIW, and PF instruction groups have been superseded by more recent groups and architectural changes and thus appear to be obsolete. ER instructions are valuable to certain floating point calculations but have not been reimplemented.

For AMD parts, the table above is based on Phoronix's performance analysis, as AMD hasn't updated the AMD64 Architecture Programmer's Manual.

For Intel parts, the table above derives from the Intel Intrinsics Guide, Intel ARK, and the Intel 64 and IA-32 Architectures Software Developer's Manuals. It will therefore be inaccurate if Intel's information is inaccurate or if transcription errors were made. In particular, sections 15.2-4 of volume 1 of the architecture manual require software to check for F before using other groups. However, the Intrinsics Guide does not indicate corresponding dependencies for many groups. The spreadsheet in this repo lists each group's instructions and intrinsics.

AVX10

In July 2023, Intel announced AVX10. AVX10 formalizes consistent availability of 128, 256, and 512 bit instructions (AVX10/128, AVX10/256, and AVX10/512) across AVX-512 subsets (Intel 2023a, 2023b) and is expected to launch as AVX10.1 in 2024 via Granite Rapids Xeons. As of late 2023 it appears most likely Xeon P-cores will support AVX10/512 while E-cores (and possibly desktop P-cores) will support AVX10/256. AVX10 appears to be backwards compatible with existing AVX-512 code and VL intrinsics at a given width.

AVX-512 Availability

| release dates | processor | laptop, desktop | workstation, server |
| --- | --- | --- | --- |
| Q4 2023 | Emerald Rapids | | Silver, Gold, Platinum |
| Q2 to Q4 2023 | Zen 4 | | 7900, 8004, 9004, 97x4 |
| Q1 2023 | Sapphire Rapids | | W, Bronze, Silver, Gold, Platinum |
| Q3 2022 to Q4 2023 | Zen 4 | 7040, 7000 | |
| Q1 2021 to Q3 2021 | Rocket Lake | i5, i7, i9 | E, W |
| Q3 2020 to Q3 2021 | Tiger Lake | i3, i5, i7, i9 | W |
| Q2 2020 | Cooper Lake | | Gold, Platinum |
| Q3 2019 to Q2 2021 | Ice Lake | i3, i5, i7 | W, Silver, Gold, Platinum |
| Q2 2019 to Q1 2020 | Cascade Lake | i9 | W, Bronze, Silver, Gold, Platinum |
| Q3 2018 | Cannon Lake | i3-8121U | |
| Q3 2017 to Q2 2018 | Skylake | X-Series i7, i9 | W, Bronze, Silver, Gold, Platinum |
| Q4 2017 | Knights Mill | | Phi |
| Q2 2016 to Q4 2016 | Knights Landing | | Phi |

Prior to AMD's Zen 4 release in September 2022, AVX-512 was most readily (albeit somewhat briefly) available on Intel 11th generation i5, i7, and i9 parts (Rocket Lake) before being disabled in the 12th generation (Alder and, likely, Raptor Lake). Cascade Lake and Skylake Xeons provided AVX-512, along with more limited availability from Cooper, Tiger, and Ice Lake. Cannon Lake i3s (Palm Cove microarchitecture) are rare and the Kaby and Coffee Lake iterations of Skylake lack AVX-512. Alder Lake P-cores implement 14 instruction groups (Sunny Cove's instructions plus BF16, FP16, and VP2INTERSECT) but typically have AVX-512 fused off. AVX-512 can be enabled on early Alder Lake parts but Intel has suppressed this ability through microcode.

Specific processors are listed in this repo's spreadsheet and the table above uses Intel ARK release dates for Intel parts. No Pentium or Celeron processor supports AVX-512. AMD did not support AVX-512 prior to Zen 4.

Instruction Group Dependencies and CPUID Flags

The 18 AVX-512 instruction groups have individual CPUID flags and, in principle, an instruction could require an arbitrary number of groups be present. In practice this is rarely a concern, as the only dependencies which exist are on groups F and VL. Since VL is not present independent of F on any of Intel's processors, its dependency is always satisfied. Similarly, all processors with instruction groups containing VL dependent intrinsics implement VL. There are also 324 intrinsics of 128 or 256 bit width for which the Intrinsics Guide does not indicate a VL requirement. These are primarily ss and sd floating point intrinsics which modify only the low 32 or 64 bits of a register.
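
As a sketch of how such checks might look in practice, using GCC and Clang's __builtin_cpu_supports() rather than raw CPUID (the function name and choice of groups below are illustrative):

```c
#include <stdbool.h>

// Check F first, as the architecture manual requires, then the other
// groups of interest. Feature strings follow GCC/Clang conventions.
static bool can_use_avx512_vl_dq_bw(void)
{
    if (!__builtin_cpu_supports("avx512f"))
    {
        return false; // no foundation, so no other group may be used
    }
    return __builtin_cpu_supports("avx512vl") &&
           __builtin_cpu_supports("avx512dq") &&
           __builtin_cpu_supports("avx512bw");
}
```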

It appears reasonable to assume all Skylake and later processors with AVX-512 support will implement at least the F, CD, VL, DQ, and BW groups. While Intel could choose otherwise, doing so would complicate hardware implementation, compiler support, and software compatibility. It is also plausible the BITALG, IFMA52, VBMI, VBMI2, VNNI, and VPOPCNTDQ groups will be consistently implemented together from Ice Lake forward for the same reasons. However, Intel does not seem to have made any official statement regarding future compatibility. Similarly, AMD did not indicate ahead of release which groups Zen 4 would support, though its implementation has proven broadly compatible with Intel's.

Since AVX-512 is restricted to 512 bit widths on the now discontinued Xeon Phis, these processors do not implement the vector length extensions of the VL group. They therefore lack dependencies between groups and share only the F and CD groups with Skylake and later implementations.

Performance Considerations

The Skylake-Cascade Lake and Sunny Cove-Cypress Cove microarchitectures provide SIMD computation on ports 0, 1, and 5. It appears AVX-512 operation is obtained by combining ports 0 and 1 and, when two AVX-512 instructions per clock are supported, possibly by combining ports 5 and 6. In addition to downclocking when the instructions and threads in use cross thermal license boundaries, AVX-512 may sometimes be slower than implementing the same workload with AVX and AVX2. This occurs because the instruction rate decrease from 3x256 per clock to 2x512 may not be offset by wider loads and stores (Fog 2015, Stack Overflow), zmm register availability, or use of more efficient instructions. AVX-512 may be similarly disadvantageous on processors restricted to 1x512, which include Bronze, Silver, some 5000 series Gold, and D Skylake Xeons as well as Knights processors. As of September 2022, it is uncertain how the addition of port 10 and other microarchitectural changes in Golden Cove may alter these considerations.

In general, compute kernel throughput is sensitive to the instruction level parallelism available within the kernel's inner loop and to a processor's FMA, ALU, shift, floating point divide, shuffle, and load and store capabilities. For Ice Lake, Intel indicates one AVX-512 FMA and shuffle unit and two AVX-512 ALUs (e.g. Cutress 2019). Some kernels may therefore execute more quickly at 256 bit width due to accessing two FMA and shuffle units rather than one. Additionally, using the 1400 128 and 256 bit AVX-512VL intrinsics to reduce register spilling (32 registers are available rather than 16) may be more beneficial than the greater width of the 3700 512 bit intrinsics. In some cases 128 bit kernels can also be faster than 256 or 512 bit ones due to computational details of the kernel, such as dependencies between AVX lanes.
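
For illustration, here is the same multiply-accumulate kernel sketched at both widths (a hypothetical example, not code from this repo; it assumes n is a multiple of 16 and the arrays do not alias):

```c
#include <immintrin.h>

// out[i] += a[i] * b[i] at 256 bit width: eight floats per iteration, but
// potentially two FMA capable ports on Ice Lake.
void fma_256(float* out, const float* a, const float* b, int n)
{
    for (int i = 0; i < n; i += 8)
    {
        __m256 sum = _mm256_loadu_ps(out + i);
        sum = _mm256_fmadd_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i), sum);
        _mm256_storeu_ps(out + i, sum);
    }
}

// The same kernel at 512 bit width: sixteen floats per iteration through
// what may be a single AVX-512 FMA unit.
void fma_512(float* out, const float* a, const float* b, int n)
{
    for (int i = 0; i < n; i += 16)
    {
        __m512 sum = _mm512_loadu_ps(out + i);
        sum = _mm512_fmadd_ps(_mm512_loadu_ps(a + i), _mm512_loadu_ps(b + i), sum);
        _mm512_storeu_ps(out + i, sum);
    }
}
```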

It's therefore useful to profile SIMD implementations at 128, 256, and 512 bit widths across processors and across the amount of computation to be performed. In performance critical code segments, this can result in width dispatching controlled by loop content or iteration count rather than by which instructions the processor supports.
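
One possible shape for such dispatch, building on the kernels sketched above (the crossover threshold is a placeholder to be found by profiling, not a measured value):

```c
// Prototypes for the width specific kernels from the previous sketch.
void fma_256(float* out, const float* a, const float* b, int n);
void fma_512(float* out, const float* a, const float* b, int n);

// Hypothetical dispatcher: width is chosen by iteration count rather than
// by CPUID alone; the crossover point is workload and processor specific.
void fma_dispatch(float* out, const float* a, const float* b, int n)
{
    const int crossoverElements = 4096; // placeholder: profile to choose this
    if (n < crossoverElements)
    {
        fma_256(out, a, b, n); // smaller workloads stay at 256 bits
    }
    else
    {
        fma_512(out, a, b, n); // larger workloads may favor the 512 bit path
    }
}
```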
