twest820 / AVX-512

AVX-512 documentation beyond what Intel provides

AVX-512 Instruction Groups and Programming Intrinsics

While AVX-512 is most visibly an extension of AVX and AVX2 to a 512 bit width, AVX-512VL instructions are 128 or 256 bits wide. The VL subset comprises 27% of AVX-512 intrinsics and is often of greater interest than 512 bit operation. AMD Zen 4 processors implement AVX-512 at 256 bit width and Intel processors may not be faster at 512 bits than they are at 256 bits. AVX-512 and AVX-512VL's primary advantage over previous instruction sets (AVX, AVX2, and FMA) is arguably a reduction in register spilling due to expansion from 16 ymm to 32 zmm registers and the addition of eight mask registers. The number of intrinsics triples to provide mask and maskz versions which make use of the mask registers. Of the 13 current instruction groups, four are general (F, VL, DQ, BW) and nine accelerate more specific workloads (BITALG, BF16, CD, FP16, IFMA52, VBMI, VBMI2, VNNI, and VPOPCNTDQ).
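
As a minimal sketch of the mask and maskz forms (illustrative code, not from this repo's spreadsheet; function names are hypothetical), here are the three variants of a 256 bit VL addition:

```c
#include <immintrin.h>

// Requires AVX-512F and AVX-512VL, e.g. compile with -mavx512f -mavx512vl.
// The same 256 bit addition in plain, mask, and maskz forms; each bit of k
// selects whether the corresponding float lane is written.
__m256 add_plain(__m256 a, __m256 b)
{
    return _mm256_add_ps(a, b); // all eight lanes written
}

__m256 add_mask(__m256 src, __mmask8 k, __m256 a, __m256 b)
{
    return _mm256_mask_add_ps(src, k, a, b); // unselected lanes copied from src
}

__m256 add_maskz(__mmask8 k, __m256 a, __m256 b)
{
    return _mm256_maskz_add_ps(k, a, b); // unselected lanes zeroed
}
```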

| group | what | AVX10 | instructions | intrinsics (VL) | Zen | Raptor, Golden Cove | Sunny Cove | Skylake | Knights |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| F | foundation | yes | 389 | 1435 | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | X, Xeon 2017+ | all |
| VL | 128 and 256 bit widths | yes | 223 | 1208 (1028) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | X, Xeon 2017+ | |
| DQ | doubleword and quadword | yes | 87 | 399 (176) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | X, Xeon 2017+ | |
| BW | byte and word | yes | 150 | 764 (446) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | X, Xeon 2017+ | |
| CD | conflict detection | yes | 8 | 42 (28) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | X, Xeon 2017+ | all |
| BITALG | population count expansion | yes | 5 | 24 (12) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | | |
| IFMA52 | big integer FMA | yes | 3 | 18 (12) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | (Cannon Lake) | |
| VBMI | vector byte manipulation | yes | 8 | 30 (20) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | (Cannon Lake) | |
| VBMI2 | vector byte manipulation | yes | 21 | 150 (100) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | | |
| VNNI | vector neural network | yes | 5 | 36 (24) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | Cascade Lake | |
| VPOPCNTDQ | population count | yes | 3 | 18 (12) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | | Mill |
| BF16 | bfloat16 | yes | 5 | 27 (18) | 4 | Emerald, Sapphire | | | |
| FP16 | half precision | yes | 96 | 938 (184) | | Emerald, Sapphire | | | |
| VP2INTERSECT | vector pair to mask pair | no | 2 | 6 | | | Tiger | | |
| ER | exponential and reciprocal | no | 12 | 60 | | | | | all |
| PF | prefetch | no | 9 | 20 | | | | | all |
| 4FMAPS | single precision 4x1 FMA | no | 4 | 12 | | | | | Mill |
| 4NNIW | vector neural network | no | 2 | 6 | | | | | Mill |
| total | | | 1031 | 5193 (2060) | | | | | |

AVX-512 was introduced by Intel in 2016 on Xeon Phi processors (Knights Landing and, later, Knights Mill). Beginning in Q3 2017, Intel Skylake X-series parts (i7 and i9) and Xeon processors enabled support for 3959 of the 5193 AVX-512 intrinsics now defined by Intel. In Q3 2019, Ice Lake (Sunny Cove microarchitecture) expanded the set to 4130 intrinsics and the Golden Cove microarchitecture expanded it to 5095 (announced Q4 2021, though parts shipping with AVX-512 enabled did not arrive until Sapphire Rapids in 2023). Xeon Phi's 4FMAPS, 4NNIW, and PF instruction groups have been superseded by more recent groups and architectural changes and thus appear to be obsolete. ER instructions are valuable to certain floating point calculations but have not been reimplemented.

For AMD parts, the table above is based on Phoronix's performance analysis, as AMD hasn't updated the AMD64 Architecture Programmer's Manual.

For Intel parts, the table above derives from the Intel Intrinsics Guide, Intel ARK, and the Intel 64 and IA-32 Architectures Software Developer's Manuals. It will therefore be inaccurate if Intel's information is inaccurate or if transcription errors were made. In particular, sections 15.2-4 of volume 1 of the architecture manual require software to check for F before using other groups. However, the Intrinsics Guide does not indicate corresponding dependencies for many groups. The spreadsheet in this repo lists each group's instructions and intrinsics.

AVX10

In July 2023, Intel announced AVX10. AVX10 formalizes consistent availability of 128, 256, and 512 bit instructions (AVX10/128, AVX10/256, and AVX10/512) across AVX-512 subsets (Intel 2023a, 2023b) and is expected to launch as AVX10.1 in 2024 via Granite Rapids Xeons. As of late 2023 it appears most likely Xeon P-cores will support AVX10/512 while E-cores (and possibly desktop P-cores) will support AVX10/256. AVX10 appears to be backwards compatible with existing AVX-512 code and VL intrinsics at a given width.

AVX-512 Availability

| release dates | processor | laptop, desktop | workstation, server |
| --- | --- | --- | --- |
| Q4 2023 | Emerald Rapids | | Silver, Gold, Platinum |
| Q2 to Q4 2023 | Zen 4 | | 7900, 8004, 9004, 97x4 |
| Q1 2023 | Sapphire Rapids | | W, Bronze, Silver, Gold, Platinum |
| Q3 2022 to Q4 2023 | Zen 4 | 7040, 7000 | |
| Q1 2021 to Q3 2021 | Rocket Lake | i5, i7, i9 | E, W |
| Q3 2020 to Q3 2021 | Tiger Lake | i3, i5, i7, i9 | W |
| Q2 2020 | Cooper Lake | | Gold, Platinum |
| Q3 2019 to Q2 2021 | Ice Lake | i3, i5, i7 | W, Silver, Gold, Platinum |
| Q2 2019 to Q1 2020 | Cascade Lake | i9 | W, Bronze, Silver, Gold, Platinum |
| Q3 2018 | Cannon Lake | i3-8121U | |
| Q3 2017 to Q2 2018 | Skylake | X-Series i7, i9 | W, Bronze, Silver, Gold, Platinum |
| Q4 2017 | Knights Mill | | Phi |
| Q2 2016 to Q4 2016 | Knights Landing | | Phi |

Prior to AMD's Zen 4 release in September 2022, AVX-512 was most readily (albeit somewhat briefly) available on Intel 11th generation i5, i7, and i9 parts (Rocket Lake) before being disabled in the 12th generation (Alder and, likely, Raptor Lake). Cascade Lake and Skylake Xeons provided AVX-512, along with more limited availability from Cooper, Tiger, and Ice Lake. Cannon Lake i3s (Palm Cove microarchitecture) are rare and the Kaby and Coffee Lake iterations of Skylake lack AVX-512. Alder Lake P-cores implement 14 instruction groups (Sunny Cove's instructions plus BF16, FP16, and VP2INTERSECT) but typically have AVX-512 fused off. AVX-512 can be enabled on early Alder Lake parts but Intel has suppressed this ability through microcode.

Specific processors are listed in this repo's spreadsheet and the table above uses Intel ARK release dates for Intel parts. No Pentium or Celeron processor supports AVX-512. AMD did not support AVX-512 prior to Zen 4.

Instruction Group Dependencies and CPUID Flags

The 18 AVX-512 instruction groups have individual CPUID flags and, in principle, an instruction could require an arbitrary number of groups be present. In practice this is rarely a concern, as the only dependencies which exist are on groups F and VL. Since VL is not present independent of F on any of Intel's processors, its dependency is always satisfied. Similarly, all processors with instruction groups containing VL dependent intrinsics implement VL. There are also 324 intrinsics of 128 or 256 bit width for which the Intrinsics Guide does not indicate a VL requirement. These are primarily ss and sd floating point intrinsics which modify only the low 32 or 64 bits of a register.
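
As a sketch of how such checks might look in practice, using GCC and Clang's __builtin_cpu_supports() rather than raw CPUID (the function name and choice of groups below are illustrative):

```c
#include <stdbool.h>

// Check F first, as the architecture manual requires, then the other
// groups of interest. Feature strings follow GCC/Clang conventions.
static bool can_use_avx512_vl_dq_bw(void)
{
    if (!__builtin_cpu_supports("avx512f"))
    {
        return false; // no foundation, so no other group may be used
    }
    return __builtin_cpu_supports("avx512vl") &&
           __builtin_cpu_supports("avx512dq") &&
           __builtin_cpu_supports("avx512bw");
}
```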

It appears reasonable to assume all Skylake and later processors with AVX-512 support will implement at least the F, CD, VL, DQ, and BW groups. While Intel could choose otherwise, doing so would complicate hardware implementation, compiler support, and software compatibility. It is also plausible the BITALG, IFMA52, VBMI, VBMI2, VNNI, and VPOPCNTDQ groups will be consistently implemented together from Ice Lake forward for the same reasons. However, Intel does not seem to have made any official statement regarding future compatibility. Similarly, AMD did not indicate ahead of release which groups Zen 4 would support, though its implementation has proven broadly compatible with Intel's.

Since AVX-512 is restricted to 512 bit widths on the now discontinued Xeon Phis, these processors do not implement the vector length extensions of the VL group. They therefore lack dependencies between groups and share only the F and CD groups with Skylake and later implementations.

Performance Considerations

The Skylake-Cascade Lake and Sunny Cove-Cypress Cove microarchitectures provide SIMD computation on ports 0, 1, and 5. It appears AVX-512 operation is obtained by combining ports 0 and 1 and, when two AVX-512 instructions per clock are supported, possibly by combining ports 5 and 6. In addition to downclocking when the instructions and threads in use cross thermal license boundaries, AVX-512 may sometimes be slower than implementing the same workload with AVX and AVX2. This occurs because the instruction rate decrease from 3x256 per clock to 2x512 may not be offset by wider loads and stores (Fog 2015, Stack Overflow), zmm register availability, or use of more efficient instructions. AVX-512 may be similarly disadvantageous on processors restricted to 1x512, which include Bronze, Silver, some 5000 series Gold, and D Skylake Xeons as well as Knights processors. As of September 2022, it is uncertain how the addition of port 10 and other microarchitectural changes in Golden Cove may alter these considerations.

In general, compute kernel throughput is sensitive to the instruction level parallelism available within the kernel's inner loop and to a processor's FMA, ALU, shift, floating point divide, shuffle, and load and store capabilities. For Ice Lake, Intel indicates one AVX-512 FMA and shuffle unit and two AVX-512 ALUs (e.g. Cutress 2019). Some kernels may therefore execute more quickly at 256 bit width due to accessing two FMA and shuffle units rather than one. Additionally, using the 1400 128 and 256 bit AVX-512VL intrinsics to reduce register spilling (32 registers are available rather than 16) may be more beneficial than the greater width of the 3700 512 bit intrinsics. In some cases 128 bit kernels can also be faster than 256 or 512 bit ones due to computational details of the kernel, such as dependencies between AVX lanes.
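
For illustration, here is the same multiply-accumulate kernel sketched at both widths (a hypothetical example, not code from this repo; it assumes n is a multiple of 16 and the arrays do not alias):

```c
#include <immintrin.h>

// out[i] += a[i] * b[i] at 256 bit width: eight floats per iteration, but
// potentially two FMA capable ports on Ice Lake.
void fma_256(float* out, const float* a, const float* b, int n)
{
    for (int i = 0; i < n; i += 8)
    {
        __m256 sum = _mm256_loadu_ps(out + i);
        sum = _mm256_fmadd_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i), sum);
        _mm256_storeu_ps(out + i, sum);
    }
}

// The same kernel at 512 bit width: sixteen floats per iteration through
// what may be a single AVX-512 FMA unit.
void fma_512(float* out, const float* a, const float* b, int n)
{
    for (int i = 0; i < n; i += 16)
    {
        __m512 sum = _mm512_loadu_ps(out + i);
        sum = _mm512_fmadd_ps(_mm512_loadu_ps(a + i), _mm512_loadu_ps(b + i), sum);
        _mm512_storeu_ps(out + i, sum);
    }
}
```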

It's therefore useful to profile SIMD implementations at 128, 256, and 512 bit widths across processors and across the amount of computation to be performed. In performance critical code segments, this can result in width dispatching controlled by loop content or iteration count rather than by which instructions the processor supports.
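
One possible shape for such dispatch, building on the kernels sketched above (the crossover threshold is a placeholder to be found by profiling, not a measured value):

```c
// Prototypes for the width specific kernels from the previous sketch.
void fma_256(float* out, const float* a, const float* b, int n);
void fma_512(float* out, const float* a, const float* b, int n);

// Hypothetical dispatcher: width is chosen by iteration count rather than
// by CPUID alone; the crossover point is workload and processor specific.
void fma_dispatch(float* out, const float* a, const float* b, int n)
{
    const int crossoverElements = 4096; // placeholder: profile to choose this
    if (n < crossoverElements)
    {
        fma_256(out, a, b, n); // smaller workloads stay at 256 bits
    }
    else
    {
        fma_512(out, a, b, n); // larger workloads may favor the 512 bit path
    }
}
```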
