flame / blis

BLAS-like Library Instantiation Software Framework

armsve: generic kernel and default cache values

egaudry opened this issue · comments

As a user running on a node based on the Neoverse V1 design, I'd like to use the armsve kernels with better performance than the NEON-based ones.

This issue is a follow-up to #613 and #612, where the question of using generic values for

BLIS_SVE_W_L1 # L1 number of sets
BLIS_SVE_N_L1 # L1 associativity degree
BLIS_SVE_C_L1 # L1 cache line size in bytes
BLIS_SVE_W_L2 # L2 number of sets
BLIS_SVE_N_L2 # L2 associativity degree
BLIS_SVE_C_L2 # L2 cache line size in bytes
BLIS_SVE_W_L3 # any big value
BLIS_SVE_N_L3 # 4 is OK
BLIS_SVE_C_L3 # any big value

was raised, as performance using armsve at d514658 was only 25% of that obtained when running with the thunderx2 kernels.
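
For reference, a minimal sketch (my own illustration, not BLIS code) of how these three parameters relate to the total cache size, i.e. size = W × N × C, and conversely W = size / (N × C):

#include <stdio.h>

/* W = number of sets, N = associativity, C = line size in bytes. */
static long cache_size(long W, long N, long C) { return W * N * C; }
static long num_sets(long size_bytes, long N, long C) { return size_bytes / (N * C); }

int main(void) {
    /* Example: a 64 KB, 4-way cache with 64-byte lines has 256 sets. */
    printf("W for 64 KB, 4-way, 64 B lines: %ld\n", num_sets(64 * 1024, 4, 64)); /* 256 */
    printf("size for W=256, N=4, C=64: %ld B\n", cache_size(256, 4, 64));        /* 65536 */
    return 0;
}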

As noted by @xrq-phys,

There are several macros/environment variables which need to be tuned to get good cache block sizes. It would be nice to have a way of getting values automatically.

The data found here https://en.wikichip.org/wiki/arm_holdings/microarchitectures/neoverse_v1 and https://developer.arm.com/documentation/101427/latest/ might help.

For the sake of gathering information, here are some excerpts from the Arm technical reference manual referenced above:

A6.1 About the L1 memory system
The Neoverse V1 L1 memory system is designed to enhance core performance and save power.
The L1 memory system consists of separate instruction and data caches. Both have a fixed size of 64KB.

A6.1.1 L1 instruction-side memory system
The L1 instruction memory system has the following key features:
• Virtually Indexed, Physically Tagged (VIPT) 4-way set-associative L1 instruction cache, which
behaves as a Physically Indexed, Physically Tagged (PIPT) cache
• Fixed cache line length of 64 bytes
• Pseudo-LRU cache replacement policy
• 256-bit read interface from the L2 memory system
• Optional instruction cache hardware coherency
The Neoverse V1 core also has a Virtually Indexed, Virtually Tagged (VIVT) 4-way skewed-associative,
Macro-OP (MOP) cache, which behaves as a PIPT cache.

A6.1.2 L1 data-side memory system
The L1 data memory system has the following features:
• Virtually Indexed, Physically Tagged (VIPT), which behaves as a Physically Indexed, Physically
Tagged (PIPT) 4-way set-associative L1 data cache
• Fixed cache line length of 64 bytes
• Pseudo-LRU cache replacement policy
• 512-bit write interface from the L2 memory system
• 512-bit read interface from the L2 memory system
• One 128-bit and two 256-bit read paths from the data L1 memory system to the datapath
• 256-bit write path from the datapath to the L1 memory system
A7.1 About the L2 memory system
The L2 memory subsystem consists of:
• An 8-way set associative L2 cache with a configurable size of 512KB or 1MB. Cache lines have a
fixed length of 64 bytes.
• ECC protection for all RAM structures except victim array.
• Strictly inclusive with L1 data cache. Weakly inclusive with L1 instruction cache.
• Configurable CHI interface to the DynamIQ Shared Unit (DSU) or CHI compliant system with
support for a 256-bit data width.
• Dynamic biased replacement policy.
• Modified Exclusive Shared Invalid (MESI) coherency
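
Applying size = W × N × C to these excerpts (my own arithmetic, assuming W, N, and C have the meanings listed at the top):

L1D: 64 KB  / (4 ways * 64 B/line) = 256 sets
L2 : 512 KB / (8 ways * 64 B/line) = 1024 sets, or
     1 MB   / (8 ways * 64 B/line) = 2048 sets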

Retrieved from a running system:

/sys/devices/system/cpu/cpu0/cache/index0/allocation_policy:ReadWriteAllocate
/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size:64
/sys/devices/system/cpu/cpu0/cache/index0/level:1
/sys/devices/system/cpu/cpu0/cache/index0/number_of_sets:1
/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0
/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_map:00000001
/sys/devices/system/cpu/cpu0/cache/index0/type:Data
/sys/devices/system/cpu/cpu0/cache/index0/write_policy:WriteBack
/sys/devices/system/cpu/cpu0/cache/index1/allocation_policy:ReadAllocate
/sys/devices/system/cpu/cpu0/cache/index1/coherency_line_size:64
/sys/devices/system/cpu/cpu0/cache/index1/level:1
/sys/devices/system/cpu/cpu0/cache/index1/number_of_sets:1
/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0
/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_map:00000001
/sys/devices/system/cpu/cpu0/cache/index1/type:Instruction
/sys/devices/system/cpu/cpu0/cache/index1/write_policy:WriteBack
/sys/devices/system/cpu/cpu0/cache/index2/allocation_policy:ReadWriteAllocate
/sys/devices/system/cpu/cpu0/cache/index2/coherency_line_size:64
/sys/devices/system/cpu/cpu0/cache/index2/level:2
/sys/devices/system/cpu/cpu0/cache/index2/number_of_sets:1
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_map:00000001
/sys/devices/system/cpu/cpu0/cache/index2/type:Unified
/sys/devices/system/cpu/cpu0/cache/index2/write_policy:WriteBack
/sys/devices/system/cpu/cpu0/cache/index3/allocation_policy:ReadWriteAllocate
/sys/devices/system/cpu/cpu0/cache/index3/coherency_line_size:64
/sys/devices/system/cpu/cpu0/cache/index3/level:3
/sys/devices/system/cpu/cpu0/cache/index3/number_of_sets:32768
/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-31
/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map:ffffffff
/sys/devices/system/cpu/cpu0/cache/index3/size:32768K
/sys/devices/system/cpu/cpu0/cache/index3/type:Unified
/sys/devices/system/cpu/cpu0/cache/index3/ways_of_associativity:16
/sys/devices/system/cpu/cpu0/cache/index3/write_policy:WriteBack
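
Note that on this system the L1/L2 entries report number_of_sets as 1 and omit size and ways_of_associativity, so only the L3 geometry is fully recoverable from sysfs here. Still, as a sketch of getting the values automatically (my own code, not part of BLIS), something like this could query sysfs where the fields are populated:

#include <stdio.h>

/* Read a single integer from a sysfs file; return -1 if unavailable. */
static long read_long(const char *path) {
    FILE *f = fopen(path, "r");
    long v = -1;
    if (f != NULL) {
        if (fscanf(f, "%ld", &v) != 1)
            v = -1;
        fclose(f);
    }
    return v;
}

int main(void) {
    const char *base = "/sys/devices/system/cpu/cpu0/cache";
    char path[128];
    /* Walk cpu0's cache indices and print W/N/C where available. */
    for (int idx = 0; idx < 4; idx++) {
        long level, sets, ways, line;
        snprintf(path, sizeof(path), "%s/index%d/level", base, idx);
        level = read_long(path);
        snprintf(path, sizeof(path), "%s/index%d/number_of_sets", base, idx);
        sets = read_long(path);
        snprintf(path, sizeof(path), "%s/index%d/ways_of_associativity", base, idx);
        ways = read_long(path);
        snprintf(path, sizeof(path), "%s/index%d/coherency_line_size", base, idx);
        line = read_long(path);
        printf("index%d: level=%ld W=%ld N=%ld C=%ld", idx, level, sets, ways, line);
        if (sets > 0 && ways > 0 && line > 0)
            printf(" -> %ld KB", sets * ways * line / 1024);
        printf("\n");
    }
    return 0;
}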

Based on this, I'm not sure how to set up the macros below correctly.
I read https://github.com/flame/blis/blob/master/docs/ConfigurationHowTo.md, which is more oriented toward x86, and used the following values, but performance remained terrible.

#define W_L1_SVE_DEFAULT 64
#define N_L1_SVE_DEFAULT 4
#define C_L1_SVE_DEFAULT 64
#define W_L2_SVE_DEFAULT 512
#define N_L2_SVE_DEFAULT 8
#define C_L2_SVE_DEFAULT 64
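
As a sanity check (my own arithmetic, not from the BLIS docs), these values would describe caches much smaller than the ones in the TRM excerpt above:

64  sets * 4 ways * 64 B = 16 KB   (vs. the 64 KB L1 data cache)
512 sets * 8 ways * 64 B = 256 KB  (vs. the 512 KB or 1 MB L2)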

I believe I missed some points regarding bli_cntx_init_armsve as well.

@devinamatthews @xrq-phys
this is the information I received for AWS Graviton 3:

L1 associativity 4, size 64KB, 256 sets, 64B lines
L2 associativity 8, size 1MB, 2k sets, 64B lines
L3 32MB, 64B lines, massive associativity

would this translate to (the L3 values are just a guess)

#define W_L1_SVE_DEFAULT 256
#define N_L1_SVE_DEFAULT 4
#define C_L1_SVE_DEFAULT 64
#define W_L2_SVE_DEFAULT 2048
#define N_L2_SVE_DEFAULT 8
#define C_L2_SVE_DEFAULT 64
#define W_L3_SVE_DEFAULT 8192
#define N_L3_SVE_DEFAULT 4
#define C_L3_SVE_DEFAULT 64

?
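
Checking those Graviton 3 numbers against size = W × N × C (my own arithmetic):

256  sets * 4 ways * 64 B = 64 KB L1  (matches)
2048 sets * 8 ways * 64 B = 1 MB  L2  (matches)
8192 sets * 4 ways * 64 B = 2 MB      (well below the stated 32 MB L3, but per the guidance at the top, any big value for W_L3/C_L3 and N_L3 = 4 may be fine)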

Sorry for the delay.

It seems strange to me that even with correctly set cache sizes you are only able to get 25% of the performance.

Another benchmark on V1 once showed a roughly 10% boost.

I'll see if I can find any Graviton 3 nodes available to me.

Thanks, I would indeed expect a performance bump.
FWIW, I'm using gcc-11 to build blis master (explicitly building the armsve configuration).

From reading the source, I do not see or understand how we would use an SVE256 implementation when running armsve on a Neoverse V1 chip: there is none.

I'm sorry, I don't understand.
Do you mean that no ASM kernels would be needed to get the best performance when running on 256-bit-wide SVE?

OK, thanks for your feedback; my understanding was obviously wrong then :).
Good luck with the Graviton 3 when you have time!

Hi.

I just tried to launch a C7g instance on AWS since it became generally available at the end of May.

However, I cannot seem to reproduce the 75% perf. decline you claimed in this issue. Rather, it's a 10% perf. gain:

--- run/blis_on_c7g> cat tx2.x/tx2.out.m | grep dgemm_nn_ccc | grep 360 # ThunderX2 config w/ NEON kernels.
blis_dgemm_nn_ccc                  360   360   360    20.73   9.64e-18   PASS
--- run/blis_on_c7g> cat neon.x/firestorm.out.m | grep dgemm_nn_ccc | grep 360 # Firestorm config w/ NEON kernels.
blis_dgemm_nn_ccc                  360   360   360    21.35   9.61e-18   PASS
--- run/blis_on_c7g> cat native.x/sve256.out.m | grep dgemm_nn_ccc | grep 360 # ArmSVE config w/ SVE kernels.
blis_dgemm_nn_ccc                  360   360   360    23.20   9.72e-18   PASS

The full output is here:
out.m.tar.gz