bytedance / gopkg

Universal Utilities for Go

a little improvement for `cacheRemap16Byte`

shadw3002 opened this issue · comments

commented

Summary

It seems that the collection LSCQ uses the function cacheRemap16Byte to remap the index used by SCQ, as a best effort to avoid false sharing. Maybe we could refer to the index remapping of folly's UnboundedQueue and remap the index with fewer instructions.

However, I see no noticeable change in performance via BenchmarkDefault, perhaps because the function is not on the critical path.

Motivation

Improve cacheRemap16Byte so that it uses fewer instructions.

Explanation

Why does it work?

For scqsize = 2^n and k = 2*m + 1, define remap(i) = k*i % scqsize. We can prove that for any i_1 != i_2 with abs(i_1 - i_2) < scqsize, remap(i_1) != remap(i_2).

This property makes remap() a one-to-one map from any window [x, x + scqsize) of consecutive indices onto [0, scqsize).
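
As a tiny illustration with made-up values (scqsize = 8 and k = 3 are not the values LSCQ uses), the 8 consecutive indices 5..12 map onto 8 distinct slots:

package main

import "fmt"

func main() {
	const (
		scqsize = 8 // 2^3
		k       = 3 // any odd multiplier
	)
	// Prints: 7 2 5 0 3 6 1 4 (every slot in [0, 8) exactly once).
	for i := uint64(5); i < 5+scqsize; i++ {
		fmt.Print(k*i%scqsize, " ")
	}
	fmt.Println()
}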

Compare Assembly Code

package main

const (
	scqsize       = 1 << 16
	cacheLineSize = 64
)

func cacheRemap16ByteA(index uint64) uint64 {
	const cacheLineSize = cacheLineSize / 2
	rawIndex := index & uint64(scqsize-1)
	cacheLineNum := (rawIndex) % (scqsize / uint64(cacheLineSize))
	cacheLineIdx := rawIndex / (scqsize / uint64(cacheLineSize))
	return cacheLineNum*uint64(cacheLineSize) + cacheLineIdx
}

func cacheRemap16ByteB(index uint64) uint64 {
	return (index * 5) & (scqsize - 1)
}

func main() {}
        TEXT    main.cacheRemap16ByteA(SB), NOSPLIT|ABIInternal, $0-8
        FUNCDATA        $0, gclocals·g2BeySu+wFnoycgXfElmcg==(SB)
        FUNCDATA        $1, gclocals·g2BeySu+wFnoycgXfElmcg==(SB)
        FUNCDATA        $5, main.cacheRemap16ByteA.arginfo1(SB)
        FUNCDATA        $6, main.cacheRemap16ByteA.argliveinfo(SB)
        PCDATA  $3, $1
        MOVWLZX AX, CX
        ANDL    $2047, AX
        SHRQ    $11, CX
        SHLQ    $5, AX
        ADDQ    CX, AX
        RET
        TEXT    main.cacheRemap16ByteB(SB), NOSPLIT|ABIInternal, $0-8
        FUNCDATA        $0, gclocals·g2BeySu+wFnoycgXfElmcg==(SB)
        FUNCDATA        $1, gclocals·g2BeySu+wFnoycgXfElmcg==(SB)
        FUNCDATA        $5, main.cacheRemap16ByteB.arginfo1(SB)
        FUNCDATA        $6, main.cacheRemap16ByteB.argliveinfo(SB)
        PCDATA  $3, $1
        LEAQ    (AX)(AX*4), CX
        MOVWLZX CX, AX
        RET
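
For completeness, here is a minimal micro-benchmark sketch for measuring the two variants in isolation (this is not part of the package; the file and benchmark names are illustrative, and the sink variable keeps the compiler from eliminating the calls):

// remap_bench_test.go, placed next to the two functions above.
package main

import "testing"

var sink uint64

func BenchmarkRemapA(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = cacheRemap16ByteA(uint64(i))
	}
}

func BenchmarkRemapB(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = cacheRemap16ByteB(uint64(i))
	}
}

Run it with, e.g., `go test -bench=Remap -count=10` and compare the results with benchstat.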

It seems that the collection LSCQ uses the function cacheRemap16Byte to remap the index used by SCQ, as a best effort to avoid false sharing. Maybe we could refer to the index remapping of folly's UnboundedQueue and remap the index with fewer instructions.

Looks like they are not the same function.

package main

const (
	scqsize       = 128
	cacheLineSize = 64
)

func cacheRemap16ByteA(index uint64) uint64 {
	const cacheLineSize = cacheLineSize / 2
	rawIndex := index & uint64(scqsize-1)
	cacheLineNum := (rawIndex) % (scqsize / uint64(cacheLineSize))
	cacheLineIdx := rawIndex / (scqsize / uint64(cacheLineSize))
	return cacheLineNum*uint64(cacheLineSize) + cacheLineIdx
}

func cacheRemap16ByteB(index uint64) uint64 {
	return (index * 5) & (scqsize - 1)
}

func main() {
	for i := uint64(0); i < 10; i++ {
		println(cacheRemap16ByteA(i), cacheRemap16ByteB(i))
	}
}

Output:

0 0
32 5
64 10
96 15
1 20
33 25
65 30
97 35
2 40
34 45

We can see that 0, 32, 64, 96 are NOT in the same cache line, but 0, 5, 10, 15 are in the same cache line (false sharing).

However, I see no noticeable change in performance via BenchmarkDefault, perhaps because the function is not on the critical path.

The cacheRemap is an improvement only for multiple cores (+20%); for a single core it is a performance degradation (-15%). You can run the benchmark with different core counts (and with the cacheRemap disabled or enabled).
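
For reference, the -cpu flag of go test is the usual way to vary the core count here, e.g. `go test -bench=Default -cpu=1,4,16 -count=10` (the exact benchmark name and package path may differ).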

commented

Looks like they are not the same function.

Yes, version B provides another way to remap the index.

We can see that 0, 32, 64, 96 are NOT in the same cache line, but 0, 5, 10, 15 are in the same cache line (false sharing).

It seems that the result of cacheRemap16Byte is used to index [scqsize]scqNodePointer, and sizeof(scqNodePointer) == 16 B, so I think a gap of 4 * 16 B is enough to put ring[i] and ring[i+5] in different cache lines. Besides, version B is a general method: you can change 5 to any other suitable odd number.
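
A quick way to check this claim, as a standalone sketch assuming 16 B entries and 64 B cache lines (entrySize and lineSize are illustrative names, not identifiers from the package):

package main

import "fmt"

func main() {
	const (
		scqsize   = 1 << 16
		entrySize = 16 // assumed sizeof(scqNodePointer)
		lineSize  = 64
	)
	remap := func(i uint64) uint64 { return (i * 5) & (scqsize - 1) }

	// Consecutive remapped slots are 5 entries (80 B) apart, which is more than
	// one 64 B cache line, so neighbouring logical indices never share a line.
	for i := uint64(0); i < 8; i++ {
		slot := remap(i)
		fmt.Printf("logical %d -> slot %d -> cache line %d\n", i, slot, slot*entrySize/lineSize)
	}
}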

The cacheRemap is an improvement only for multiple cores (+20%); for a single core it is a performance degradation (-15%). You can run the benchmark with different core counts (and with the cacheRemap disabled or enabled).

Yes, I know, but "no noticeable change in performance via BenchmarkDefault" refers to the comparison between cacheRemap16ByteA and cacheRemap16ByteB, not to single-core versus multi-core for cacheRemap16ByteB.


Supplement to the above: develop...shadw3002:gopkg:feat/improve-remap

The properties that remap should have

One-to-one map

		// Verify that cacheRemap16Byte maps [0, scqsize) onto [0, scqsize)
		// with no collisions, i.e. that it is a one-to-one map.
		visited := make([]bool, scqsize)
		for i := uint64(0); i < scqsize; i++ {
			visited[cacheRemap16Byte(i)] = true
		}
		for i := uint64(0); i < scqsize; i++ {
			if !visited[i] {
				panic("unexpected")
			}
		}

For scqsize = 2^n and k = 2*m + 1, define remap(i) = k*i % scqsize. We can prove that for any i_1 != i_2 with abs(i_1 - i_2) < scqsize, remap(i_1) != remap(i_2).

Proof:

  • Suppose remap(i_1) == remap(i_2) with i_1 != i_2.
  • Then (2*m + 1) * abs(i_1 - i_2) mod 2^n == 0.
  • Since 2*m + 1 is odd, it is coprime to 2^n, so 2^n must divide abs(i_1 - i_2).
  • Then abs(i_1 - i_2) == x * 2^n >= 2^n == scqsize for some x >= 1, which contradicts abs(i_1 - i_2) < scqsize.
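
A brute-force check of the windowed statement above, as a standalone sketch (scqsize = 64 and k = 5 are arbitrary small values chosen only for the check):

package main

func main() {
	const (
		scqsize = 64 // 2^6
		k       = 5  // any odd multiplier
	)
	remap := func(i uint64) uint64 { return (i * k) % scqsize }

	// For every window start x, the scqsize consecutive indices in
	// [x, x+scqsize) must hit scqsize distinct slots.
	for x := uint64(0); x < 4*scqsize; x++ {
		seen := make(map[uint64]bool, scqsize)
		for i := x; i < x+scqsize; i++ {
			if seen[remap(i)] {
				panic("collision: remap is not one-to-one on this window")
			}
			seen[remap(i)] = true
		}
	}
	println("ok")
}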

Shuffle the index to access different cache lines

Just let k = cacheLineSize / entrySize + 1, for example 5 (64 B cache lines, 16 B entries). The check below reports any two consecutive indices that land on the same cache line.

		var (
			pre     = -1
			preline = -1
			nowline = -1
		)
		for i := 0; i < scqsize+1; i++ {
			// 4 == cacheLineSize / entrySize: four 16 B entries per 64 B cache line.
			line := int(cacheRemap16Byte(uint64(i)) / 4)
			nowline = line
			if preline == nowline {
				fmt.Printf("%d and %d are in the same cache line.\n", pre, i)
			}
			pre = i
			preline = nowline
		}

Benchmark

Because only a few instructions are saved, I did not find a performance difference with benchPair, the function that should best reflect the difference in theory.

goos: darwin
goarch: amd64
pkg: github.com/bytedance/gopkg/collection/lscq
cpu: Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
                     │   old.txt   │               new.txt               │
                     │   sec/op    │   sec/op     vs base                │
Default/Pair/LSCQ-12   45.09n ± 1%   46.64n ± 1%  +3.44% (p=0.000 n=100)

However, I ran the benchmark 100 times for each version and compared the results with the benchstat tool, and I actually found a small performance regression, so this improvement may not be necessary.
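
For reference, a typical way to produce such a comparison (the exact bench pattern and package path may differ):

go test -run=NONE -bench=Default/Pair -count=100 ./collection/lscq > old.txt
# check out the feat/improve-remap branch, then:
go test -run=NONE -bench=Default/Pair -count=100 ./collection/lscq > new.txt
benchstat old.txt new.txt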