vmihailenco / taskq

Golang asynchronous task/job queue with Redis, SQS, IronMQ, and in-memory backends

Home Page:https://taskq.uptrace.dev/

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Struct members passed to sync/atomic calls need to be 64-bit aligned on ARM and MIPS-32

shaunco opened this issue · comments

commented

Per the Go docs for sync/atomic at https://pkg.go.dev/sync/atomic#pkg-note-BUG:

On ARM, 386, and 32-bit MIPS, it is the caller's responsibility to arrange for 64-bit alignment of 64-bit words accessed atomically. The first word in a variable or in an allocated struct, array, or slice can be relied upon to be 64-bit aligned.

taskq uses sync/atomic operations on quite a few struct members that are 32 bits in size, which in turn makes them 32-bit aligned in memory. Under heavy stress on ARM64 and MIPS-32 platforms, we have seen quite a few instances of this causing issues.

Since none of the int32/uint32 values are repeated (that is, they are all just a single value rather than in arrays), the easiest fix for this is to simply make them int64 values and update the corresponding sync/atomic calls. Making this change has minimal impact on overall memory usage - approximately 100 bytes (25 single instance int32/uint32 variables change to int64/uint64).

it is the caller's responsibility to arrange for 64-bit alignment of 64-bit words accessed atomically

That is why we use 32-bit words (uint32), not uint64 as you suggest. AFAIK uint32 can be safely used with atomic.

commented

Unfortunately it is NOT safe due to the bug, which is why the sync/atomic docs point that out. The very first uint32 in a struct is safe to pass, but the remaining ones alternate between being aligned on a 32-bit boundary and being aligned on a 64-bit boundary.

Slightly more info in https://pkg.go.dev/golang.org/x/tools/go/analysis/passes/atomicalign?utm_source=godoc (not part of go vet yet)

I believe what you are quoting only applies to 64-bit numbers. Previously, I had to switch from uint64 to uint32 to make the code work.

You could try to reproduce the problem with race detector. It is easily reproducible on 386 with uint64, but uint32 passes the check.

commented

We were seeing very frequent memory corruption leading to pretty random panics on ARM Cortex-A72 running 64-bit Yocto Dunfell (3.1) with the 32-bit atomic operations and variables. We made this change in July and have not had any memory corruption or random panics on ARM since. We had sat on it for the last 5 months so we could keep observing production systems with the change to ensure this fixed things.

That said, looking at the atomicalign source code, they do only check for use of the 64-bit atomic methods.

Let this PR sit for this week and I will switch back to the 32-bit atomic methods and variables to see if the corruption and random panics come back.

👍 Just in case, you can easily run race detector for 386 platform:

env GOOS=linux GOARCH=386 go test -race