paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK

Home page: https://polkadot.network/


Test how many fully loaded cores we can support with multiple collators

eskimor opened this issue · comments

  • in different network conditions (bandwidth/latency)
  • With beefy/standard collators + cost estimate on the beefy ones
  • With simple optimizations like avoiding disk access (most of the time)/having real fast SSDs

Goal: Have some data on how far we can go without needing optimistic block production.

Local Zombienet-based testing results for scaling up to 5 collators.

HW spec: MacBook Pro M2
Validators + rococo runtime: master branch
Glutton collators: Slot Based Collator + Glutton block bloat + CheckWeight fix

3 relay chain validators and 3 cores. Config below:

max_candidate_depth = 6
allowed_ancestry_len = 2
max_validators_per_core = 1
scheduling_lookahead = 2
num_cores = 2

Glutton pallet configuration:

compute = "1000000000"
storage = "1000000000"
blockLength = "950000000"
trashDataCount = 5120
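
For orientation, these raw values are fixed-point numbers. Below is a minimal decoding sketch, assuming the Glutton pallet's usual `FixedU64` encoding where `1_000_000_000` equals 1.0, i.e. 100% of the respective per-block budget (the `blockLength` key comes from the block-bloat patch mentioned above, so its semantics are assumed here):

```rust
// Decoding the glutton genesis values, assuming FixedU64 semantics
// (1_000_000_000 == 1.0, i.e. 100% of the respective per-block budget).
fn main() {
    const ONE: u64 = 1_000_000_000;
    let params = [
        ("compute", 1_000_000_000u64),   // burn 100% of the block's weight
        ("storage", 1_000_000_000u64),   // fill 100% of the storage-proof budget
        ("blockLength", 950_000_000u64), // bloat the block to 95% of max length
    ];
    for (name, raw) in params {
        println!("{name}: {:.0}% of the limit", 100.0 * raw as f64 / ONE as f64);
    }
}
```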

Collation sizes produced in the test are ~5MB, burning 2s of reference-hardware CPU time in roughly 1.3s thanks to the M2's faster cores.

PoV size { header: 0.22kB, extrinsics: 4867.59kB, storage_proof: 118.97kB }
Compressed PoV size: 4986.9091796875kb

5 collators with 2s slot time, 1.3s execution

The parachain manages to include on average 1.5 candidates per relay chain block, leading to an effective block time of 4s. Two parachain forks are created per relay chain block because parachain block authorship is fully asynchronous and only obeys the slot. The import time window spans from the previous slot into the current one, and a parachain block is built on the best block available at the start of its slot.

*(screenshot: 2024-06-14 17:50)*

The slot-based collator implements an offset for the parachain slots: if the relay chain slot starts at T = 0, the 3 corresponding parachain slots start at 1s, 3s, and 5s.
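
A toy sketch of that offset arithmetic (the 1s offset and 2s parachain slot duration are the values used in this experiment):

```rust
// Parachain slot start times within one 6s relay chain slot,
// given the 1s offset and 2s parachain slots used in this test.
fn main() {
    let offset_s = 1.0;
    let para_slot_s = 2.0;
    let starts: Vec<f64> = (0..3).map(|i| offset_s + i as f64 * para_slot_s).collect();
    println!("parachain slots start at {starts:?}"); // [1.0, 3.0, 5.0]
}
```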

A summary of what happens:

  • Relay chain slot starts at T = 0
  • Collator A starts authoring block 1 at T = 1.0
  • Collator A finishes authoring and announces at T = ~2.3s
    • block 1 is backed
  • Collators start to import 1 at T = ~2.4s
  • Collator B starts to author block 1' (fork) at T = ~3.0s
  • Collators finish import of block 1 at T = ~3.7s
  • Collator B finishes authoring of block 1' (fork) and announces at T = ~4.3s
    • block 1' is backed by backing group
  • Collator C starts authoring block 2 at T = ~5.0s
  • Collators finish import of block 1' (fork) at T = ~5.6s
  • 1 block (block 1) is backed on chain; block 1' is backed and part of prospective parachains (it will soon be discarded as block 1 is included)
  • Next relay chain slot begins at T = 6s
  • Collator C announces block 2 at T = 6.3s
  • Collator D starts to author block 2' (fork) built on 1 at T = 7s.
  • All collators finish import of 2 at T = ~7.6s
  • Block 2 (built on 1) is backed in prospective parachains at T = 7.6s
  • Collator D announces block 2' at T = 8.3s
  • Collator E starts building 3 on top of 2 at T = 9s
  • All collators import block 2' at T = ~9.6s
  • Collator E announces block 3 built on top of 2 at T = 10.3s
  • Collator A starts authoring block 3' (fork) on top of 2 at T = 11s
  • All collators finish import of block 3 at T = ~11.6s
  • 1 block (block 2) is backed on chain; 2' and 3 are backed and part of prospective parachains
  • Next relay chain slot begins at T = 12s
  • 2 blocks (blocks 3, 4) are backed on chain; 4' and 5 are part of prospective parachains

As this loop goes on, the number of included candidates per relay chain block follows the pattern 1, 2, 1, 2, 1, ... The spikes happen at session boundaries.
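
As a quick check, the alternating 1, 2 inclusion pattern averages 1.5 candidates per 6s relay chain block, which is where the 4s effective block time comes from:

```rust
// Effective parachain block time implied by the inclusion pattern above.
fn main() {
    let relay_block_s = 6.0;
    let pattern = [1.0, 2.0]; // candidates included per relay chain block, repeating
    let avg = pattern.iter().sum::<f64>() / pattern.len() as f64;
    println!("effective block time: {:.1}s", relay_block_s / avg); // 4.0s
}
```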

5 collators with 3s slot time, 1.3s execution

*(pasted image: 2024-06-17 12:05)*

With an authoring duration of just 1.3s and slots of 3s, the collators can fully utilize the block space, yielding straight 3s blocks.

5 collators with 3s slot time, ~2s execution

Now bumping the glutton parameter to compute = "1450000000", which gives us a ~1.9s authoring duration on the collators:

[Parachain] 🎁 Prepared block for proposing at 3 (1909 ms)
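
This lines up with a back-of-the-envelope estimate, assuming authoring time scales roughly linearly with the `compute` parameter (an assumption, not something the logs confirm):

```rust
// Expected authoring duration after bumping compute from 1.0 to 1.45,
// assuming roughly linear scaling from the ~1.3s measured at compute = 1.0.
fn main() {
    let base_authoring_s = 1.3; // measured on the M2 at compute = 1.0
    let compute_factor = 1.45;  // 1_450_000_000 as FixedU64
    println!("expected: ~{:.2}s", base_authoring_s * compute_factor); // ~1.89s
}
```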

In this configuration the parachain always produces 2 forks per relay chain block, so its effective block time remains 6s. Below is a summary of the slot/import timings:

  • Relay chain slot starts at T = 0s
  • Collator A starts authoring 1 at T = 1s
  • Collator A announces 1 at T = 3s
  • Collator B starts authoring 1' at T = 4s
  • All collators import 1 at T = 5s
  • Collator B announces 1' at T = 6s
  • Nothing is backed on the relay chain; blocks 1 and 1' are in prospective parachains
  • Relay chain slot starts at T = 6s
  • Collator C starts authoring 2 on top of 1 at T = 7s
  • All collators import 1' at T = 8s
  • Collator C announces 2 at T = 9s
  • Collator D starts authoring 2' on top of 1' at T = 10s
  • All collators import 2 at T = 11s
  • Block 1 is backed on chain; blocks 2 and 2' are in prospective parachains
  • Relay chain slot starts at T = 12s
  • Block 2 is backed on chain; blocks 3 and 3' are in prospective parachains
  • ....
*(pasted image: 2024-06-17 14:27)*

Preliminary conclusion

As these local Zombienet experiments point out, parachains will waste slots by forking whenever 2 * authoring duration + network overhead > slot time.
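
A sketch that checks this condition against the runs above, plus the hypothetical 2x-faster hardware from the next steps; the 0.4s network overhead is the upper bound estimated below:

```rust
// Forking condition: 2 * authoring + network_overhead > slot_time.
fn main() {
    let overhead_s = 0.4; // assumed upper bound on network overhead
    let runs = [
        ("2s slots, 1.3s authoring", 2.0, 1.3),                // forks (4s effective blocks)
        ("3s slots, 1.3s authoring", 3.0, 1.3),                // straight 3s blocks
        ("3s slots, 1.9s authoring", 3.0, 1.9),                // forks (6s effective blocks)
        ("2s slots, 0.8s authoring (2x faster hw)", 2.0, 0.8), // would just fit
    ];
    for (name, slot_s, authoring_s) in runs {
        let wastes_slots = 2.0 * authoring_s + overhead_s > slot_s;
        println!("{name}: wastes slots by forking = {wastes_slots}");
    }
}
```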

With reference hardware and the current implementation, peak performance is:

  • Cores: 2 (3s block time)
  • Block len: 5MB
  • PoV len: 5MB
  • Authorship/import time: 1.3s

The upper bound on networking overhead is 400ms (3s slot - 2 × 1.3s = 0.4s).

Next steps

  • Versi testing with 10 collators, network impairment (latency)
  • Find instances with CPUs that offer the highest single-core performance; ideally we want 2x faster than reference hardware, which would allow us to use 3 cores (2s block times) with 1.6s execution, leaving 400ms for network overhead
  • Test unconnected collator behaviour to see how bad it is in practice. In theory, we only announce after we have imported, so an unconnected node will see the announcement after at least (execution_time + network_overhead) * hops, depending on how far away (hops) it is from the author; see the sketch below
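
A minimal sketch of that lower bound (the 1.3s execution and 0.4s overhead are the figures measured above; the hop counts are illustrative):

```rust
// Earliest time an unconnected node sees an announcement: every hop must
// import the block before re-announcing it.
fn announcement_delay_s(execution_s: f64, overhead_s: f64, hops: u32) -> f64 {
    (execution_s + overhead_s) * hops as f64
}

fn main() {
    for hops in 1..=3 {
        println!("{hops} hop(s): {:.1}s", announcement_delay_s(1.3, 0.4, hops));
    }
}
```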

Interesting results! Nice work!

Leaving some thoughts here:

  • if a collator is importing a block, it could wait for that to finish before authoring a fork (AFAICT from your experiments, the subsequent forks are never backed on chain, unless the first candidate is invalid). It's therefore wasted work usually, especially since it won't be accepted by prospective-parachains as a fork.
  • when doing tests on the macbook myself, I observed high variances in performance based on system load (something to keep in mind)
> if a collator is importing a block, it could wait for that to finish before authoring a fork (AFAICT from your experiments, the subsequent forks are never backed on chain, unless the first candidate is invalid). It's therefore wasted work usually, especially since it won't be accepted by prospective-parachains as a fork.

I don't think we can do this easily:

  • Afaik, even after header verification, block import could still fail.
  • We should pick the timings so that they work even if the block is full. Waiting for the block to import means that we start authoring our own block late, which might lead to the next node waiting for our block to import, and so on. The alternative would be to wait for the block to import but reduce the authoring time for our own block. But all that increases complexity, and in effect we only trade one block's execution time for another's.
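
A toy timeline of the cascade described above, assuming each collator waits for the previous block's 1.3s import before spending 1.3s authoring, with 2s slots:

```rust
// Drift when authoring is serialized behind import instead of pipelined.
fn main() {
    let (import_s, authoring_s, slot_s) = (1.3, 1.3, 2.0);
    let mut prev_announce = 0.0; // block 0 announced at T = 0 for simplicity
    for slot in 1..=4 {
        let slot_start = slot as f64 * slot_s;
        // wait for the previous block's import to finish before authoring
        let author_start = (prev_announce + import_s).max(slot_start);
        prev_announce = author_start + authoring_s;
        println!(
            "slot {slot}: author at {author_start:.1}s (slot starts {slot_start:.1}s), announce at {prev_announce:.1}s"
        );
    }
}
```
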
> when doing tests on the macbook myself, I observed high variances in performance based on system load (something to keep in mind)

Yes, I hit that problem myself, but after stopping rust-analyzer I got stable results with 3 validators and 5 collators. That's the most my laptop can do.