phuslu / lru

High performance LRU cache

Thoughts about throughput benchmark fairness

phuslu opened this issue · comments

After reading scalalang2/go-cache-benchmark#4, especially @maypok86's comment, I think the points below should be clarified for fairness:

  1. CPU core count (8/16 cores or higher)
  2. GitHub Actions environment (busy GitHub runners)
  3. Testing a real-world/bigtech workload (alibaba?)

Background: the focus of phuslu/lru development is web/HTTP use cases, so:

CPU core count (8/16 cores or higher)

I chose 8/16 cores for the benchmark because it's the common case (e.g. container specifications) and a "comfort zone" for Go applications; see #13, golang/go#65064, panjf2000/gnet#558.

GitHub Actions environment (busy GitHub runners)

As @maypok86 pointed out, it's unfair to cache-friendly implementations. But it is still similar to our production environment (k8s workers), and the ranking of the results seems stable, so I kept/respected them. We should append a disclaimer about this "unfairness".

Testing a real-world/bigtech workload (alibaba?)

I still have no good ideas; it seems hard to do this in GitHub Actions or on my ARM VPS.

The draft:

Disclaimer: This was tested on busy GitHub runners with 8 CPUs; the results may be very different in your real environment.

Regarding this comment in scalalang2/go-cache-benchmark#4 (comment):

To work around this, let's view a cache as a black box with a certain size in bytes (memory usage) and ignore any internals. Now you can create all test candidates with roughly the same memory usage and compare QPS and HITRATE. With this, you eliminate one of three variables, which makes comparison easier.

I have a similar idea. In this scenario I guess phuslu/lru will significantly increase its hit ratio and slightly decrease its throughput.
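One way to put this black-box idea into practice is sketched below: force a GC, read the live heap before and after filling each candidate, and tune the entry counts until all candidates land on roughly the same byte budget before comparing QPS and hit ratio. The Cache interface, the map-backed stand-in, and the sizes are illustrative assumptions, not part of any existing benchmark harness.

```go
// A minimal sketch (not an existing harness) of comparing caches at equal
// memory budgets: measure live heap before and after filling each candidate.
package main

import (
	"fmt"
	"runtime"
)

// Cache is a hypothetical common interface over the tested libraries.
type Cache interface {
	Get(key string) (string, bool)
	Set(key, value string)
}

// liveHeap forces a collection and returns the live heap size in bytes.
func liveHeap() int64 {
	runtime.GC()
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return int64(ms.HeapInuse)
}

// reportFootprint fills a candidate and prints its approximate heap cost,
// so every candidate can be tuned (by entry count) to the same byte budget.
func reportFootprint(name string, build func() Cache, entries int) {
	before := liveHeap()
	c := build()
	for i := 0; i < entries; i++ {
		c.Set(fmt.Sprintf("key-%d", i), "value")
	}
	after := liveHeap()
	fmt.Printf("%s: ~%d MiB for %d entries\n", name, (after-before)>>20, entries)
	runtime.KeepAlive(c)
}

func main() {
	// A trivial map-backed stand-in; real candidates would be the cache
	// libraries under test, each built with its own size parameter.
	reportFootprint("mapCache", func() Cache { return mapCache{m: map[string]string{}} }, 1_000_000)
}

type mapCache struct{ m map[string]string }

func (c mapCache) Get(k string) (string, bool) { v, ok := c.m[k]; return v, ok }
func (c mapCache) Set(k, v string)             { c.m[k] = v }
```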

Oh, I never thought I'd come back to this.

Throughput

  1. I never meant that your benchmarks are unfair to some particular implementation. I was just trying to make them a little better and to understand why certain choices were made.
  2. Unfortunately, I do not know how GitHub runners work (like the vast majority of people), and the problem I was talking about is that I could not reproduce a distribution similar to yours either locally or on a virtual machine. For example, Cloudflare showed much better results, and your implementation did not outperform the others convincingly enough. I have given several theories as to why this might happen; most likely it is caused by fewer available cores than needed, or by getting only a certain share of the vCPUs.
  3. Yes, 8-16 cores are usually the best choice. Sometimes you want to look at the behavior with more cores and more contention, but it's better to check the ability to scale across cores instead, so that there are no problems like the ones with epoll on large machines. A quick way to check this is shown in the benchmark sketch below.
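For point 3, one way to check core scaling without special tooling is the standard -cpu flag of go test, e.g. `go test -bench=Get -cpu=4,8,16`. The locked map below is only a stand-in for whichever cache is under test.

```go
package cache_test

import (
	"strconv"
	"sync"
	"testing"
)

// lockedMap is a trivial stand-in for the real cache under test.
type lockedMap struct {
	mu sync.RWMutex
	m  map[string]int
}

func (c *lockedMap) Get(k string) (int, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.m[k]
	return v, ok
}

func (c *lockedMap) Set(k string, v int) {
	c.mu.Lock()
	c.m[k] = v
	c.mu.Unlock()
}

func BenchmarkGet(b *testing.B) {
	const n = 1 << 15
	c := &lockedMap{m: make(map[string]int, n)}
	for i := 0; i < n; i++ {
		c.Set(strconv.Itoa(i), i)
	}
	b.ResetTimer()
	// RunParallel spawns GOMAXPROCS workers, so `-cpu=4,8,16` directly
	// controls how many cores contend on the cache.
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			c.Get(strconv.Itoa(i % n))
			i++
		}
	})
}
```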

Memory consumption

I have a similar idea. In this scenario I guess phuslu/lru will significantly increase its hit ratio and slightly decrease its throughput.

The idea is really not a bad one. I suspect the results of such benchmarks can easily be misinterpreted, but it's worth a try.

Unfortunately, I have a feeling that the author of go-freelru does not understand the difference between the memory cost of metadata and the cost of storing keys and values. For some reason, he applies principles from dedicated cache servers (redis, memcached). There, the only real limitation is the total amount of available RAM: redis users do not care how many entries are stored, only total RAM consumption matters. But when you don't support cost-based eviction, trying to rely on total RAM consumption is a serious mistake.

  1. go-freelru stores a much larger number of entries than the user requests. I already wrote about this in the previous thread, so I think everything is clear here. All of this ends up consuming too much memory: the total size of a key and value very often exceeds 100-150 bytes, which greatly exceeds the size of the metadata.
  2. But why it was necessary to use x4 in the benchmarks, I did not understand at all... after all, the cost of the real data there is simply huge. It looks like he was confused by the memory consumption benchmarks, but those specifically gauge the overhead, and everything is relatively normal there. Although x4 is still too much; okay, I don't understand it, and it had to be fixed in the next commit. :) A rough estimate of what the real data costs is sketched below.
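A rough back-of-the-envelope calculation of the point above, with assumed per-entry sizes (48 bytes of metadata, 150 bytes of key+value), shows why holding several times more entries than requested is dominated by payload rather than metadata:

```go
package main

import "fmt"

func main() {
	const (
		requested = 1_000_000 // entries the user asked the cache to hold
		stored    = 4_000_000 // hypothetical: entries the cache actually keeps room for
		metadataB = 48        // assumed per-entry metadata (hash, indexes, list links)
		payloadB  = 150       // assumed average key+value size for a web workload
	)

	fmt.Printf("metadata only for %d entries: ~%d MiB\n", stored, stored*metadataB>>20)
	fmt.Printf("with real keys/values:       ~%d MiB\n", stored*(metadataB+payloadB)>>20)
	fmt.Printf("what the user budgeted for:  ~%d MiB\n", requested*(metadataB+payloadB)>>20)
}
```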

What is an onheap cache?

A small digression about onheap caches, offheap caches, and dedicated cache servers.

Offheap caches and dedicated cache servers can, in principle, be considered together here. They have quite a few specific characteristics.

  1. Both can store tens of gigabytes of data.
  2. The cache server is usually slightly slower due to network costs. But keep in mind that the application is usually very close to the cache server, and there are many ways to reduce network costs (pipelining, plain TCP, etc.); a small pipelining sketch appears below.
  3. When the cache reaches large sizes, the question of data consistency arises. Cache servers handle it for you; with offheap caches, you will have to write your own degraded version of memcached/redis.
  4. Since such caches often reach large sizes, it is usually possible to get away with simple eviction policies (LRU, FIFO, etc.): it will be very difficult for an attacker to construct a scan of that size (though still possible), and storing a large number of entries compensates a bit for the weaknesses of the eviction policy.

For example, the solutions from the bigcache article are likely to raise a lot of questions in the system design section of an interview. It looks like they didn't need an offheap cache at all; they invented a problem for themselves, which they then heroically solved. The offheap cache in VictoriaMetrics looks like a good solution, but there it is more like the main storage.
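On point 2 about reducing network costs: a minimal pipelining sketch, assuming the github.com/redis/go-redis/v9 client and a Redis instance at localhost:6379 (the keys are illustrative), batches several lookups into a single round trip.

```go
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Queue several GETs and send them in one round trip instead of N.
	pipe := rdb.Pipeline()
	keys := []string{"user:1", "user:2", "user:3"}
	cmds := make([]*redis.StringCmd, 0, len(keys))
	for _, key := range keys {
		cmds = append(cmds, pipe.Get(ctx, key))
	}
	if _, err := pipe.Exec(ctx); err != nil && err != redis.Nil {
		fmt.Println("pipeline error:", err)
		return
	}
	for i, cmd := range cmds {
		fmt.Printf("%s = %q\n", keys[i], cmd.Val())
	}
}
```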

Onheap caches are usually used in a slightly different way.

  1. The size is usually small (I have not seen more than 10 million records or about 1 GB).
  2. Much faster than a cache server or offheap caches, which allows them to be used even in places where a cache miss costs little.
  3. It is much easier to break (by an attacker, and not only), so more advanced eviction policies are often used.
  4. A bundle of an onheap cache and a cache server often works well: the onheap cache absorbs load spikes and reduces latency, while the server stores a huge amount of data (sketched below).

So I usually think of onheap caches as the L1 cache of the processor. Small and super fast :).
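A minimal sketch of that bundle (point 4 above): a small, bounded in-process map in front of a remote store. Both layers here are simplified stand-ins, not any specific library.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errMiss = errors.New("miss")

// remote mimics a dedicated cache server (redis/memcached) holding the bulk of the data.
type remote struct{ data map[string]string }

func (r *remote) Get(key string) (string, error) {
	if v, ok := r.data[key]; ok {
		return v, nil
	}
	return "", errMiss
}

// tiered puts a small bounded onheap map in front of the remote store.
type tiered struct {
	mu    sync.Mutex
	local map[string]string
	cap   int
	far   *remote
}

func (t *tiered) Get(key string) (string, error) {
	t.mu.Lock()
	if v, ok := t.local[key]; ok {
		t.mu.Unlock()
		return v, nil // hot key served onheap: no network hop, lower latency
	}
	t.mu.Unlock()

	v, err := t.far.Get(key) // fall back to the cache server
	if err != nil {
		return "", err
	}

	t.mu.Lock()
	if len(t.local) < t.cap { // naive bound standing in for a real eviction policy
		t.local[key] = v
	}
	t.mu.Unlock()
	return v, nil
}

func main() {
	far := &remote{data: map[string]string{"user:1": "alice"}}
	c := &tiered{local: make(map[string]string), cap: 1024, far: far}
	fmt.Println(c.Get("user:1")) // first call hits the server, later calls stay onheap
}
```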

Hit ratio and GC

I noticed that you added a GC pause check. That's cool! But unfortunately there are several pitfalls.

  1. Most of the GC phases run in parallel with the main program, whereas runtime.GC performs the entire sequence and blocks until it finishes (see the sketch after this list).
  2. For some reason many people do not think about it, but cache misses cost so much that these delays are tiny in comparison.
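To illustrate point 1, a sketch contrasting the duration of a forced runtime.GC() cycle (which blocks until the whole, normally concurrent, cycle finishes) with the stop-the-world pauses the runtime itself records. The pointer-heavy map is just ballast standing in for a populated cache.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Ballast: a pointer-heavy map, standing in for a populated onheap cache.
	cache := make(map[string]*[128]byte, 1_000_000)
	for i := 0; i < 1_000_000; i++ {
		cache[fmt.Sprintf("key-%d", i)] = new([128]byte)
	}

	// Timing a forced runtime.GC() measures the whole (mostly concurrent) cycle...
	start := time.Now()
	runtime.GC()
	fmt.Println("forced full cycle:", time.Since(start))

	// ...while PauseNs records only the stop-the-world parts that actually
	// block the application; the most recent pause is at (NumGC+255)%256.
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Println("last STW pause:", time.Duration(ms.PauseNs[(ms.NumGC+255)%256]))

	runtime.KeepAlive(cache)
}
```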

In reality the results will be worse, but for credibility let's work through a scenario that is favorable to lru.

  1. GC usually starts every two minutes, so for convenience we will count the losses over the same interval.
  2. The difference between the hit ratio of lru and the hit ratio of more advanced eviction policies is usually 5+%. Let's assume the difference is really only 5%.
  3. RPS. The larger it is, the more advantageous it is for caches with an advanced eviction policy, but I don't want to take a very small value either. Let it be 50 rps; even that should be enough.
  4. Latency of a cache miss. Let's take 5 ms, as if redis with a huge number of stored items sits behind us. In reality this is often 100+ ms.

In total we have:
rps = 50
latency = 5ms
hit_diff = 5%

lost = rps * 2 * 60 * (hit_diff / 100) * latency = 50 * 2 * 60 * (5 / 100) * 5ms = 1.5s aka 1500ms
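The same estimate as a small helper, so other values of rps, miss latency, and hit-ratio gap can be plugged in:

```go
package main

import (
	"fmt"
	"time"
)

// lostPerGCInterval estimates the extra time spent on cache misses over one
// GC interval (2 minutes here) caused by a hit-ratio gap between policies.
func lostPerGCInterval(rps, hitDiffPct float64, missLatency time.Duration) time.Duration {
	requests := rps * (2 * time.Minute).Seconds()
	extraMisses := requests * hitDiffPct / 100
	return time.Duration(extraMisses * float64(missLatency))
}

func main() {
	fmt.Println(lostPerGCInterval(50, 5, 5*time.Millisecond)) // 1.5s
}
```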

P.S. I'm sorry for this wall of text.

The discussion in maypok86/otter#72 can serve as supplementary material for:

  1. why I chose and respect 8/16 cores
    Due to my work experience and Go runtime limitations (not only the epoll issues, but also NUMA support).
  2. why phuslu/lru focuses on memory usage/GC overhead
    Because in that scenario an in-memory cache acts like a fixed-size hashtable, and memory usage/GC overhead start to show their impact.
  3. why phuslu/lru has no eviction policy
    As above, an eviction policy is not a must-have feature in that case, so it was cut to decrease memory usage.
  4. why phuslu/lru allocates to the max size once at the beginning
    Because we assume it will quickly fill to 100% and we want to keep P99 latency stable. We can bear the memory usage at the beginning thanks to its smaller metadata (see the sketch below).
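An illustrative sketch of points 2 and 4 (not phuslu/lru's actual code): allocate the whole fixed-size table up front and link entries by index in a flat slice, which keeps per-entry metadata small, gives the GC fewer pointers to chase, and avoids growth-driven latency spikes.

```go
package main

import "fmt"

// entry keeps metadata compact: index-based links instead of *entry pointers.
type entry struct {
	key        string
	value      string
	prev, next uint32
}

// fixedTable is sized once at construction and never grows.
type fixedTable struct {
	entries []entry
	index   map[string]uint32
}

func newFixedTable(size int) *fixedTable {
	return &fixedTable{
		entries: make([]entry, size),           // single upfront allocation
		index:   make(map[string]uint32, size), // pre-sized, so inserts up to size don't rehash
	}
}

func main() {
	t := newFixedTable(1 << 20)
	fmt.Println("preallocated slots:", len(t.entries))
}
```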

@maypok86 sorry for the late reply. What you said above is totally correct, and I agree with almost all of it. The referenced issue describes a less common problem that I happened to encounter before, and it also affected the evolution of phuslu/lru, so I jumped in, discussed it a lot, and quoted it here.