jcarreira / cirrus-kv

High-performance key-value store

Bandwidth Benchmark sometimes stalls on >=1MB puts

TylerADavis opened this issue · comments

The test benchmarks/throughput.cpp runs well on object sizes up to 50 kilobytes, but occasionally stalls on larger objects. This occurs in the bandwidth_benchmark branch. Because logging must be disabled to get representative speeds for the test, the cause of the stalls is not readily apparent. The benchmark had been run without resetting the server in between runs; could this cause issues? Errors reading "pthread_setaffinity_np error 22" were also thrown on occasion, and only in the later revisions of the test.
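For reference, error 22 is EINVAL on Linux, which pthread_setaffinity_np returns when the requested mask names no CPU the thread is allowed to run on, or a CPU outside the set the kernel supports. Nothing in this thread confirms that this is the cause here. The sketch below is a minimal, Linux-only illustration of pinning with an up-front range check; the function name `pin_current_thread_to_cpu` is hypothetical and not taken from the benchmark.

```cpp
#include <cerrno>
#include <pthread.h>
#include <sched.h>

// Hypothetical helper, not from cirrus-kv. Pins the calling thread to one CPU.
// Returns 0 on success or an errno-style code on failure, which is the same
// convention pthread_setaffinity_np uses.
int pin_current_thread_to_cpu(unsigned cpu) {
    // A CPU index that does not fit in cpu_set_t can never be valid.
    if (cpu >= static_cast<unsigned>(CPU_SETSIZE)) {
        return EINVAL;
    }
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // Still returns EINVAL if `cpu` is not available to this thread,
    // for example on a machine with fewer cores.
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}
```

Printing the requested CPU index alongside the return value whenever it is nonzero would show whether the benchmark asks for a core the machine does not have.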

Current speeds: (MB/s, messages/s) (at time of issue creation)
128 bytes: 20.7 MB/s, 162072
4K bytes: 556.371 MB/s, 135833
50K bytes: 2445.7 MB/s, 47767.9
1M bytes: 4442 MB/s, 4236.22
10M bytes: 4369.74 MB/s, 416.731
100M bytes: stalled entirely
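As a sanity check on the figures above, the two columns appear to be related by MB/s = messages/s × object size / 10^6, with K and M in the size column meaning 1024 and 1024² bytes. That relation is inferred from the numbers, not stated by the benchmark, and the function name below is hypothetical.

```cpp
#include <cstddef>

// Converts a message rate and an object size into MB/s,
// assuming 1 MB = 10^6 bytes (an inference from the posted numbers).
double throughput_mb_per_s(double messages_per_s, std::size_t object_bytes) {
    return messages_per_s * static_cast<double>(object_bytes) / 1e6;
}
```

For example, `throughput_mb_per_s(416.731, 10 * 1024 * 1024)` gives roughly 4369.74, matching the 10M row.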

Edit: ran the benchmark once more after resetting the remote server, and all tests ran, albeit after a long delay. Strangely, despite the tests taking so long, the results for transfer speeds are still rather high. This almost makes me think that the stall is happening outside of the timed section.

100M bytes: msg/s: 42.8607 bytes/s: 4494.27MB/s

~4.5 gigabytes/s is the highest I've seen any benchmark run
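One way to test the suspicion that the stall falls outside the timed section is to time each phase of the benchmark separately. This is a minimal sketch, not the code in benchmarks/throughput.cpp; the helper name `seconds_taken` is hypothetical.

```cpp
#include <chrono>
#include <utility>

// Runs `work` once and returns the wall-clock time it took, in seconds.
// steady_clock is used because it never jumps backwards.
template <typename F>
double seconds_taken(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Wrapping setup (connecting, allocating the object), the put loop, and teardown in separate calls and printing all three durations would show which phase absorbs the delay, without re-enabling logging inside the loop.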

  1. We should see significantly better performance with 128-byte objects. Can you create an issue to investigate the causes of this?

  2. We should have a way to log selectively. For instance, we may want to just log messages related to performance benchmarks.

  3. You can disable logging and use std::cout statements to debug this

@jcarreira To make sure my understanding is correct: we can pass in a threshold when we set the value of CIRRUS_LOG, so it should be possible to set a threshold that allows for something of the form LOG and LOG but not the regular LOG? Also, would it be better to switch CIRRUS_LOG to accept something of the form "all", "none", or "partial" instead of the current integer form?
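To make the threshold idea concrete, here is a minimal sketch of threshold-based filtering. The level names, their ordering, and the function `should_log` are all assumptions for illustration and are not taken from the cirrus-kv source.

```cpp
// Hypothetical levels; names and numeric values are assumptions.
enum class Level : int { Info = 1, Perf = 2, Error = 3 };

// A message is emitted only when logging is on (threshold > 0) and the
// message's level is at or above the threshold.
bool should_log(Level level, int threshold) {
    return threshold > 0 && static_cast<int>(level) >= threshold;
}
```

Under these assumptions a threshold of 0 silences everything, 1 allows everything, and 2 allows only performance and error messages, which is the "partial" case asked about.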

I think it works well now. You can do

export CIRRUS_LOG=1

to set logging on or

export CIRRUS_LOG=0

to set it off.
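For illustration, a program could read such a variable as sketched below. How cirrus-kv actually reads CIRRUS_LOG is not shown in this thread, so the function names and the choice to treat an unset or non-numeric value as off are assumptions.

```cpp
#include <cstdlib>

// Parses a CIRRUS_LOG-style value. A null or non-numeric string yields 0,
// which this sketch treats as "logging off".
int parse_log_setting(const char* value) {
    if (value == nullptr) {
        return 0;
    }
    return static_cast<int>(std::strtol(value, nullptr, 10));
}

// Reads the setting from the environment of the current process.
int log_setting_from_env() {
    return parse_log_setting(std::getenv("CIRRUS_LOG"));
}
```

Because the value comes from the process environment, the export has to happen in the shell that launches the benchmark, before it starts.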

At the moment, I haven't been experiencing these stalls on 1MB puts in the throughput benchmark. However, the segfault in #73 prevents testing larger sizes.
Edit: The segfault is now tracked in #76.

I've resolved the segfault, and have found no issues with stalling in the throughput benchmark on TCP. I'll look at the RDMA side of things further as that is where this issue originally appeared.

This is solved, correct?