Bandwidth Benchmark sometimes stalls on >=1MB puts

Question

TylerADavis opened this issue 7 years ago · comments

The test benchmarks/throughput.cpp runs well on object sizes up to 50 kilobytes but occasionally stalls on larger objects. This occurs on the bandwidth_benchmark branch. Because logging must be disabled to get representative speeds from the test, the cause of the stalls is not readily apparent. The benchmark had been run without resetting the server in between runs; could this cause issues? "pthread_setaffinity_np error 22" errors were also thrown on occasion, and only in the later revisions of the test.
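
For reference, on Linux error number 22 is EINVAL, and pthread_setaffinity_np returns the error number directly rather than setting errno. One documented cause is an affinity mask that contains no CPU currently available to the thread. The sketch below is a hypothetical helper (pin_current_thread is not a name from the Cirrus sources) that validates the CPU index against the online CPU count before pinning and reports any failure. It assumes Linux with glibc.

```cpp
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

#include <cerrno>
#include <cstring>
#include <iostream>

// Hypothetical helper, not taken from the Cirrus sources.
// Pins the calling thread to `cpu`. Returns 0 on success or an errno-style
// code (for example EINVAL, which is 22 on Linux) on failure.
int pin_current_thread(int cpu) {
    const long online = sysconf(_SC_NPROCESSORS_ONLN);
    if (cpu < 0 || cpu >= online || cpu >= CPU_SETSIZE) {
        std::cerr << "cpu " << cpu << " is out of range (online CPUs: "
                  << online << ")\n";
        return EINVAL;
    }

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    // pthread_setaffinity_np returns the error number; it does not set errno.
    const int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0) {
        std::cerr << "pthread_setaffinity_np failed: " << rc << " ("
                  << std::strerror(rc) << ")\n";
    }
    return rc;
}
```

Printing the requested CPU index alongside the error code should make it easier to tell whether the later revisions of the test ask for a core that the machine does not have.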

Current speeds at time of issue creation (MB/s, messages/s):
128 bytes: 20.7 MB/s, 162072
4K bytes: 556.371 MB/s, 135833
50K bytes: 2445.7 MB/s, 47767.9
1M bytes: 4442 MB/s, 4236.22
10M bytes: 4369.74 MB/s, 416.731
100M bytes: stalled entirely

Edit: ran the benchmark once more after resetting the remote server, and all tests ran, albeit after a long delay. Strangely, despite the tests taking so long, the results for transfer speeds are still rather high. This almost makes me think that the stall is happening outside of the timed section.

100M bytes: 4494.27 MB/s, 42.8607 messages/s

~4.5 gigabytes/s is the highest throughput I've seen from any benchmark run.

João Carreira · Answer 1 · Fri Jun 09 2017 01:11:09 GMT+0800 (China Standard Time)

We should get significantly better performance with 128-byte objects. Can you create an issue to investigate the causes of this?
We should have a way to log selectively. For instance, we may want to log only messages related to performance benchmarks.
You can disable logging and use std::cout statements to debug this.

Tyler Davis · Answer 2 · Sat Jul 08 2017 09:07:02 GMT+0800 (China Standard Time)

@jcarreira To make sure I understand correctly: we can pass in a threshold when we set the value of CIRRUS_LOG, so it should be possible to set a threshold that allows something of the form LOG and LOG but not the regular LOG? Also, would it be better to switch CIRRUS_LOG to accept something of the form "all", "none", or "partial" instead of the current integer form?

João Carreira · Answer 3 · Sun Jul 09 2017 09:31:29 GMT+0800 (China Standard Time)

I think it works well now. You can do

export CIRRUS_LOG=1

to set logging on or

export CIRRUS_LOG=0

to set it off.
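
If finer-grained control is ever wanted, the threshold idea raised in the previous comment could be sketched as follows. This is not how Cirrus implements CIRRUS_LOG; the function names and the level numbering (1 for benchmark messages, 2 for regular messages) are assumptions made purely for illustration.

```cpp
#include <cstdlib>

// Hypothetical: reads an integer threshold from the environment variable
// `name`. Returns `fallback` when the variable is unset, empty, or not a
// plain base-10 integer.
int read_log_threshold(const char* name, int fallback) {
    const char* value = std::getenv(name);
    if (value == nullptr || *value == '\0') {
        return fallback;
    }
    char* end = nullptr;
    const long parsed = std::strtol(value, &end, 10);
    if (*end != '\0') {
        return fallback;  // trailing non-digit characters
    }
    return static_cast<int>(parsed);
}

// Hypothetical level scheme: a threshold of 0 disables logging; otherwise a
// message is emitted when its level is between 1 and the threshold, inclusive.
bool should_log(int message_level, int threshold) {
    return message_level >= 1 && message_level <= threshold;
}
```

Under this assumed scheme, CIRRUS_LOG=0 still turns logging off, CIRRUS_LOG=1 would emit only level-1 (benchmark) messages, and CIRRUS_LOG=2 would add regular messages.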

Tyler Davis · Answer 4 · Tue Jul 11 2017 06:44:06 GMT+0800 (China Standard Time)

At the moment, I'm not experiencing these stalls on 1MB puts in the throughput benchmark. However, the segfault in #73 prevents testing larger sizes.
Edit: Segfault is now in #76

Tyler Davis · Answer 5 · Wed Jul 12 2017 05:46:36 GMT+0800 (China Standard Time)

I've resolved the segfault, and have found no issues with stalling in the throughput benchmark on TCP. I'll look at the RDMA side of things further as that is where this issue originally appeared.

João Carreira · Answer 6 · Wed Aug 02 2017 06:45:15 GMT+0800 (China Standard Time)

This is solved, correct?

Tyler Davis · Answer 7 · Wed Aug 02 2017 06:46:22 GMT+0800 (China Standard Time)

I believe so, yes
