UDP Performance Issues

Question

GaetanLongree opened this issue 6 years ago

As I mentioned a while back on Gitter, during an internship project I attempted to test the UDP performance of unikernels by developing a very (probably extremely) simplistic DNS server using the UDP sockets in IncludeOS (code for the DNS server here).

Because my environment was command-line only, I used DNSPerf as a benchmarking tool to send queries from a benchmarking server (100 simulated clients over 4 threads) to another server hosting the service. The tests were performed over 5-minute periods, starting at 100 queries per second and then increasing until service failure.

My results, partially posted here with full data here, showed that the service refused to process queries at a throughput higher than approximately 850 queries per second.

I doubt the issue is related to the code, as the same code used with containers and tested in the same manner proved capable of processing higher throughput (results of my benchmark here).

This benchmark should be reproducible, as I've written scripts to both deploy the unikernel and launch the benchmark, which are available in the project's repository. Do note that I performed my tests on Ubuntu 16.04 using KVM/QEMU (as should be documented in the project repo).

Andreas Åkesson · Answer 1 · Fri May 25 2018 17:29:47 GMT+0800 (China Standard Time)

Congratulations on a nice project :)

I've been using your service to run some of your benchmarks myself (by taking some shortcuts and tweaking some numbers), and the bottleneck seems to be that all the requests are being printed to serial. This is very expensive, since writing a char to the serial port causes a VM exit.

With printing enabled I'm reaching kinda the same values as you (loss around 900 QPS).

With printing disabled I'm still running the benchmark and just passed 2600 QPS without loss.

Here is a nifty snippet that we use in the kernel to enable/disable debugging output.

//#define VERBOSE
#ifdef VERBOSE
#define PRINT(fmt, ...) printf(fmt, ##__VA_ARGS__)
#else
#define PRINT(fmt, ...) /* fmt */
#endif

You can add this to your service and replace all your printf with PRINT in the on_read callback, and see if you can reach a higher result.
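As a minimal sketch of what that replacement could look like, here is the macro applied to a plain function. Note that `handle_query` and `queries_handled` are hypothetical names made up for illustration; they stand in for the body of the on_read callback and are not the IncludeOS API. Also note that `##__VA_ARGS__` is a GNU extension (accepted by GCC and Clang).

```c
#include <stddef.h>
#include <stdio.h>

/* Uncomment the next line to restore per-query logging. */
/* #define VERBOSE */
#ifdef VERBOSE
#define PRINT(fmt, ...) printf(fmt, ##__VA_ARGS__)
#else
#define PRINT(fmt, ...) /* expands to nothing */
#endif

/* Hypothetical counter, only here so the effect is observable. */
static unsigned long queries_handled = 0;

/* Hypothetical stand-in for the body of the on_read callback.
 * This is a plain function with a similar shape, not the real API. */
static size_t handle_query(const char *data, size_t len)
{
    (void)data; /* the payload is not inspected in this sketch */

    /* With VERBOSE undefined this line becomes an empty statement,
     * so nothing is written to the serial port on the hot path. */
    PRINT("Received query of %zu bytes\n", len);

    queries_handled++;
    return len; /* pretend the whole datagram was consumed */
}
```

Because the disabled macro discards its arguments entirely, anything passed to PRINT should be free of side effects that the rest of the code depends on.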

Andreas Åkesson · Answer 2 · Fri May 25 2018 19:33:09 GMT+0800 (China Standard Time)

At about 21 500 QPS the throughput starts to fall off when running emulated with qemu on my Mac, using the virtionet driver.

vmxnet3 looks more promising with a roof somewhere around 24 500.

e1000(e) is a bit different. Throughput seems reliable up to 20 300. Beyond that, the results fall short of the requested QPS, but still exceed 20 300.

Andreas Åkesson · Answer 3 · Fri May 25 2018 21:05:47 GMT+0800 (China Standard Time)

I also ran some tests on a Linux machine (@fwsGonzo can fill in the specs), which is not emulated.

With virtionet I reach around 150 000 QPS.

########## LAUNCHING TEST 4 ##########
Queries per seconds: 140000

Queries sent:        4199998
Queries completed:    4199998
Queries Lost:        0
Queries per second:    139999.821333
Average latency:    0.000043
Minimum latency:    0.000041
Max latency:        0.004520
Latency Std Dev:    0.000059
########## TEST 4 COMPLETE ##########

########## LAUNCHING TEST 5 ##########
Queries per seconds: 150000

Queries sent:        4499996
Queries completed:    4499996
Queries Lost:        0
Queries per second:    149999.801667
Average latency:    0.000044
Minimum latency:    0.000046
Max latency:        0.008042
Latency Std Dev:    0.000055
########## TEST 5 COMPLETE ##########

########## LAUNCHING TEST 6 ##########
Queries per seconds: 160000

Queries sent:        4761272
Queries completed:    4761272
Queries Lost:        0
Queries per second:    140481.150401
Average latency:    0.000139
Minimum latency:    0.000042
Max latency:        16.537947
Latency Std Dev:    0.019043
########## TEST 6 COMPLETE ##########

########## LAUNCHING TEST 7 ##########
Queries per seconds: 170000
Queries sent:        5099993
Queries completed:    5099892
Queries Lost:        101
Queries per second:    169996.315002
Average latency:    0.000324
Minimum latency:    0.000046
Max latency:        8.908081
Latency Std Dev:    0.009910
########## TEST 7 COMPLETE ##########
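One way to read these logs is to look at both the loss and the fraction of the requested rate that was actually achieved. The helpers below are a small sketch of that arithmetic; the function names are made up for illustration and are not part of DNSPerf.

```c
/* Number of queries that never received a reply.
 * Returns 0 if the counters are inconsistent (completed > sent). */
static unsigned long queries_lost(unsigned long sent, unsigned long completed)
{
    return completed > sent ? 0UL : sent - completed;
}

/* Lost queries as a percentage of queries sent (0 if none were sent). */
static double loss_percent(unsigned long sent, unsigned long completed)
{
    if (sent == 0)
        return 0.0;
    return 100.0 * (double)queries_lost(sent, completed) / (double)sent;
}

/* Fraction of the requested rate that was actually measured. */
static double achieved_ratio(double requested_qps, double measured_qps)
{
    return measured_qps / requested_qps;
}
```

Applied to the numbers above, `loss_percent(5099993, 5099892)` for Test 7 comes out at roughly 0.002 %, while `achieved_ratio(160000, 140481.150401)` for Test 6 is roughly 0.88: Test 6 lost nothing but did not reach the requested rate.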

Gaetan Longree · Answer 4 · Sat May 26 2018 22:45:02 GMT+0800 (China Standard Time)

Wow, I would never have guessed the serial output would be the root cause of this. I originally set it up in order to present the functionality to my professors. I probably should have run the test with a version without serial output, one shortcut too many on my part. :/

Thanks for the feedback. I'll be sure to update the report with this new information to reflect that the error came from my lack of knowledge and my benchmarking process rather than from IncludeOS. :)

Out of curiosity, concerning the various QPS results, what do you reckon are the main bottlenecks of these performances? Is it the interaction with the hypervisors (which would explain the increase on bare metal)?

Andreas Åkesson · Answer 5 · Wed May 30 2018 16:00:07 GMT+0800 (China Standard Time)

Hehe well it's not obvious that print is expensive. Please keep us updated with any new results/comparisons you find! :)

Not totally sure what you mean by the various results. If you mean the difference between Test 6 and Test 7 in my post above, I don't actually know.

Gaetan Longree · Answer 6 · Wed May 30 2018 19:45:11 GMT+0800 (China Standard Time)

I was mostly referring to the difference between the emulated and non-emulated runs. I'm assuming that by emulated you mean virtualized, so non-emulated would be bare metal.

So I'm curious why there is such a gap in performance between emulated and non-emulated, and what the root cause is. Furthermore, have you performed any benchmarks on other hypervisors like ESXi and OpenStack, or even on cloud platforms?

Alfred Bratterud · Answer 7 · Wed May 30 2018 20:41:40 GMT+0800 (China Standard Time)

@GaetanLongree in both cases we're talking about virtual machines controlled by QEMU. By "emulated" we mean QEMU running without hardware acceleration such as hardware-supported virtualization, which has to be supported both by the CPU and by the host kernel. On Linux we enable hardware virtualization with the -enable-kvm flag, which allows QEMU to use the KVM kernel module to enable and control hardware virtualization. On Mac, however, KVM is not available, so the default there is to emulate more of the hardware, which is much slower.
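On a Linux host, the KVM module is exposed to user space as the /dev/kvm device node, so a quick sanity check before benchmarking is to see whether the current user can open it. The helper below is only a sketch under that assumption; `kvm_usable` is a hypothetical name, and it checks permissions on the node rather than proving that QEMU will actually use acceleration.

```c
#include <unistd.h>

/* Returns 1 if the given device node exists and is readable and
 * writable by the current user, 0 otherwise. */
static int kvm_usable(const char *device_path)
{
    return access(device_path, R_OK | W_OK) == 0;
}
```

Calling `kvm_usable("/dev/kvm")` returning 0 suggests the benchmark would run in the slower, emulated mode described above.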

Alf-André Walla · Answer 8 · Wed May 30 2018 20:47:40 GMT+0800 (China Standard Time)

There is some good discussion here to explain emulation vs virtualization and why there is a performance gap: https://stackoverflow.com/questions/6044978/full-emulation-vs-full-virtualization
I liked Peter Cordes' explanation.