GELF Benchmarks are slow vs UDP

allcentury opened this issue 9 years ago

Normally I would pair this issue with a pull request, but your requirement to sign a disclosure and waiver is something I can't do. That 'release form' is completely different from the license you've attached to this repo, so I don't know what to believe.

That said, I'm going to list what I would fix if this were a normal open source project, in the hope that you agree and will implement the change.

First, the problem. GELF was slowing down our servers, which seems odd, especially considering GELF uses UDP under the hood, which is a 'fire and forget' protocol. I decided to benchmark plain UDP against GELF's implementation of UDP.

The Benchmark results:

benchmark/ips:

Calculating -------------------------------------
                GELF       263 i/100ms
                 UDP      9753 i/100ms
-------------------------------------------------
                GELF     2696.2 (±4.1%) i/s -      13676 in   5.081721s
                 UDP   115698.4 (±7.2%) i/s -     575427 in   5.006171s

Comparison:
                 UDP:   115698.4 i/s
                GELF:     2696.2 i/s - 42.91x slower

Benchmark.bm

                      user     system      total        real
GELF             16.400000  10.830000  27.230000 ( 39.284368)
UDP               0.300000   0.560000   0.860000 (  0.871986)

Staggering really - UDP can log 90_000-100_000 msgs / second, GELF's implementation can do 2500-3000 msgs / second.

The benchmark script looks like this:

require 'gelf'
require 'benchmark/ips'

# GELF set up
glogger = GELF::Logger.new("localhost", 5352)

# UDP set up
require 'socket'
s = UDPSocket.new
s.connect('127.0.0.1', 1234)

Benchmark.ips do |x|
  x.report("GELF") { glogger.info 1 }
  x.report("UDP")  { s.send("#{1}", 0) }
end

require 'benchmark'

Benchmark.bm(15) do |x|
  x.report("GELF") { 100_000.times { |i| glogger.info i } }
  x.report("UDP")  { 100_000.times { |i| s.send("#{i}", 0) } }
end

You'll need to run a UDP server in a separate process for those to work:

require 'socket'
s = UDPSocket.new
s.bind(nil, 1234)
loop do
  text, sender = s.recvfrom(16)
end

So what's different? Well, it's really a small change. In the GELF gem, we never open the connection until we call the send method with arguments containing host and port.

It looks like this: @socket.send(datagram, 0, host, port). That block of code is here.

When we use send with host & port, a connection must first be established before sending then the packet gets sent and finally the connection closes for EVERY message. Instead, as you can see in my benchmark script, I connect first and then log with just send(arg, 0). Since the connection is already established, the message is fired and forgotten.

Jochen Schalanda · Answer 1 · Tue May 12 2015 21:50:55 GMT+0800 (China Standard Time)

"When we use send with host & port, a connection must first be established before sending then the packet gets sent and finally the connection closes for EVERY message."

UDP is a connectionless protocol (in contrast to TCP, which isn't being used here), so I don't think that this is the reason for the GELF gem being "slow" because there simply isn't any overhead for creating and closing a connection.

Actually it doesn't surprise me at all that sending an almost empty UDP packet is faster than building a full-fledged GELF message from arbitrary parameters (see GELF specification) and sending the resulting message via UDP.

Don't get me wrong, I'm all for optimizing the GELF gem if there's a real bottleneck, but the benchmarks in this issue are skewed and simply measure the wrong things.

Anthony Ross · Answer 2 · Tue May 12 2015 23:07:00 GMT+0800 (China Standard Time)

I thought I had an error in my early benchmarks as well, but to mimic the behavior of the GELF gem in its simplest form, try this.

Run a UDP server:

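require 'socket'
s = UDPSocket.new
s.bind(nil, 1234)
loop do
  # Blocking receive: recvfrom_nonblock would raise IO::WaitReadable
  # as soon as no datagram is waiting, which ends the loop immediately.
  text, sender = s.recvfrom(16)
end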

Here's how GELF sends messages, passing the host and port on every call:

require 'socket'
s = UDPSocket.new
before_time = Time.now
100_000.times do |i|
  s.send("#{i}", 0, 'localhost', 1234)
end
puts "Time elapsed was #{Time.now - before_time}"

Here's the same implementation except we make the connection outside the loop:

require 'socket'
s = UDPSocket.new
before_time = Time.now
s.connect('127.0.0.1', 1234)
100_000.times do |i|
  s.send("#{i}", 0)
end
puts "Time elapsed was #{Time.now - before_time}"

Here's the output of that benchmark:

Using send with a host:
Time elapsed was 28.917275
----------------------------


Connecting before using send
Time elapsed was 1.469474
----------------------------

Here's the gist to run it: https://gist.github.com/08ab203dec55f1ebd1ad

Mark Glenn · Answer 3 · Wed May 13 2015 03:06:21 GMT+0800 (China Standard Time)

I'm seeing performance numbers similar to @allcentury's. Of course, the gem supports multiple addresses, so one UDP socket per address would need to be created to see this performance gain.

I'd be okay with keeping multiple pre-connected UDPSockets alive. Since this is UDP, each one only really keeps around a file descriptor and a struct addrinfo.
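A minimal sketch of what keeping one connected socket per address could look like. The class name ConnectedUdpPool and its methods are hypothetical, not part of the GELF gem, and rotating through the sockets round-robin is an arbitrary choice for illustration.

```ruby
require 'socket'

# Hypothetical helper (not part of the GELF gem): keeps one connected
# UDPSocket per destination so each hostname is resolved once, at connect time.
class ConnectedUdpPool
  # addresses is an array of [host, port] pairs
  def initialize(addresses)
    @sockets = addresses.map do |host, port|
      sock = UDPSocket.new
      sock.connect(host, port) # name resolution happens here, once
      sock
    end
    @next = 0
  end

  # Send one datagram to the next destination in rotation.
  def send_datagram(datagram)
    sock = @sockets[@next]
    @next = (@next + 1) % @sockets.length
    sock.send(datagram, 0)
  end

  def close
    @sockets.each(&:close)
  end
end

# Usage (assumes something is listening on these ports):
#   pool = ConnectedUdpPool.new([['127.0.0.1', 1234], ['127.0.0.1', 1235]])
#   pool.send_datagram('hello')
#   pool.close
```

The cost is one open file descriptor per configured address for the lifetime of the pool.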

It looks like, from the MRI source, that connecting the socket up front causes getaddrinfo(), which is what triggers the DNS lookup, to be called only once. I assume this is what is causing the 50x (for me) speedup. Calling send with the host and port causes getaddrinfo() and freeaddrinfo() to be called on every call.
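To look at the lookup cost in isolation, here is a rough sketch that times repeated address lookups for a name and for a numeric IP. It calls Addrinfo.getaddrinfo directly rather than going through UDPSocket#send, so it only approximates what MRI does internally; the helper name time_lookups, the port 1234, and the iteration count are arbitrary.

```ruby
require 'socket'
require 'benchmark'

# Time n address lookups for the given host string (IPv4, datagram sockets).
def time_lookups(host, port, n)
  Benchmark.realtime do
    n.times { Addrinfo.getaddrinfo(host, port, Socket::AF_INET, Socket::SOCK_DGRAM) }
  end
end

n = 1_000
by_name = time_lookups('localhost', 1234, n)
by_ip   = time_lookups('127.0.0.1', 1234, n)
puts format('localhost: %.4fs  127.0.0.1: %.4fs', by_name, by_ip)
```

The absolute numbers depend heavily on the resolver configuration of the machine, so treat them as a relative comparison only.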

Jochen Schalanda · Answer 4 · Wed May 13 2015 16:14:38 GMT+0800 (China Standard Time)

@allcentury Interesting, I wasn't aware that there's such a performance hit when not explicitly creating a "connection" in a UDP socket. Sorry I came across a bit harsh in my last comment.

@markglenn Thanks for the quick analysis of the UDPSocket source. It makes total sense that this overhead is coming from DNS lookups. Too bad we can't cache that information somewhere and put it into the UDPSocket.send call.

As always, pull requests fixing this issue are welcome.
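One possible way to cache the lookup, sketched under assumptions: resolve the host once into a packed sockaddr and hand that to the three-argument form of send, which takes a destination sockaddr instead of a host and port. The helper name resolve_udp_sockaddr is hypothetical, and the lookup is restricted to IPv4 because UDPSocket.new defaults to AF_INET.

```ruby
require 'socket'

# Resolve host/port once and return a packed sockaddr that can be reused.
# Restricted to IPv4 to match the default address family of UDPSocket.new.
def resolve_udp_sockaddr(host, port)
  Addrinfo.getaddrinfo(host, port, Socket::AF_INET, Socket::SOCK_DGRAM).first.to_sockaddr
end

sock = UDPSocket.new
dest = resolve_udp_sockaddr('localhost', 1234) # the only name lookup
10.times { |i| sock.send(i.to_s, 0, dest) }    # reuses the cached address
sock.close
```

The cached address goes stale if the DNS record changes, so this has the same trade-off as calling connect up front.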

Bernd Ahlers · Answer 5 · Wed May 13 2015 20:03:24 GMT+0800 (China Standard Time)

Executing connect() explicitly also means that you do not get the latest IP address if it changes, right? This might be an issue.

Mark Glenn · Answer 6 · Thu May 14 2015 02:54:15 GMT+0800 (China Standard Time)

@bernd That is correct. I'm not sure the IP of an organization's Graylog server changes often enough to justify a 50x slowdown (again, from my run of the benchmark). Our organization made the IP static, but I know I can't make that same assumption for everyone. Unfortunately, because of UDP's fire and forget method of sending packets, there is no real way of checking this without doing another getaddrinfo() call.

TCP has a similar issue, but I believe the connection is dropped when the IP changes, so it's easy to detect.

I feel like there can be a middle ground here, but I don't see it offhand. Perhaps we can do another DNS lookup every few messages, but that guarantees at least a few lost messages when an IP change does occur. I'm not terribly familiar with the Graylog server, but is there a way to keep a TCP connection alive just for this check?
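The "lookup every few messages" idea could be sketched roughly like this. RefreshingUdpSender and its refresh_every option are hypothetical names, not part of the GELF gem, and the default of 1,000 messages is an arbitrary value.

```ruby
require 'socket'

# Hypothetical sender (not part of the GELF gem): caches the resolved
# address and looks it up again after every `refresh_every` messages.
class RefreshingUdpSender
  attr_reader :lookups

  def initialize(host, port, refresh_every: 1_000)
    @host = host
    @port = port
    @refresh_every = refresh_every
    @socket = UDPSocket.new
    @lookups = 0
    resolve
  end

  def send_datagram(datagram)
    resolve if @sent_since_lookup >= @refresh_every
    @socket.send(datagram, 0, @sockaddr)
    @sent_since_lookup += 1
  end

  private

  # Perform one lookup (IPv4 only) and reset the message counter.
  def resolve
    info = Addrinfo.getaddrinfo(@host, @port, Socket::AF_INET, Socket::SOCK_DGRAM).first
    @sockaddr = info.to_sockaddr
    @sent_since_lookup = 0
    @lookups += 1
  end
end
```

With this approach up to refresh_every messages can still go to the old address after a change, which is the loss described above; a smaller value narrows that window at the cost of more lookups.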

Bernd Ahlers · Answer 7 · Mon May 18 2015 16:43:30 GMT+0800 (China Standard Time)

@markglenn I totally agree that it should be fast by default. Maybe you can add some kind of option that controls the behavior? Otherwise people that use round-robin DNS or something like that will have problems.

The Graylog server has an API endpoint that can be used to check the status: /system/lbstatus
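A rough sketch of polling that endpoint from Ruby, using only the standard library. The helper name graylog_alive? and the host and port in the usage comment are assumptions, and treating any 2xx response as "alive" is also an assumption; check the Graylog API documentation for the exact responses of your version.

```ruby
require 'net/http'
require 'uri'

# Returns true when the status endpoint answers with a 2xx response.
# Connection failures and timeouts are treated as "not alive".
def graylog_alive?(status_url)
  response = Net::HTTP.get_response(URI(status_url))
  response.is_a?(Net::HTTPSuccess)
rescue StandardError
  false
end

# Usage (hypothetical host and port, adjust to your deployment):
#   graylog_alive?('http://graylog.example.com:12900/system/lbstatus')
```

An HTTP round trip like this is far slower than a DNS lookup, so it would only make sense as an occasional background check, not per message.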

Mark Glenn · Answer 8 · Wed May 20 2015 04:24:27 GMT+0800 (China Standard Time)

I vote that this should be closed. If you have a static IP address, you can just use the IP instead of the server name and get results similar to connecting the socket up front. I ran the gist from @allcentury but with '127.0.0.1' instead of 'localhost' and saw the following results:

Using send with a host:
Time elapsed was 0.857663
----------------------------


Connecting before using send
Time elapsed was 0.623184
----------------------------

Maybe a note in the docs to reflect how much slower using a name can be? Any real check to make sure the message is going to the right server would be slower than doing a DNS lookup anyway.

Anthony Ross · Answer 9 · Wed May 20 2015 05:26:11 GMT+0800 (China Standard Time)

Very interesting and a great find @markglenn !