logfellow / logstash-logback-encoder

Logback JSON encoder and appenders


Initial Sending Delay

ffatghub opened this issue · comments

commented

Hi all,

I am integrating logstash-logback-encoder into an application running in a Kubernetes cluster. I use LogstashTcpSocketAppender to send logs to Logstash successfully. However, I also have to use Envoy (https://github.com/envoyproxy/envoy) as an injected proxy container between the application and the Logstash server. Envoy intercepts and terminates the LogstashTcpSocketAppender TCP connection, then opens another TCP connection towards the Logstash server (which might also be another proxy, but that is a detail).
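For reference, a minimal logback.xml for this kind of setup looks roughly like the sketch below; the destination host and port are placeholders, not values from the actual deployment:

```xml
<configuration>
  <!-- TCP appender from logstash-logback-encoder; events are serialized as JSON -->
  <appender name="LOGSTASH" class="net.logstash.logback.appender.LogstashTcpSocketAppender">
    <!-- Placeholder destination; in the scenario described here this traffic is
         intercepted by the Envoy sidecar before it ever reaches Logstash -->
    <destination>logstash.logging.svc.cluster.local:5044</destination>
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
  </appender>

  <root level="INFO">
    <appender-ref ref="LOGSTASH"/>
  </root>
</configuration>
```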

The first picture here is worth a thousand words; the "services" in it are the application and Logstash:
https://www.solo.io/blog/istios-networking-in-depth/

The basic TCP protocol used by the appender assumes that a log event has been delivered to the destination as soon as the write() on the socket returns successfully; no application-level ACK is required. This is an acceptable compromise in almost all circumstances, but in the context described above it can lead to losing many log events. If all Logstash destinations go down (for any reason), LogstashTcpSocketAppender opens a new connection - which is successfully terminated by the Envoy proxy - and immediately writes all LogEvents available in the RingBuffer to the new socket. Unfortunately, although the TCP connection terminated by the proxy is established, the second TCP connection between the proxy and the real target is not. After a few seconds (typically 5-6) the proxy declares the attempt failed and sends an RST on the incoming TCP connection established by LogstashTcpSocketAppender. Then a new LogEvent arrives… LogstashTcpSocketAppender opens a new connection terminated by the proxy… same result, we lose logs. This goes on and on until a Logstash destination is reachable.
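For context, these are the existing appender settings involved in that sequence; the values shown are illustrative (the library defaults, to the best of my knowledge) and the destination is again a placeholder:

```xml
<appender name="LOGSTASH" class="net.logstash.logback.appender.LogstashTcpSocketAppender">
  <destination>logstash.logging.svc.cluster.local:5044</destination>
  <!-- Size of the in-memory RingBuffer holding events waiting to be written -->
  <ringBufferSize>8192</ringBufferSize>
  <!-- Delay before the appender attempts to reconnect after a connection failure -->
  <reconnectionDelay>30 seconds</reconnectionDelay>
  <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>
```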

I propose adding a small configurable delay, called “Initial Sending Delay” (better name welcome), applied just after the TCP socket is established. During this period the socket would be established but not yet available for sending events.
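Purely as an illustration of the proposal, it could look like the snippet below - note that initialSendDelay is a hypothetical option that does not exist today, and both its name and format are made up for this example:

```xml
<appender name="LOGSTASH" class="net.logstash.logback.appender.LogstashTcpSocketAppender">
  <destination>logstash.logging.svc.cluster.local:5044</destination>
  <!-- Hypothetical option (not implemented): after the socket is established, wait
       this long before writing the first event, giving the proxy time to confirm
       (or reset) its upstream connection to the real destination -->
  <initialSendDelay>10 seconds</initialSendDelay>
  <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>
```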

thank you
ciao

Hi,
Thanks for reporting this.

As you described, LLE indeed has no application-level ACK, which may result in the loss of events in transit, especially when you have additional proxies on the path that are likely to add extra buffering. LLE does not "guarantee" delivery: even without intermediaries between LLE and the sink, it may lose events already buffered in the TCP window...

As far as I understand, LLE establishes a connection with Envoy, which in turn tries to connect to the actual destination. This second attempt fails if the destination is not available, and Envoy immediately terminates the connection with LLE. However, LLE initially thought everything was OK and had already started to send events to Envoy (they are now waiting in Envoy buffers or are in flight in the TCP window). These events are lost when Envoy ultimately closes the connection with LLE. Adding a small delay before sending data on a newly established connection may be a "workaround" in your case - as long as you can find a value that works in the majority of cases. Did I get it right?

A better solution would be to add an application-level ACK to the transport mechanism. We already thought about adding support for Lumberjack/Beats to connect directly to a Logstash instance... This would be very handy for those using the ELK stack. However, it seems that most users log to a file and rely on a separate "scraper" to ship the content to the remote destination. Some, like Logstash, have stronger delivery guarantees than LLE... maybe you should give them a try ;-)
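For completeness, the file-based alternative mentioned here typically looks something like the sketch below - JSON events written to a local file (the paths are placeholders) and picked up by a separate shipper running on the node:

```xml
<appender name="JSON_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
  <!-- Placeholder path; a separate scraper/shipper tails this file -->
  <file>/var/log/app/app.json</file>
  <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
    <fileNamePattern>/var/log/app/app.%d{yyyy-MM-dd}.json</fileNamePattern>
    <maxHistory>7</maxHistory>
  </rollingPolicy>
  <!-- Same JSON layout as the TCP appender, so the downstream pipeline stays the same -->
  <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>
```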

I forgot to mention that you can perhaps reduce the number of lost events by disabling the TCP write buffer - see https://github.com/logfellow/logstash-logback-encoder#write-buffer-size.
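For illustration, disabling the buffer boils down to setting writeBufferSize to 0 on the TCP appender (the destination shown is a placeholder):

```xml
<appender name="LOGSTASH" class="net.logstash.logback.appender.LogstashTcpSocketAppender">
  <destination>logstash.logging.svc.cluster.local:5044</destination>
  <!-- 0 disables the intermediate write buffer: each event is written straight to
       the socket, so fewer events can sit in a local buffer when the connection
       drops (at the cost of more system calls) -->
  <writeBufferSize>0</writeBufferSize>
  <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
</appender>
```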

commented

Hi @brenuart

Did I get it right?

Perfectly - we control our environment, so the timeout would provide a reasonable workaround for corner cases (hopefully the actual destination is available most of the time).

We already thought about adding support for Lumberjack/Beats

that would be nice

maybe you should give them a try ;-)

but direct streaming is better :-)

commented

I forgot to mention that you can perhaps reduce the number of lost events by disabling the TCP write buffer - see https://github.com/logfellow/logstash-logback-encoder#write-buffer-size.

Thanks, it helps, but a delay is still necessary right after the connection is established (because reopenSocket is called by onEvent, so that event will always be lost).

Yeah... as we said earlier, unfortunately a few events are always likely to be lost after reconnecting - there is not much we can do without application-level ACKs :(
I'll have a look at this "initial send delay" feature and come back ASAP with my findings. Stay tuned.

commented

Hi @brenuart, thank you