Proposal: modify the logger architecture

arschles opened this issue 8 years ago · comments

This issue attempts to holistically address #82 and modify the architecture enough to allow for production-level (10x higher than current, for now) log throughput.

My understanding is that currently, the logging subsystem in Deis is as follows:

logs ----(log tailer)----> fluentd ----(syslog)----> logger ----(http)----> InfluxDB

Issues With Current Infrastructure

Production and load testing has uncovered #82, which we believe is caused by duplicated UDP packets being sent from fluentd to the logger. That duplication is caused by a bug in the Kubernetes networking subsystem; more details are in kubernetes/kubernetes#25793.

We've also tried sending log messages via HTTP 1.1 from FluentD to the Logger. Since HTTP is a TCP based protocol, the duplications go away, but I believe that using HTTP 1.1 itself is not the best choice for our logging subsystem. The protocol is for request/response workloads and logs are streams of infinite length, which suggests that we use a streaming protocol.

Streaming Protocols

Below is a list of discussed (in slack) streaming solutions for fluentd -> logger communication:

1. Home-grown wire format over TCP
2. gRPC streams
3. Websockets
4. An external queue

Here are some pros/cons for each:

1. Home-grown wire format: I'd rather use a tested wire format than invent one. The only pro I can think of is that a custom protocol may be faster than the alternatives. IMO, if we can do another option, we should.
2. gRPC streams: I couldn't get the Ruby code generator for gRPC to work after about 45 minutes.
   - I think gRPC would be the best option if it worked: it defines wire data formats and transport protocols, and even generates client and server code for us. Sad face.
3. Websockets: We'll have to do our own encoding/decoding for the data over the wire, but Fluentd has a websocket plugin, and we can pretty easily add a websocket server to the Logger to consume the log messages (I'm happy to do that).
   - Also, that plugin can send messages in JSON and MsgPack encodings, either of which the (Go) logging server can accommodate.
4. An external queue: We could certainly add a queue to the cluster, and it looks like Fluentd has plugin support for many queues.
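
To make the "our own encoding/decoding" point for the websocket option concrete, here is a minimal Go sketch of the decoding half on the logger side, assuming the JSON encoding. It reads a stream of JSON objects from any `io.Reader`, which stands in for the connection. The `LogMessage` type, its field names, and the `decodeStream` and `decodeString` helpers are all hypothetical; the real payload depends on what the Fluentd plugin emits.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// LogMessage is a hypothetical record shape. The field names are
// assumptions, not the format Fluentd or the logger actually uses.
type LogMessage struct {
	App     string `json:"app"`
	Message string `json:"message"`
}

// decodeStream reads consecutive JSON objects from r and passes each one
// to handle. It returns nil when the stream ends cleanly and an error if
// a message is malformed or truncated.
func decodeStream(r io.Reader, handle func(LogMessage)) error {
	dec := json.NewDecoder(r)
	for {
		var msg LogMessage
		err := dec.Decode(&msg)
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return fmt.Errorf("decoding log message: %w", err)
		}
		handle(msg)
	}
}

// decodeString is a convenience wrapper that collects every message
// found in s. It exists only to make the sketch easy to try out.
func decodeString(s string) ([]LogMessage, error) {
	var out []LogMessage
	err := decodeStream(strings.NewReader(s), func(m LogMessage) {
		out = append(out, m)
	})
	return out, err
}

func main() {
	input := `{"app":"web","message":"started"}
{"app":"web","message":"listening"}`
	msgs, err := decodeString(input)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, m := range msgs {
		fmt.Printf("%s: %s\n", m.App, m.Message)
	}
}
```

Because the decoder works on a plain `io.Reader`, the same function could sit behind a websocket connection, a raw TCP socket, or a test fixture without changes.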

My Preference

To keep things as simple as possible, my preference is to avoid adding more components to Deis clusters to support logging. Also, as I wrote under option 1 in the previous section, I'd rather not invent a new wire protocol. That leaves us with option 3: websockets.

Thoughts?

cc/ @jchauncey

Jonathan Chauncey · Answer 1 · Sat Jun 04 2016 09:14:14 GMT+0800 (China Standard Time)

Here is what I have working right now -

 ┌───────┐             
 │  logs │             
 └─tail──┘             
     │                 
     ▼                 
 ┌───────┐     ┌──────┐
 │fluentd│────▶│influx│
 └syslog─┘     └──────┘
     │                 
     │                 
     ▼                 
 ┌──────┐              
 │logger│              
 └──────┘

Aaron Schlesinger · Answer 2 · Sat Jun 04 2016 09:19:55 GMT+0800 (China Standard Time)

thanks @jchauncey - good to know! I think the pros/cons still apply to this architecture as well

Jonathan Chauncey · Answer 3 · Sat Jun 04 2016 09:20:11 GMT+0800 (China Standard Time)

So I'm fine with any choice, and a quick spike wouldn't hurt. Right now I can overwhelm logger pretty easily, and fluentd has a hard time keeping up with me trying to send hundreds of requests per second to influx (even though influx can handle it just fine). I'd love to see us spike out option 4 just to see if it helps alleviate the fluentd portion of the problem.

So we would end up with this -

       ┌───────┐            
       │  logs │            
       └─tail──┘            
           │                
           ▼                
       ┌───────┐            
       │fluentd│            
       └───────┘            
         amqp               
           │                
           ▼                
       ┌──────┐             
    ┌──│queue │─────────┐   
    │  └──────┘         │   
    │      │            │   
    ▼      └──┐         ▼   
┌──────┐      ▼     ┌──────┐
│worker│  ┌──────┐  │worker│
└──────┘  │worker│  └──────┘
    │     └──────┘      │   
    └────┐    │     ┌───┘   
         ├────┘     ▼       
         ▼      ┌──────┐    
     ┌──────┐   │influx│    
     │logger│   └──────┘    
     └──────┘

The Fluentd plugin would have basically the same logic it has now (filter out everything we don't care about), and then the workers are just extraction + sending to destination. We can easily scale those as replication controllers.
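
A rough Go sketch of the worker stage described above, under several assumptions: a buffered channel stands in for the queue, the `Sink` interface stands in for the logger or InfluxDB, and the `app|message` record format handled by `extract` is invented purely for illustration. None of these names come from the existing code.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// Sink is a hypothetical destination, standing in for the logger or InfluxDB.
type Sink interface {
	Write(app, line string)
}

// memorySink collects lines in memory so the sketch runs without a real backend.
type memorySink struct {
	mu    sync.Mutex
	lines []string
}

func (s *memorySink) Write(app, line string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.lines = append(s.lines, app+": "+line)
}

// Len reports how many lines have been delivered so far.
func (s *memorySink) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.lines)
}

// extract splits a raw "app|message" record. The format is an assumption
// made for this sketch; ok is false when the separator is missing.
func extract(raw string) (app, line string, ok bool) {
	parts := strings.SplitN(raw, "|", 2)
	if len(parts) != 2 {
		return "", "", false
	}
	return parts[0], parts[1], true
}

// runWorkers starts n workers that drain queue, extract each record, and
// forward it to sink. It returns once queue is closed and fully drained.
func runWorkers(n int, queue <-chan string, sink Sink) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for raw := range queue {
				if app, line, ok := extract(raw); ok {
					sink.Write(app, line)
				}
			}
		}()
	}
	wg.Wait()
}

func main() {
	queue := make(chan string, 16)
	for _, raw := range []string{"web|GET /", "web|GET /health", "malformed", "worker|job done"} {
		queue <- raw
	}
	close(queue)

	sink := &memorySink{}
	runWorkers(3, queue, sink)
	fmt.Println("delivered:", sink.Len())
}
```

In this shape the workers hold no state of their own, which is what would let them be scaled out by simply raising the replica count.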

Jonathan Chauncey · Answer 4 · Sat Jun 04 2016 09:21:30 GMT+0800 (China Standard Time)

I haven't personally implemented a websockets infrastructure before so I'm not 100% on how it scales. But it's definitely the easiest out of the 4.

So my two choices are websockets and a queue.

Aaron Schlesinger · Answer 5 · Sat Jun 04 2016 09:39:42 GMT+0800 (China Standard Time)

Makes sense. The websocket infrastructure is point-to-point so it doesn't scale as elastically unless the fluentd side is a bit smarter, and that gives up simplicity.

As for the queue, I'd like to add a simple queue to the mix, which likely means one that is not AMQP-compatible and has no persistence.

To add more detail, I haven't seen or used a simple queueing system that conforms to AMQP. For example, I've never had an easy time deploying RabbitMQ or ActiveMQ, and I'm guessing that deploying either in a container inside k8s is not a solved problem.

As for simpler queues, which speak other protocols, I like both nsq (https://github.com/nsqio/nsq) and gnatsd (https://github.com/nats-io/gnatsd). They're proven and simpler to start, which likely means they're more ready for k8s. Also, both have Ruby bindings and at least nsq has a FluentD plugin.

Thoughts?

Also, what tool do you use for generating the ASCII diagrams?

Jonathan Chauncey · Answer 6 · Sat Jun 04 2016 09:42:42 GMT+0800 (China Standard Time)

Monodraw for the diagrams. Nsq or nats would work. Simple and no persistent data is definitely a good start.

Jonathan Chauncey · Answer 7 · Sat Jun 04 2016 10:28:12 GMT+0800 (China Standard Time)

OK, after researching a bit, I think I can get an nsq spike done this weekend. I'll see if I can get it all working before Monday morning.

Jonathan Chauncey · Answer 8 · Sat Jun 04 2016 10:30:23 GMT+0800 (China Standard Time)

https://gist.github.com/bketelsen/a2d353d0318736c9fa23

Putting this here for future reference

Jonathan Chauncey · Answer 9 · Thu Jun 16 2016 04:47:54 GMT+0800 (China Standard Time)

Closing this as we are moving forward with nsq for now; see #85.