deis / logger

In-memory log buffer used by Deis Workflow.

Home Page:https://deis.com

Proposal: modify the logger architecture

arschles opened this issue · comments

This issue attempts to holistically address #82 and modify the architecture enough to allow for production-level (10x higher than current, for now) log throughput.

My understanding is that currently, the logging subsystem in Deis is as follows:

logs ----(log tailer)----> fluentd ----(syslog)----> logger ----(http)----> InfluxDB

Issues With Current Infrastructure

Production and load testing has uncovered #82, which we believe is caused by duplicated UDP packets being sent from fluentd to the logger. That duplication is caused by a bug in the Kubernetes networking subsystem - more details in kubernetes/kubernetes#25793.

We've also tried sending log messages via HTTP 1.1 from FluentD to the Logger. Since HTTP is a TCP-based protocol, the duplications go away, but I believe that HTTP 1.1 itself is not the best choice for our logging subsystem. The protocol is designed for request/response workloads, whereas logs are streams of unbounded length, which suggests that we use a streaming protocol.
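To make the request/response vs. stream distinction concrete, here is a minimal sketch (in Python for brevity; the logger itself is written in Go) of many log lines flowing over one long-lived connection. The newline delimiter, the function names, and the sample log lines are assumptions for illustration only, not a proposed wire format.

```python
import socket


def send_lines(sock, lines):
    """Write each log line to one long-lived connection, newline-delimited."""
    for line in lines:
        sock.sendall(line.encode("utf-8") + b"\n")


def read_lines(sock):
    """Yield log lines from the stream until the sender closes its side."""
    buf = b""
    while True:
        chunk = sock.recv(4096)
        if not chunk:  # peer closed the connection
            break
        buf += chunk
        # A single recv() may hold several lines, or only part of one.
        while b"\n" in buf:
            line, buf = buf.split(b"\n", 1)
            yield line.decode("utf-8")
    if buf:  # trailing data without a final newline
        yield buf.decode("utf-8")


def demo():
    # socketpair() stands in for a TCP connection between fluentd and the logger.
    sender, receiver = socket.socketpair()
    try:
        send_lines(sender, ["app[web.1]: GET / 200", "app[web.1]: GET /health 200"])
        sender.close()
        return list(read_lines(receiver))
    finally:
        receiver.close()
```

There is no per-message request or response here: the receiver just keeps reading until the connection ends, which is the shape a log stream naturally has.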

Streaming Protocols

Below is a list of discussed (in slack) streaming solutions for fluentd -> logger communication:

  1. Home-grown wire format over TCP
  2. gRPC streams
  3. Websockets
  4. An external queue

Here are some pros/cons for each:

  1. I'd rather use a tested wire format than invent one. The only pro I can think of is that a custom protocol may be faster than the alternatives. IMO, if we can do another option, we should
  2. I couldn't get the Ruby code generator for gRPC to work after about 45 minutes of trying
    • I think gRPC would be the best option if it worked - it defines wire data formats, transport protocols and even generates client and server code for us. Sad face
  3. We'll have to do our own encoding/decoding for the data over the wire, but Fluentd has a websocket plugin and we can pretty easily add a websocket server to the Logger to consume the log message (I'm happy to do that)
    • Also, the above plugin can send messages in JSON and MsgPack encodings, either of which the (Go) logging server can accommodate
  4. We could certainly add a queue to the cluster, and it looks like Fluentd has plugin support for many queues
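For option 3, "our own encoding/decoding" could be as small as the following sketch, assuming JSON payloads. The field names (`app`, `stream`, `log`) and both helper functions are hypothetical; they are not taken from the Fluentd websocket plugin or from the logger. It is written in Python for brevity, and the same shape would apply in the Go logger.

```python
import json

# Hypothetical message schema: these field names are assumptions.
REQUIRED_FIELDS = ("app", "log")


def encode_message(app, log, stream="stdout"):
    """Serialize one log record as a JSON text frame (sender side)."""
    return json.dumps({"app": app, "stream": stream, "log": log})


def decode_message(payload):
    """Parse and validate one JSON frame (logger side)."""
    msg = json.loads(payload)  # raises ValueError on malformed JSON
    if not isinstance(msg, dict):
        raise ValueError("expected a JSON object")
    missing = [f for f in REQUIRED_FIELDS if f not in msg]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return msg
```

Each websocket frame would carry one such payload, so the server never has to find message boundaries itself; it only has to decode and validate.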

My Preference

To keep things as simple as possible, my preference is to avoid adding more components to Deis clusters to support logging. Also, as I wrote in #1 in the previous section, I'd rather not invent a new wire protocol. That leaves us with #3 - websockets.

Thoughts?

cc/ @jchauncey

Here is what I have working right now -

 ┌───────┐             
 │  logs │             
 └─tail──┘             
     │                 
     ▼                 
 ┌───────┐     ┌──────┐
 │fluentd│────▶│influx│
 └syslog─┘     └──────┘
     │                 
     │                 
     ▼                 
 ┌──────┐              
 │logger│              
 └──────┘              

thanks @jchauncey - good to know! I think the pro/cons still apply to this architecture as well

So I'm fine with any choice and a quick spike wouldn't hurt. Right now I can overwhelm logger pretty easily and fluentd has a hard time keeping up with me trying to send hundreds of requests per second to influx (even though influx can handle it just fine). I'd love to see us spike out option 4 just to see if it helps alleviate the fluentd portion of the problem.

So we would end up with this -

       ┌───────┐            
       │  logs │            
       └─tail──┘            
           │                
           ▼                
       ┌───────┐            
       │fluentd│            
       └───────┘            
         amqp               
           │                
           ▼                
       ┌──────┐             
    ┌──│queue │─────────┐   
    │  └──────┘         │   
    │      │            │   
    ▼      └──┐         ▼   
┌──────┐      ▼     ┌──────┐
│worker│  ┌──────┐  │worker│
└──────┘  │worker│  └──────┘
    │     └──────┘      │   
    └────┐    │     ┌───┘   
         ├────┘     ▼       
         ▼      ┌──────┐    
     ┌──────┐   │influx│    
     │logger│   └──────┘    
     └──────┘               

The Fluentd plugin would have basically the same logic it has now (filter out everything we don't care about), and the workers would just do extraction + sending to the destination. We can easily scale those as replication controllers.
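The queue-plus-workers shape can be sketched in a few lines. This is only an illustration under stated assumptions: an in-process `queue.Queue` stands in for the networked queue, threads stand in for worker pods, two Python lists stand in for the logger and influx destinations, and the "app name, space, message" line format is made up.

```python
import queue
import threading


def worker(q, sinks, lock):
    """Pull raw lines off the queue, extract fields, send to every destination."""
    while True:
        raw = q.get()
        try:
            if raw is None:  # sentinel: no more work for this worker
                return
            # Extraction step (assumed format: "<app> <message>").
            app, _, text = raw.partition(" ")
            with lock:
                for sink in sinks:
                    sink.append((app, text))
        finally:
            q.task_done()


def run_pipeline(lines, num_workers=3):
    q = queue.Queue()
    logger_sink, influx_sink = [], []  # stand-ins for the real destinations
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(q, [logger_sink, influx_sink], lock))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for line in lines:  # the producer side, i.e. what fluentd would publish
        q.put(line)
    for _ in threads:  # one sentinel per worker so each one exits
        q.put(None)
    for t in threads:
        t.join()
    return logger_sink, influx_sink
```

Scaling is just a matter of changing `num_workers`, since the producer never needs to know how many consumers exist. Note that ordering across workers is not guaranteed in this sketch.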

I haven't personally implemented a websockets infrastructure before so I'm not 100% on how it scales. But it's definitely the easiest out of the 4.

So my 2 choices are websockets + queue.

Makes sense. The websocket infrastructure is point-to-point so it doesn't scale as elastically unless the fluentd side is a bit smarter, and that gives up simplicity.

As for the queue, I'd like to add a simple queue to the mix, which likely means a non-AMQP compatible one and no persistence.

To add more detail, I haven't seen or used a simple queueing system that conforms to AMQP. For example, I've never had an easy time deploying RabbitMQ or ActiveMQ, and I'm guessing that deploying either in a container inside k8s is not a solved problem.

As for simpler queues, which speak other protocols, I like both nsq (https://github.com/nsqio/nsq) and gnatsd (https://github.com/nats-io/gnatsd). They're proven and simpler to start, which likely means they're more ready for k8s. Also, both have ruby bindings and at least nsq has a FluentD plugin.

Thoughts?

Also, what tool do you use for generating the ASCII diagrams?

Monodraw for the diagrams. Nsq or nats would work. Simple and no persistent data is definitely a good start.

K after researching a bit I think I can get an nsq spike done this weekend. I'll see if I can get it all working before Monday morning.

Closing this as we are moving forward with nsq for now; see #85.