deis / nsq

A Kubernetes-based Docker image for running nsqd

Add a chart for nsqlookupd

arschles opened this issue

nsqlookupd will be important for us if we want to add more nsqd instances - otherwise known as horizontal scaling.

See http://nsq.io/components/nsqlookupd.html for documentation on nsqlookupd

Notes on horizontally scaling our queueing subsystem

Note that, as of this writing, I have been able to run 4000 concurrent instances of this command:

```
boom -n 1000 -c 500 ...
```

In English, that test produced 4,000,000 total log messages, with up to 2,000,000 of them concurrently entering the queue. A single nsqd pod was able to handle that specific workload.
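Those totals are just the boom parameters multiplied out across the concurrent instances; a quick sanity check:

```shell
# Sanity check on the load-test totals quoted above.
instances=4000                # concurrent boom processes
requests_per_instance=1000    # boom -n
concurrency_per_instance=500  # boom -c

echo "total messages: $(( instances * requests_per_instance ))"      # total messages: 4000000
echo "peak concurrent: $(( instances * concurrency_per_instance ))"  # peak concurrent: 2000000
```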

This test was run on a Google Container Engine cluster of 10 n1-standard-2 (2 vCPUs, 7.5 GB memory) nodes.

When I ran 5000 concurrent instances of the above boom command, nsqd crashed.

So it is entirely possible to run nsqlookupd within this image. I had that working at first, but had the following concerns.

  1. You will need to dynamically look up nsqlookupd pods within the boot script for nsqd. This means there is a dependency on start order (nsqlookupd must start first). That's fine, since we can just exit(1) within the boot script if we don't find any nsqlookupd IPs/pods.
  2. You will need to use nsqlookupd's pod IP, not a service IP.
  3. I'm not sure what we will do about scaling the nsqlookupd pods (will we ever need to?). If you do scale nsqlookupd from 1 to N, you will probably want to tell nsqd about that change too. You will definitely want to tell logger/fluentd...
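Point 1 could look something like the sketch below. Everything here is an assumption, not what the chart actually does: the service name `deis-nsqlookupd`, discovery via a headless Service resolved with `getent`, and the flag-building helper. nsqd's `--lookupd-tcp-address` flag can be passed multiple times, once per nsqlookupd instance.

```shell
#!/usr/bin/env bash
# Hypothetical boot script for nsqd. Assumes a headless Service named
# "deis-nsqlookupd" whose DNS name resolves to every nsqlookupd pod IP.
set -eo pipefail

# Turn a newline-separated list of IPs into repeated nsqd flags
# (nsqd accepts --lookupd-tcp-address multiple times).
build_lookupd_flags() {
  local flags="" ip
  while read -r ip; do
    if [ -n "$ip" ]; then
      flags="$flags --lookupd-tcp-address=$ip:4160"
    fi
  done <<< "$1"
  echo "${flags# }"
}

start_nsqd() {
  local ips
  ips=$(getent hosts deis-nsqlookupd | awk '{print $1}' || true)
  if [ -z "$ips" ]; then
    # nsqlookupd hasn't started yet; exit(1) and let kubernetes restart us.
    echo "no nsqlookupd pods found, exiting" >&2
    exit 1
  fi
  exec nsqd $(build_lookupd_flags "$ips")
}

# Guarded so the helper can be sourced and tested without starting nsqd.
if [ "${START_NSQD:-0}" = "1" ]; then
  start_nsqd
fi
```

The exit(1) path gives exactly the start-order behavior described in point 1: if no nsqlookupd addresses resolve yet, the pod dies and kubernetes retries it.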

@jchauncey thanks for those points. Comments inline

  1. I think exit(1) is OK. We're currently doing that in a lot of other places, like the builder, controller, and logger when the controller, database, and nsqd (respectively) are not available, and just letting kubernetes restart the pod as necessary.
  2. Why the pod IP?
  3. Good point. I was thinking that this issue would be scoped to just a single nsqlookupd. It still represents a single point of failure, but it would be a good start since a failed lookup node wouldn't stop logs from going into or out of nsqd. For failover, we would certainly need to build some more intelligence into the system to route discovery traffic around failed nodes. Thoughts on ignoring high availability for nsqlookupd nodes in this issue?

So I originally had nsqlookupd and nsqd working, but decided that until I could figure out how to scale nsqlookupd, we might as well just have 1 nsqd instance.

Maybe we don't need to use the pod IP. The reason I said that is that nsqd needs to know where it can find all nsqlookupd instances. But if you used the service IP here, it would just round-robin between those pods, which might work well.
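If pod IPs do turn out to be necessary, a headless Service is one way to get them without hard-coding anything: DNS for the service name returns one A record per ready pod instead of a single round-robinning virtual IP. A minimal sketch, where the name and label are assumptions rather than what the chart actually uses:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: deis-nsqlookupd
spec:
  clusterIP: None        # headless: DNS returns each pod IP, no virtual IP
  selector:
    app: deis-nsqlookupd
  ports:
  - name: tcp
    port: 4160           # nsqd -> nsqlookupd registration
  - name: http
    port: 4161           # consumers query this for topic/channel discovery
```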

Closing this for the same reasons as #4 (comment).