godaddy / node-cluster-service

Turn your single process code into a fault-resilient, multi-process service with built-in REST & CLI support. Restart or hot upgrade your web servers with zero downtime or impact to clients.

Home Page: https://www.npmjs.org/package/cluster-service

`restart-all` over HTTP API can time out if there is a high worker count

scommisso opened this issue:

restart-all should have a higher/configurable timeout.

cservice:  Running remote command: http://localhost:11987/cli?cmd=restart%20all&accessKey={ACCESS_KEY}
{ [Error: socket hang up] code: 'ECONNRESET' }
cservice:  Startup failed, exiting...
cservice:  { [Error: socket hang up] code: 'ECONNRESET' }

Other restart models might be nice as well, for example:

  1. Spawn replacement workers
  2. Have existing workers stop listening
  3. Wait for existing workers to complete
  4. Kill existing workers after a configurable timeout
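The four steps above can be sketched roughly as follows. This is only an illustration of the proposed model, not cluster-service code: `replaceWorkers` and every function on `hooks` (`spawn`, `stopListening`, `waitForExit`, `kill`) are hypothetical, caller-supplied names.

```javascript
// Sketch of the proposed restart model. All hooks are hypothetical,
// caller-supplied functions; none of them is a cluster-service API.
async function replaceWorkers(oldWorkers, hooks, killTimeoutMs) {
  // 1. Spawn one replacement per existing worker and wait until all are ready.
  const fresh = await Promise.all(oldWorkers.map(() => hooks.spawn()));

  // Steps 2-4 run in the background so the caller is not blocked on them.
  const retired = Promise.all(oldWorkers.map(async (worker) => {
    // 2. Have the existing worker stop accepting new connections.
    hooks.stopListening(worker);

    // 3. Wait for the worker to finish, but no longer than killTimeoutMs.
    let timer;
    const timeout = new Promise((resolve) => {
      timer = setTimeout(() => resolve("timeout"), killTimeoutMs);
    });
    const exited = hooks.waitForExit(worker).then(() => "exited");
    const outcome = await Promise.race([exited, timeout]);
    clearTimeout(timer);

    // 4. Kill the worker if it did not exit in time.
    if (outcome === "timeout") hooks.kill(worker);
    return outcome;
  }));

  // `fresh` is available right away; `retired` settles once old workers are gone.
  return { fresh, retired };
}
```

The caller can use `fresh` as soon as the replacements are ready and await `retired` only if it needs to know how each old worker ended.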

This would allow restart-all to return almost immediately with new active workers without waiting for all old workers to be killed.

@DullReferenceException unless I'm misreading your feedback, this is how it already works. New workers are spun up first and verified "ready", and only then do the replaced workers exit gracefully. If you start with 3 workers, you'll never have fewer than 3 active, ready workers running during a restart.

@scommisso per-command timeouts are configurable, e.g. `cservice restart all 120000` for a 2-minute timeout. Or are you looking to change the defaults?

I think the issue is that workers are restarted one at a time, so with a large worker count and workers that take a long time to die, the operation as a whole easily times out. For example, in our application, a restart takes over 30 minutes. If restarts took place in parallel, and the operation didn't block waiting for them, the timeout issue could go away.

The parallelism for suicides can be determined as some function of the worker count. Something like `Math.max(1, Math.floor(Math.log(workerCount)))`.
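To get a feel for what that formula yields, here is a small sketch. The helper name `restartParallelism` is made up for illustration; note that `Math.log` is the natural logarithm, so the value grows slowly with the worker count.

```javascript
// Hypothetical helper wrapping the formula suggested above.
// Math.log is the natural logarithm (base e).
function restartParallelism(workerCount) {
  return Math.max(1, Math.floor(Math.log(workerCount)));
}

for (const n of [1, 2, 8, 32, 128]) {
  console.log(n + " workers -> restart " + restartParallelism(n) + " at a time");
}
// 1 workers -> restart 1 at a time
// 2 workers -> restart 1 at a time
// 8 workers -> restart 2 at a time
// 32 workers -> restart 3 at a time
// 128 workers -> restart 4 at a time
```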

I understand now. Good addition. It should go in the v2 branch.

@scommisso @DullReferenceException: before I publish, please verify acceptance of the attached changesets. A new option, `restartConcurrencyRatio`, was added to enable concurrency with a safe default (0.33).
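For illustration only, here is one way a ratio like this could drive batched restarts. The thread does not say how `restartConcurrencyRatio` is actually applied, so the mapping below (fraction of the worker count, rounded down, minimum one) is an assumption, and `batchSize`, `restartInBatches`, and `restartOne` are hypothetical names.

```javascript
// Assumption: the ratio is the fraction of workers restarted concurrently,
// rounded down, never less than one. The real cluster-service logic may differ.
function batchSize(workerCount, ratio) {
  return Math.max(1, Math.floor(workerCount * ratio));
}

// Restart workers batch by batch. `restartOne` is a caller-supplied async
// function (hypothetical) that resolves once a single worker has been replaced.
async function restartInBatches(workerIds, ratio, restartOne) {
  const size = batchSize(workerIds.length, ratio);
  const batches = [];
  for (let i = 0; i < workerIds.length; i += size) {
    const batch = workerIds.slice(i, i + size);
    // Workers within a batch restart in parallel; batches run in sequence.
    await Promise.all(batch.map(restartOne));
    batches.push(batch);
  }
  return batches;
}
```

Under this assumption, a ratio of 0.33 restarts 3 workers at a time for a 10-worker pool and still restarts one at a time for a 3-worker pool.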

Published 2.0.0-alpha4