`restart-all` over HTTP API can time out if there is a high worker count
scommisso opened this issue
`restart-all` should have a higher/configurable timeout.
cservice: Running remote command: http://localhost:11987/cli?cmd=restart%20all&accessKey={ACCESS_KEY}
{ [Error: socket hang up] code: 'ECONNRESET' }
cservice: Startup failed, exiting...
cservice: { [Error: socket hang up] code: 'ECONNRESET' }
Other restart models might be nice as well, for example:
- Spawn replacement workers
- Have existing workers stop listening
- Wait for existing workers to complete
- Kill existing workers after a configurable timeout
This would allow `restart-all` to return almost immediately with new active workers, without waiting for all old workers to be killed.
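The four steps above can be sketched roughly as follows. This is only an illustration of the proposed model, not cluster-service code: `createWorker`, `retire`, and `restartAll` are hypothetical names, and the worker object is a plain in-memory stand-in for a real process.

```javascript
// Hypothetical stand-in for a worker process (not the cluster-service API).
function createWorker(id, drainMs) {
  return {
    id,
    listening: true,
    alive: true,
    stopListening() { this.listening = false; },
    // Resolves once in-flight work has finished (simulated with a timer).
    drain() { return new Promise((resolve) => setTimeout(resolve, drainMs)); },
    kill() { this.alive = false; },
  };
}

// Stop accepting work, wait for the worker to finish, and kill it
// once it has drained or the timeout expires, whichever comes first.
async function retire(worker, killTimeoutMs) {
  worker.stopListening();
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(resolve, killTimeoutMs);
  });
  await Promise.race([worker.drain(), timeout]);
  clearTimeout(timer);
  worker.kill();
}

// Spawn replacements first, then retire the old workers in the background.
// The caller gets the new workers right away; `retired` is a promise it
// may await or ignore.
function restartAll(oldWorkers, spawn, killTimeoutMs) {
  const replacements = oldWorkers.map(() => spawn());
  const retired = Promise.all(oldWorkers.map((w) => retire(w, killTimeoutMs)));
  return { replacements, retired };
}
```

With this shape, the command handler could respond as soon as `replacements` are ready and leave `retired` to settle on its own, so the HTTP request would not need to stay open while old workers die.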
@DullReferenceException unless I'm misunderstanding your feedback, this is how it already works. New workers are spun up first and verified "ready", and the replaced workers then exit gracefully. If you start with 3 workers, you'll never have fewer than 3 (active/ready) workers running during a restart.
@scommisso per-command timeouts are configurable, e.g. `cservice restart all 120000` for a 2-minute timeout. Or are you looking to change the defaults?
I think the issue is that workers are restarted one at a time, so with a large worker count and workers that take a long time to die, the entire operation easily times out. For example, in our application, a restart takes over 30 minutes. If restarts took place in parallel, and the operation didn't block waiting for them, the timeout issue could go away.
The parallelism for suicides can be determined as some function of the worker count, something like `Math.max(1, Math.floor(Math.log(workerCount)))`.
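To get a feel for how that expression scales, here is a small sketch that wraps it in a function. The name `restartParallelism` is made up for illustration. Note that `Math.log` is the natural logarithm.

```javascript
// Hypothetical helper wrapping the formula suggested above.
// Math.log is the natural log, so parallelism grows slowly with worker count.
function restartParallelism(workerCount) {
  return Math.max(1, Math.floor(Math.log(workerCount)));
}

// Print the parallelism for a few worker counts.
for (const count of [1, 2, 3, 8, 21, 100]) {
  console.log(`${count} workers -> ${restartParallelism(count)} at a time`);
}
```

Under this formula, anything up to 2 workers restarts one at a time, and even 100 workers restart only 4 at a time, so it stays conservative for large pools.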
I understand now. Good addition. It should go in the `v2` branch.
@scommisso @DullReferenceException -- Before I publish, please verify acceptance of the attached changesets. A new option, `restartConcurrencyRatio`, was added to enable concurrency with safe defaults (`0.33`).
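As a rough illustration of what a ratio-based setting could mean in practice, the sketch below assumes the ratio is multiplied by the worker count and rounded down, with a floor of one worker. That mapping is an assumption, not taken from the changesets, and `restartBatchSize` and `toBatches` are hypothetical helpers.

```javascript
// ASSUMPTION: batch size = floor(workerCount * ratio), never below 1.
// The real option may be interpreted differently; check the changesets.
function restartBatchSize(workerCount, ratio = 0.33) {
  return Math.max(1, Math.floor(workerCount * ratio));
}

// Split a list of workers into consecutive batches of the given size.
function toBatches(workers, size) {
  const batches = [];
  for (let i = 0; i < workers.length; i += size) {
    batches.push(workers.slice(i, i + size));
  }
  return batches;
}
```

Under that assumption, 3 workers would still restart one at a time, while 10 workers would restart in batches of 3.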
Published 2.0.0-alpha4