`restart-all` over HTTP API can time out if there is a high worker count

Question

scommisso opened this issue 9 years ago

`restart-all` should have a higher or configurable timeout. With a high worker count, the remote command fails with a socket hang-up:

cservice:  Running remote command: http://localhost:11987/cli?cmd=restart%20all&accessKey={ACCESS_KEY}
{ [Error: socket hang up] code: 'ECONNRESET' }
cservice:  Startup failed, exiting...
cservice:  { [Error: socket hang up] code: 'ECONNRESET' }

Jacob Page · Answer 1 · Wed Oct 14 2015 05:00:43 GMT+0800 (China Standard Time)

Other restart models might be nice as well, for example:

1. Spawn replacement workers
2. Have existing workers stop listening
3. Wait for existing workers to complete
4. Kill existing workers after a configurable timeout

This would allow restart-all to return almost immediately with new active workers without waiting for all old workers to be killed.
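
The restart model listed above could be sketched roughly as follows. This is only an illustration, not cluster-service code: `replaceWorker`, `clusterLike`, and `killTimeoutMs` are hypothetical names, and the worker objects are assumed to follow the shape of Node's built-in `cluster` workers (a `'listening'` event, an `'exit'` event, `disconnect()`, and `kill()`).

```javascript
// Sketch only: replace one worker without blocking on the old worker's shutdown.
// ASSUMPTION: `clusterLike.fork()` returns an EventEmitter-style worker, as
// Node's built-in `cluster` module does. Nothing here is cluster-service's API.
function replaceWorker(clusterLike, oldWorker, killTimeoutMs) {
  // 1. Spawn the replacement first, so capacity never drops.
  const replacement = clusterLike.fork();

  replacement.once('listening', () => {
    // 2. The old worker stops accepting new connections.
    oldWorker.disconnect();

    // 4. Force-kill the old worker if it has not exited within the timeout.
    const timer = setTimeout(() => oldWorker.kill(), killTimeoutMs);

    // 3. If the old worker finishes on its own, cancel the forced kill.
    oldWorker.once('exit', () => clearTimeout(timer));
  });

  // The caller gets the new worker back immediately.
  return replacement;
}
```

Because the function returns as soon as the replacement is forked, a caller could start many replacements without waiting for slow old workers to die.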

Aaron Silvas · Answer 2 · Wed Oct 14 2015 08:45:33 GMT+0800 (China Standard Time)

@DullReferenceException unless I'm misunderstanding your feedback, this is how it already works. New workers are spun up first and verified "ready", and the replaced workers then exit gracefully. If you start with 3 workers, you'll never have fewer than 3 active/ready workers running during a restart.

@scommisso per-command timeouts are configurable, e.g. `cservice restart all 120000` for a 2-minute timeout. Or are you looking to change the defaults?
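
For anyone calling the HTTP endpoint directly, the same command string would presumably be URL-encoded into the `cmd` parameter seen in the log above. The helper below is a hypothetical sketch, not part of cluster-service, and it assumes the HTTP API accepts the trailing timeout argument the same way the CLI does.

```javascript
// Hypothetical helper: build the remote-command URL shown in the error log.
// ASSUMPTION: the HTTP API parses "restart all 120000" like the CLI does.
function buildCliUrl(command, accessKey, host = 'localhost', port = 11987) {
  const cmd = encodeURIComponent(command);   // spaces become %20
  const key = encodeURIComponent(accessKey);
  return `http://${host}:${port}/cli?cmd=${cmd}&accessKey=${key}`;
}

// Usage: a restart with a 2-minute (120000 ms) timeout.
console.log(buildCliUrl('restart all 120000', 'ACCESS_KEY'));
```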

Jacob Page · Answer 3 · Wed Oct 14 2015 10:14:04 GMT+0800 (China Standard Time)

I think the issue is that workers are restarted one at a time, so with a large worker count and workers that take a long time to die, the entire operation easily times out. For example, in our application, a restart takes over 30 minutes. If restarts took place in parallel, and the operation didn't block waiting for them, the timeout issue could go away.
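
A back-of-the-envelope estimate shows why one-at-a-time restarts add up. The function and the input numbers below are made up for illustration (the thread only states that a full restart takes over 30 minutes), and the model assumes every worker takes the same time to be replaced.

```javascript
// Rough estimate of total restart time when workers are replaced in batches.
// ASSUMPTION: every worker takes `perWorkerMs` and batches run back to back.
function estimateRestartMs(workerCount, perWorkerMs, concurrency) {
  const batches = Math.ceil(workerCount / concurrency);
  return batches * perWorkerMs;
}

// Illustrative inputs only: 60 workers, 30 seconds each.
console.log(estimateRestartMs(60, 30000, 1) / 60000); // 30 (minutes, sequential)
console.log(estimateRestartMs(60, 30000, 4) / 60000); // 7.5 (minutes, 4 at a time)
```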

Stephen Commisso · Answer 4 · Wed Oct 14 2015 23:17:16 GMT+0800 (China Standard Time)

The parallelism for suicides can be determined as some function of the worker count, something like `Math.max(1, Math.floor(Math.log(workerCount)))`.
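
To get a feel for how that expression scales, here it is wrapped in a function (the name `restartParallelism` is just for illustration). Note that `Math.log` is the natural logarithm, so the result grows very slowly with the worker count.

```javascript
// The formula suggested above, wrapped in a (hypothetically named) function.
function restartParallelism(workerCount) {
  return Math.max(1, Math.floor(Math.log(workerCount)));
}

console.log(restartParallelism(1));   // 1
console.log(restartParallelism(8));   // 2
console.log(restartParallelism(100)); // 4
```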

Aaron Silvas · Answer 5 · Wed Oct 14 2015 23:20:08 GMT+0800 (China Standard Time)

I understand now. Good addition. It should go in the v2 branch.

Aaron Silvas · Answer 6 · Tue Nov 17 2015 00:41:06 GMT+0800 (China Standard Time)

@scommisso @DullReferenceException, before I publish, please review and confirm the attached changesets. A new option, `restartConcurrencyRatio`, was added to enable concurrent restarts with a safe default (0.33).
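
The thread does not say how `restartConcurrencyRatio` is turned into a number of simultaneous restarts. One plausible reading, sketched below purely as an assumption, is "restart at most this fraction of the workers at once, but always at least one". Both function names are hypothetical and this is not taken from the changesets.

```javascript
// ASSUMPTION: not cluster-service's actual logic, just one way a ratio
// could map to a batch size. Always restarts at least one worker at a time.
function concurrencyFromRatio(workerCount, ratio = 0.33) {
  return Math.max(1, Math.floor(workerCount * ratio));
}

// Split a list of worker ids into restart batches of the given size.
function toBatches(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const workerIds = [1, 2, 3, 4, 5, 6, 7];
const size = concurrencyFromRatio(workerIds.length); // floor(7 * 0.33) = 2
console.log(toBatches(workerIds, size)); // [ [ 1, 2 ], [ 3, 4 ], [ 5, 6 ], [ 7 ] ]
```

Under this reading, a ratio of 0.33 keeps roughly two thirds of the old workers untouched while each batch is being replaced.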

Aaron Silvas · Answer 7 · Thu Nov 19 2015 00:15:54 GMT+0800 (China Standard Time)

Published 2.0.0-alpha4