memorysafety / river

This repository is the home of the River reverse proxy application, based on the pingora library from Cloudflare.

Home Page: https://www.memorysafety.org/initiative/reverse-proxy/


Feature Request: Hot Reload

elhackeado opened this issue · comments

Feature Description:

Hot reloading functionality will enable River to dynamically reload its configuration file without requiring a restart of the application or service. This capability improves flexibility, uptime, and ease of maintenance by letting administrators make configuration changes on the fly while the application is still running.

How does Nginx implement it?

In order for nginx to re-read the configuration file, a HUP signal should be sent to the master process. The master process first checks the syntax validity, then tries to apply new configuration, that is, to open log files and new listen sockets. If this fails, it rolls back changes and continues to work with old configuration. If this succeeds, it starts new worker processes, and sends messages to old worker processes requesting them to shut down gracefully. Old worker processes close listen sockets and continue to service old clients. After all clients are serviced, old worker processes are shut down.

Let’s illustrate this by example. Imagine that nginx is run on FreeBSD and the command

ps axw -o pid,ppid,user,%cpu,vsz,wchan,command | egrep '(nginx|PID)'
produces the following output:

  PID  PPID USER    %CPU   VSZ WCHAN  COMMAND
33126     1 root     0.0  1148 pause  nginx: master process /usr/local/nginx/sbin/nginx
33127 33126 nobody   0.0  1380 kqread nginx: worker process (nginx)
33128 33126 nobody   0.0  1364 kqread nginx: worker process (nginx)
33129 33126 nobody   0.0  1364 kqread nginx: worker process (nginx)

If HUP is sent to the master process, the output becomes:

  PID  PPID USER    %CPU   VSZ WCHAN  COMMAND
33126     1 root     0.0  1164 pause  nginx: master process /usr/local/nginx/sbin/nginx
33129 33126 nobody   0.0  1380 kqread nginx: worker process is shutting down (nginx)
33134 33126 nobody   0.0  1368 kqread nginx: worker process (nginx)
33135 33126 nobody   0.0  1368 kqread nginx: worker process (nginx)
33136 33126 nobody   0.0  1368 kqread nginx: worker process (nginx)

One of the old worker processes with PID 33129 still continues to work. After some time it exits:

  PID  PPID USER    %CPU   VSZ WCHAN  COMMAND
33126     1 root     0.0  1164 pause  nginx: master process /usr/local/nginx/sbin/nginx
33134 33126 nobody   0.0  1368 kqread nginx: worker process (nginx)
33135 33126 nobody   0.0  1368 kqread nginx: worker process (nginx)
33136 33126 nobody   0.0  1368 kqread nginx: worker process (nginx)

[SOURCE] https://nginx.org/en/docs/control.html
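The check-then-drain cycle described above can be sketched in Rust. This is purely illustrative (a thread stands in for an old worker process; `graceful_drain` is a name invented for this example, not nginx or River code):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Sketch of nginx's graceful drain: an "old worker" keeps serving
/// until asked to quit, then finishes in-flight work before exiting.
fn graceful_drain() -> usize {
    let shutting_down = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&shutting_down);

    let old_worker = thread::spawn(move || {
        let mut served = 0;
        // Keep serving until the master asks us to shut down gracefully.
        while !flag.load(Ordering::SeqCst) {
            served += 1; // handle one (pretend) request
            thread::sleep(Duration::from_millis(1));
        }
        served + 1 // finish the request already in flight, then exit
    });

    thread::sleep(Duration::from_millis(10)); // new workers are up...
    shutting_down.store(true, Ordering::SeqCst); // ...ask the old one to quit
    old_worker.join().unwrap()
}

fn main() {
    let served = graceful_drain();
    println!("old worker exited after serving {served} requests");
    assert!(served > 0);
}
```

The key property, as in nginx, is that the shutdown request never interrupts work already accepted; the old worker only stops taking *new* work.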

Any limitations with Nginx's approach?

Reloading too frequently can make connections unstable and lose business data.

When NGINX executes the reload command, the old worker process keeps processing its existing connections and exits automatically once all remaining requests are served. However, if the worker goes away while the client still has unprocessed requests, the data for those requests is lost for good, which client-side users will certainly notice.

In some circumstances, recycling the old worker process takes so long that it affects regular business.

For example, when proxying the WebSocket protocol, NGINX cannot tell whether a request has been fully processed because it does not parse the frames. So even though the worker process receives the quit command from the master process, it cannot exit until those connections raise exceptions, time out, or disconnect.

Another example: when NGINX acts as a reverse proxy for TCP and UDP, it has no visibility into whether a connection's traffic is actually finished before the connection is finally shut down.

Therefore, the old worker process usually takes a long time to exit, especially in industries like live streaming, media, and speech recognition. Sometimes the recycling time can reach half an hour or even longer. Meanwhile, if users reload the server frequently, many shutting-down worker processes accumulate and eventually drive NGINX out of memory (OOM), which can seriously affect the business.
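One common mitigation for this lingering-worker problem is to cap the drain with a deadline and force-close whatever remains. A minimal sketch, assuming a hypothetical `drain_with_deadline` helper (not nginx or River behaviour; the long-sleeping thread stands in for an idle WebSocket the proxy cannot introspect):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Wait for in-flight connections to finish, but give up and signal a
/// force-close after `deadline` so old workers cannot linger forever.
/// Returns true if everything drained cleanly before the deadline.
fn drain_with_deadline(deadline: Duration) -> bool {
    let (done_tx, done_rx) = mpsc::channel::<()>();

    // Stand-in for a connection that never completes on its own,
    // e.g. an idle WebSocket whose frames the proxy does not parse.
    thread::spawn(move || {
        thread::sleep(Duration::from_secs(3600));
        let _ = done_tx.send(());
    });

    done_rx.recv_timeout(deadline).is_ok()
}

fn main() {
    if drain_with_deadline(Duration::from_millis(50)) {
        println!("drained cleanly");
    } else {
        println!("deadline hit; force-closing remaining connections");
    }
}
```

The trade-off is explicit: a deadline bounds memory held by shutting-down workers at the cost of occasionally cutting off a long-lived connection.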

APISIX solved this problem in its own way; do check out this article before making any design decisions: https://api7.ai/blog/how-nginx-reload-work

Envoy proxy has implemented hot restart and it is used at scale. See Envoy hot restart from Envoy creator @mattklein123 and the recent documentation.

The author said this is about reloading configuration, not restarting the binary itself; there is a difference between the two.

As a note, pingora already supports hot-reload: https://github.com/cloudflare/pingora/blob/main/docs/user_guide/start_stop.md edit: also https://github.com/cloudflare/pingora/blob/main/docs/user_guide/graceful.md

It is likely River will take a similar path, doing a hot-reload (i.e. stopping the old binary and starting a new one, while maintaining connections).

It's possible this could be implemented in a way that doesn't require starting a new process, but as this is implemented within pingora itself, it's likely River will mimic their implementation 1:1.
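For the in-process variant mentioned above, here is a minimal sketch of what an atomic config swap could look like, assuming a hypothetical `ConfigStore` (the names and fields are invented for this example; this is not pingora's or River's actual API):

```rust
use std::sync::{Arc, RwLock};

#[derive(Debug, Clone)]
struct Config {
    upstream: String,
    max_conns: u32,
}

/// In-process reload sketch: readers take a cheap snapshot; a reload
/// validates the new config and swaps it in atomically, while in-flight
/// requests keep the old snapshot they already hold.
struct ConfigStore {
    current: RwLock<Arc<Config>>,
}

impl ConfigStore {
    fn new(initial: Config) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Snapshot for one request; cloning the Arc is cheap.
    fn get(&self) -> Arc<Config> {
        Arc::clone(&self.current.read().unwrap())
    }

    /// Validate first, then swap — mirroring nginx's check-then-apply:
    /// on validation failure, the old configuration stays in effect.
    fn reload(&self, next: Config) -> Result<(), String> {
        if next.max_conns == 0 {
            return Err("max_conns must be > 0".into());
        }
        *self.current.write().unwrap() = Arc::new(next);
        Ok(())
    }
}

fn main() {
    let store = ConfigStore::new(Config { upstream: "127.0.0.1:8080".into(), max_conns: 100 });
    let in_flight = store.get(); // an old request holds its snapshot
    store
        .reload(Config { upstream: "127.0.0.1:9090".into(), max_conns: 200 })
        .unwrap();
    assert_eq!(in_flight.upstream, "127.0.0.1:8080"); // old view untouched
    assert_eq!(store.get().upstream, "127.0.0.1:9090"); // new requests see the swap
    println!("reload applied without dropping the in-flight snapshot");
}
```

This avoids spawning a new process entirely, but it only covers configuration data; changing listen sockets or TLS material would need more machinery, which is where the pingora-style process handoff earns its keep.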

Putting this in the "Backlog" milestone, as I'm not sure if this will make it into River before the end of April, but it might.

I'll also add other techniques that can be applied to processes such as Docker containers: https://iximiuz.com/en/posts/multiple-containers-same-port-reverse-proxy/

As a note, pingora already supports hot-reload: https://github.com/cloudflare/pingora/blob/main/docs/user_guide/start_stop.md edit: also https://github.com/cloudflare/pingora/blob/main/docs/user_guide/graceful.md
@jamesmunns I believe Pingora's Graceful Upgrade is the way to go. Since Pingora is already battle-tested in production, at this point I would rather rely on Pingora's approach than introduce something new that would need time to mature.

There is also a different Rust-based reverse proxy project with a strong focus on changing configurations without any downtime or lost connections: https://github.com/sozu-proxy/sozu.

Quote from their website (https://github.com/sozu-proxy/sozu):

SŌZU is a HTTP reverse proxy built in Rust, that can handle fine grained configuration changes at runtime without reloads, and designed to never ever stop.

I am not sure how they achieve it exactly but their implementation might be worth looking into when designing this feature.

Noting that this has been scheduled for the milestone that is just starting now; we should have some progress on this in the coming weeks.

This was implemented by #49, please feel free to open an issue if there are any follow-on needs!