memorysafety / river

This repository is the home of the River reverse proxy application, based on the pingora library from Cloudflare.

Home Page: https://www.memorysafety.org/initiative/reverse-proxy/


Feature Request: Hot Reload

elhackeado opened this issue · comments

Feature Description:

Hot reloading functionality will enable River to dynamically reload its configuration file without requiring a restart of the application or service. This capability improves flexibility, uptime, and ease of maintenance by letting administrators make configuration changes on the fly while the application is still running.

How does Nginx implement it?

In order for nginx to re-read the configuration file, a HUP signal should be sent to the master process. The master process first checks the syntax validity, then tries to apply new configuration, that is, to open log files and new listen sockets. If this fails, it rolls back changes and continues to work with old configuration. If this succeeds, it starts new worker processes, and sends messages to old worker processes requesting them to shut down gracefully. Old worker processes close listen sockets and continue to service old clients. After all clients are serviced, old worker processes are shut down.

Let’s illustrate this by example. Imagine that nginx is run on FreeBSD and the command

ps axw -o pid,ppid,user,%cpu,vsz,wchan,command | egrep '(nginx|PID)'
produces the following output:

  PID  PPID USER    %CPU   VSZ WCHAN  COMMAND
33126     1 root     0.0  1148 pause  nginx: master process /usr/local/nginx/sbin/nginx
33127 33126 nobody   0.0  1380 kqread nginx: worker process (nginx)
33128 33126 nobody   0.0  1364 kqread nginx: worker process (nginx)
33129 33126 nobody   0.0  1364 kqread nginx: worker process (nginx)

If HUP is sent to the master process, the output becomes:

  PID  PPID USER    %CPU   VSZ WCHAN  COMMAND
33126     1 root     0.0  1164 pause  nginx: master process /usr/local/nginx/sbin/nginx
33129 33126 nobody   0.0  1380 kqread nginx: worker process is shutting down (nginx)
33134 33126 nobody   0.0  1368 kqread nginx: worker process (nginx)
33135 33126 nobody   0.0  1368 kqread nginx: worker process (nginx)
33136 33126 nobody   0.0  1368 kqread nginx: worker process (nginx)

One of the old worker processes with PID 33129 still continues to work. After some time it exits:

  PID  PPID USER    %CPU   VSZ WCHAN  COMMAND
33126     1 root     0.0  1164 pause  nginx: master process /usr/local/nginx/sbin/nginx
33134 33126 nobody   0.0  1368 kqread nginx: worker process (nginx)
33135 33126 nobody   0.0  1368 kqread nginx: worker process (nginx)
33136 33126 nobody   0.0  1368 kqread nginx: worker process (nginx)

[SOURCE] https://nginx.org/en/docs/control.html
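The check-then-drain cycle described above can be sketched in Rust. This is purely illustrative (a thread stands in for an old worker process; `graceful_drain` is a name invented for this example, not nginx or River code):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Sketch of nginx's graceful drain: an "old worker" keeps serving
/// until asked to quit, then finishes in-flight work before exiting.
fn graceful_drain() -> usize {
    let shutting_down = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&shutting_down);

    let old_worker = thread::spawn(move || {
        let mut served = 0;
        // Keep serving until the master asks us to shut down gracefully.
        while !flag.load(Ordering::SeqCst) {
            served += 1; // handle one (pretend) request
            thread::sleep(Duration::from_millis(1));
        }
        served + 1 // finish the request already in flight, then exit
    });

    thread::sleep(Duration::from_millis(10)); // new workers are up...
    shutting_down.store(true, Ordering::SeqCst); // ...ask the old one to quit
    old_worker.join().unwrap()
}

fn main() {
    let served = graceful_drain();
    println!("old worker exited after serving {served} requests");
    assert!(served > 0);
}
```

The key property, as in nginx, is that the shutdown request never interrupts work already accepted; the old worker only stops taking *new* work.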

Any limitations with Nginx's approach?

Reloading too frequently can make connections unstable and lose business data.

When NGINX executes the reload command, the old worker process keeps processing its existing connections and exits automatically once all remaining requests are served. However, if the worker goes away while the client still has unprocessed requests, the data for those requests is lost for good, which client-side users will certainly notice.

In some circumstances, recycling the old worker process takes so long that it affects regular business.

For example, when proxying the WebSocket protocol, NGINX cannot tell whether a request has been fully processed because it does not parse the frames. So even though the worker process receives the quit command from the master process, it cannot exit until those connections raise exceptions, time out, or disconnect.

Another example: when NGINX acts as a reverse proxy for TCP and UDP, it has no visibility into whether a connection's traffic is actually finished before the connection is finally shut down.

Therefore, the old worker process usually takes a long time to exit, especially in industries like live streaming, media, and speech recognition. Sometimes the recycling time can reach half an hour or even longer. Meanwhile, if users reload the server frequently, many shutting-down worker processes accumulate and eventually drive NGINX out of memory (OOM), which can seriously affect the business.
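One common mitigation for this lingering-worker problem is to cap the drain with a deadline and force-close whatever remains. A minimal sketch, assuming a hypothetical `drain_with_deadline` helper (not nginx or River behaviour; the long-sleeping thread stands in for an idle WebSocket the proxy cannot introspect):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Wait for in-flight connections to finish, but give up and signal a
/// force-close after `deadline` so old workers cannot linger forever.
/// Returns true if everything drained cleanly before the deadline.
fn drain_with_deadline(deadline: Duration) -> bool {
    let (done_tx, done_rx) = mpsc::channel::<()>();

    // Stand-in for a connection that never completes on its own,
    // e.g. an idle WebSocket whose frames the proxy does not parse.
    thread::spawn(move || {
        thread::sleep(Duration::from_secs(3600));
        let _ = done_tx.send(());
    });

    done_rx.recv_timeout(deadline).is_ok()
}

fn main() {
    if drain_with_deadline(Duration::from_millis(50)) {
        println!("drained cleanly");
    } else {
        println!("deadline hit; force-closing remaining connections");
    }
}
```

The trade-off is explicit: a deadline bounds memory held by shutting-down workers at the cost of occasionally cutting off a long-lived connection.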

APISIX solved this problem in its own way; do check out this article before making any design decisions: https://api7.ai/blog/how-nginx-reload-work

Envoy proxy has implemented hot restart and it is used at scale. See Envoy hot restart from Envoy creator @mattklein123 and the recent documentation.

The author said this is about reloading configuration, not restarting the binary itself; there is a difference between the two.

As a note, pingora already supports hot-reload: https://github.com/cloudflare/pingora/blob/main/docs/user_guide/start_stop.md edit: also https://github.com/cloudflare/pingora/blob/main/docs/user_guide/graceful.md

It is likely River will take a similar path, doing a hot-reload (i.e. stopping the old binary and starting a new one, while maintaining connections).

It's possible this could be implemented in a way that doesn't require starting a new process, but as this is implemented within pingora itself, it's likely River will mimic their implementation 1:1.
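For the in-process variant mentioned above, here is a minimal sketch of what an atomic config swap could look like, assuming a hypothetical `ConfigStore` (the names and fields are invented for this example; this is not pingora's or River's actual API):

```rust
use std::sync::{Arc, RwLock};

#[derive(Debug, Clone)]
struct Config {
    upstream: String,
    max_conns: u32,
}

/// In-process reload sketch: readers take a cheap snapshot; a reload
/// validates the new config and swaps it in atomically, while in-flight
/// requests keep the old snapshot they already hold.
struct ConfigStore {
    current: RwLock<Arc<Config>>,
}

impl ConfigStore {
    fn new(initial: Config) -> Self {
        Self { current: RwLock::new(Arc::new(initial)) }
    }

    /// Snapshot for one request; cloning the Arc is cheap.
    fn get(&self) -> Arc<Config> {
        Arc::clone(&self.current.read().unwrap())
    }

    /// Validate first, then swap — mirroring nginx's check-then-apply:
    /// on validation failure, the old configuration stays in effect.
    fn reload(&self, next: Config) -> Result<(), String> {
        if next.max_conns == 0 {
            return Err("max_conns must be > 0".into());
        }
        *self.current.write().unwrap() = Arc::new(next);
        Ok(())
    }
}

fn main() {
    let store = ConfigStore::new(Config { upstream: "127.0.0.1:8080".into(), max_conns: 100 });
    let in_flight = store.get(); // an old request holds its snapshot
    store
        .reload(Config { upstream: "127.0.0.1:9090".into(), max_conns: 200 })
        .unwrap();
    assert_eq!(in_flight.upstream, "127.0.0.1:8080"); // old view untouched
    assert_eq!(store.get().upstream, "127.0.0.1:9090"); // new requests see the swap
    println!("reload applied without dropping the in-flight snapshot");
}
```

This avoids spawning a new process entirely, but it only covers configuration data; changing listen sockets or TLS material would need more machinery, which is where the pingora-style process handoff earns its keep.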

Putting this in the "Backlog" milestone, as I'm not sure if this will make it into River before the end of April, but it might.

I'll also add other techniques that can be applied to processes such as Docker containers: https://iximiuz.com/en/posts/multiple-containers-same-port-reverse-proxy/

As a note, pingora already supports hot-reload: https://github.com/cloudflare/pingora/blob/main/docs/user_guide/start_stop.md edit: also https://github.com/cloudflare/pingora/blob/main/docs/user_guide/graceful.md
@jamesmunns I believe Pingora's Graceful Upgrade is the way to go. Since Pingora is already battle-tested in production, at this point I would rather rely on Pingora's approach than introduce something new that would need time to mature.

There is also a different Rust-based reverse proxy project with a strong focus on changing configurations without any downtime or lost connections: https://github.com/sozu-proxy/sozu.

Quote from their website (https://github.com/sozu-proxy/sozu):

SŌZU is a HTTP reverse proxy built in Rust, that can handle fine grained configuration changes at runtime without reloads, and designed to never ever stop.

I am not sure how they achieve it exactly but their implementation might be worth looking into when designing this feature.

Noting that this has been scheduled for the milestone that is just starting now; we should have some progress on this in the coming weeks.

This was implemented by #49, please feel free to open an issue if there are any follow-on needs!