Support hot upgrade or smooth upgrade, Upgrade smoothly, Gracefully Upgrade, Source cleaning

Question

winlinvip opened this issue 5 years ago · comments

Usage

SRS supports two signals:

SIGTERM: Fast exit. SRS quickly cleans up, actively closes the existing connections, and then exits. K8s sends this signal when it terminates a Pod (after any preStop hook has run), and then sends SIGKILL to forcefully kill the Pod after a timeout. We can enable force_grace_quit to make SRS treat SIGTERM as a graceful quit as well.
SIGQUIT: Graceful exit. SRS closes its listeners and waits for all clients to disconnect before exiting. While connections remain, SRS will not exit on its own; in K8s the longest possible wait is terminationGracePeriodSeconds, after which the Pod is forcefully killed. Once there are no connections, SRS waits for grace_final_wait and then exits.

Note: SRS does not implement a maximum waiting time. It will wait for clients to disconnect indefinitely without forcing an exit. In conjunction with the terminationGracePeriodSeconds configuration in K8s for managing Pods, K8s will send SIGKILL to forcefully shut down SRS after a timeout.
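
The interaction between the graceful exit and the K8s timeout can be illustrated with a small model. This is a simplified Python sketch, not SRS or K8s code: the function name `shutdown_outcome` and its inputs are hypothetical, and it assumes SRS waits grace_final_wait once after the last connection ends.

```python
def shutdown_outcome(conn_remaining_ms, termination_grace_period_seconds,
                     grace_final_wait_ms=3200):
    """Model how a graceful quit ends inside a K8s Pod.

    conn_remaining_ms: how long each existing connection still lasts, in ms
                       (hypothetical input, for illustration only).
    Returns (how, ms_after_signal), where how is "graceful" or "sigkill".
    """
    # K8s sends SIGKILL once terminationGracePeriodSeconds has elapsed.
    deadline_ms = termination_grace_period_seconds * 1000
    # SRS itself has no maximum wait: it waits for the last client to leave.
    drained_at_ms = max(conn_remaining_ms, default=0)
    # Assumption: the final cleanup wait follows the last disconnect.
    exit_at_ms = drained_at_ms + grace_final_wait_ms
    if exit_at_ms <= deadline_ms:
        return ("graceful", exit_at_ms)
    return ("sigkill", deadline_ms)


print(shutdown_outcome([], 30))              # no clients
print(shutdown_outcome([3_600_000], 120))    # a one-hour stream, 120 s grace period
```

In this model, a long-lived stream always ends with SIGKILL unless terminationGracePeriodSeconds is set larger than the longest expected connection.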

Other

To keep the handling simple, SRS does not free in-memory objects when a stream stops, since the stream may be published again. Cleaning them up would require complex and careful handling of Source objects, which works against keeping things simple.

Not cleaning up Source objects causes memory to grow continuously. This may not be noticeable in scenarios with few published streams and many players. However, in scenarios with many published streams, such as monitoring and conferencing, cleaning up the streams becomes necessary. Reference:

PR for Source cleanup submitted by Nobody2 (#1568) discussed various scenarios that require cleaning up. Of course, Nobody2 did a great job with the submitted PR, but the issue itself is too complex.
Online reports (#1509, #1271, #1507) indicate memory leaks and OOM caused by not cleaning up Source objects.

Currently, partial optimizations have been implemented to alleviate this issue.

Resolve coroutine issue: link
Reduce memory usage in source: link

At the same time, we are also considering the most stable and easiest solution. There is another idea to make SRS support smooth exit and smooth upgrade, roughly as follows:

Disable exclusive access to the PID file, allowing a new SRS to be started.
Use REUSEPORT to open a new SRS, allowing both the old and new SRS to provide services using the same PID file.
The old SRS will no longer accept new connections and the API port will be closed. After serving existing clients or after a certain period of time, such as 12 hours, the old SRS will exit.

This way, when the old SRS exits it easily and safely releases the Source objects it created, along with any other memory it may have leaked. Users can smoothly upgrade and exit SRS during off-peak periods according to their business needs, minimizing the impact on their clients.
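
The REUSEPORT idea above can be sketched in Python. This is only an illustration under assumptions: it requires a platform that provides SO_REUSEPORT (for example Linux), the helper name `open_reuseport_listener` is hypothetical, and both "instances" live in one process here, whereas the old and new SRS would be separate processes.

```python
import socket


def open_reuseport_listener(port, host="127.0.0.1"):
    """Open a TCP listener that lets other sockets bind the same address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Every socket sharing the port must set SO_REUSEPORT before bind().
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(128)
    return sock


# "Old" instance: port 0 asks the kernel for any free port.
old_listener = open_reuseport_listener(0)
port = old_listener.getsockname()[1]

# "New" instance binds the very same port while the old one is still listening.
new_listener = open_reuseport_listener(port)

# The old instance stops accepting; the new one keeps serving the port.
old_listener.close()
print("new instance still listening on port", new_listener.getsockname()[1])
new_listener.close()
```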

The only issue is that while both the new and old SRS are providing services, the API is served only by the new SRS, so the system statistics are not accurate: the clients still served by the old SRS may be missing from the counts.

Remark: In an origin cluster, a stream that is still on the old SRS may not be discoverable. In this case, the stream must be forcefully disconnected, and the client needs to support retries for this to work smoothly. One solution is to place an Edge in front of the origin, so that retries are handled by the Edge.

Winlin · Answer 1 · Sun Jan 19 2020 14:23:26 GMT+0800 (China Standard Time)

Users can choose:

Depending on their business situation, users can choose to smoothly upgrade and exit SRS during off-peak periods.
If there is a significant increase in the memory usage of SRS, exceeding the warning level, users can choose to urgently initiate a smooth upgrade.
Users can choose to upgrade in a planned manner after a new stable version is released, in order to avoid impacting a small number of existing users.

Winlin · Answer 2 · Tue Feb 18 2020 09:40:54 GMT+0800 (China Standard Time)

When SRS is deployed on K8S, the service needs to support upgrades, rollbacks, and canary releases. The basic requirement for these mechanisms is that SRS supports Gracefully Quit/Upgrade. Only when SRS does its part well can K8S or other release mechanisms meet the requirements for production-level releases.

The SRS cluster is divided into Origin and Edge clusters, and this issue can be viewed separately.

The Origin cluster can be restarted directly because the Edge cluster in front of it can retry. However, the exit can still be improved, for example by not exiting abruptly, since that shifts all streams to another Origin server at once.
The Edge cluster cannot be restarted directly because it serves the clients directly. It can only exit after it has finished serving them. Whether the connections are long-lived or short-lived, the requirement is the same; only the duration differs.

Therefore, we focus on Gracefully Quit in the Edge cluster, which can follow the mechanism of Nginx:

Update the Nginx binary.
Send the SIGUSR2 signal to Nginx.
The Nginx master renames its PID file to /var/run/nginx.pid.oldbin, allowing the new master to start. This file can also be used to send signals to the old master.
Start the new master using execve, with its PID file at /var/run/nginx.pid. Pass the listen file descriptor to the new master through the ENV, allowing both the old and new masters to listen on the same port.
Send the SIGWINCH signal to the old master to gracefully terminate the workers after serving the existing file descriptors. Meanwhile, the new master is already working, and the new workers are serving new connections.
Send the SIGQUIT signal to the old master to initiate a graceful shutdown.
After a certain period, the old master can also be sent the SIGTERM signal to exit directly.
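
The signal sequence above can be sketched as a short Python script. This is a minimal sketch, not an official Nginx tool: the PID file paths are the ones named in the steps, the helper names (`read_pid`, `upgrade_steps`, `run_upgrade`) are hypothetical, and a real script would wait and verify that the new master is running between steps.

```python
import os
import signal

PID_FILE = "/var/run/nginx.pid"
OLD_PID_FILE = PID_FILE + ".oldbin"  # where the old master's PID ends up after SIGUSR2


def read_pid(path):
    """Read a process ID from a PID file."""
    with open(path) as f:
        return int(f.read().strip())


def upgrade_steps():
    """Ordered (signal, pid_file) pairs for the upgrade described above."""
    return [
        (signal.SIGUSR2, PID_FILE),       # old master renames its PID file, starts the new master
        (signal.SIGWINCH, OLD_PID_FILE),  # old master gracefully stops its workers
        (signal.SIGQUIT, OLD_PID_FILE),   # old master shuts down gracefully
    ]


def run_upgrade(kill=os.kill, read=read_pid):
    """Send each signal in order; kill and read are injectable for dry runs."""
    for sig, pid_file in upgrade_steps():
        kill(read(pid_file), sig)


# Dry run: print the signals instead of sending them, using a made-up PID.
run_upgrade(kill=lambda pid, sig: print("kill", pid, signal.Signals(sig).name),
            read=lambda path: 1234)
```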

Because Nginx starts the new master using execve and has it inherit the listen file descriptor, this process is relatively complex. SRS can instead use REUSEPORT to start a new process that listens on the same port directly, which makes the solution simpler.

Additionally, SRS3 has been released with the following plans:

SRS3 supports some key features that require script coordination or K8S management.
SRS4 will provide improved support for Gracefully Upgrade and offer more comprehensive features.

Winlin · Answer 3 · Tue Feb 18 2020 14:49:14 GMT+0800 (China Standard Time)

To support updates, rollbacks, and gray (canary) releases, there are two main requirements for SRS:

Gracefully Quit: Smoothly exit by closing the listening port, no longer accepting new connections, and waiting for existing connections to end before quitting.
Gracefully Upgrade: Smoothly upgrade by starting a new SRS instance while the old one continues to run. The old instance will begin a Gracefully Quit process to smoothly exit.

The key to Gracefully Quit is to no longer accept new connections and wait for the existing connections to exit. We can achieve this by closing the listening file descriptor (fd) in SRS. Another approach is to remove the backend Pod from the SLB (Server Load Balancer), which will naturally prevent new fds from being created.
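
The two conditions for Gracefully Quit (stop accepting, then wait for existing connections) can be modeled in a few lines. This is a minimal Python sketch with hypothetical names (`GracefulQuit` and its methods); it models only the bookkeeping, not real sockets, and it is not SRS's implementation.

```python
class GracefulQuit:
    """Track whether a server that is quitting gracefully may exit yet."""

    def __init__(self):
        self.listening = True   # stands in for the open listening fd
        self.conns = set()      # IDs of the connections being served

    def accept(self, conn_id):
        """Admit a new connection only while the listener is open."""
        if not self.listening:
            return False
        self.conns.add(conn_id)
        return True

    def disconnect(self, conn_id):
        """Record that a client has gone away."""
        self.conns.discard(conn_id)

    def begin_quit(self):
        """Close the listener: existing connections stay, new ones are refused."""
        self.listening = False

    def can_exit(self):
        """True once the listener is closed and every client has left."""
        return not self.listening and not self.conns


server = GracefulQuit()
server.accept("player-1")
server.begin_quit()
print(server.accept("player-2"), server.can_exit())  # new client refused, still waiting
server.disconnect("player-1")
print(server.can_exit())
```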

Winlin · Answer 4 · Tue Feb 18 2020 19:25:35 GMT+0800 (China Standard Time)

SRS adds a new signal: SIGQUIT, which stands for Gracefully QUIT. It allows for a smooth exit by closing the listening file descriptor (FD) and waiting for existing connections to finish before exiting.

Finally, it waits for a certain period of time, 3.2 seconds by default, to allow the final cleanup to complete. For example, if there are no connections, only the listeners need to be closed:


[root@55233a151f96 trunk]# netstat -anp|grep srs
tcp        0      0 0.0.0.0:1985            0.0.0.0:*               LISTEN      5698/./objs/srs     
tcp        0      0 0.0.0.0:1935            0.0.0.0:*               LISTEN      5698/./objs/srs     
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      5698/./objs/srs     

[root@55233a151f96 trunk]# killall -s SIGQUIT srs
[root@55233a151f96 trunk]# netstat -anp|grep srs
[root@55233a151f96 trunk]# 

[2020-02-18 11:07:21.529][Trace][5698][700] cleanup for quit signal fast=0, grace=1
[2020-02-18 11:07:21.530][Warn][5698][700][11] main cycle terminated, system quit normally.
[2020-02-18 11:07:26.740][Trace][5698][700] final wait for another 5200ms
[2020-02-18 11:07:26.740][Trace][5698][700] srs gracefully quit

When there are connections, it will keep waiting.

[root@55233a151f96 trunk]# netstat -anp|grep srs
tcp        0      0 0.0.0.0:1985            0.0.0.0:*               LISTEN      5776/./objs/srs     
tcp        0      0 0.0.0.0:1935            0.0.0.0:*               LISTEN      5776/./objs/srs     
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN      5776/./objs/srs     
tcp        0      0 172.17.0.2:1935         172.17.0.1:36840        ESTABLISHED 5776/./objs/srs     

[root@55233a151f96 trunk]# killall -s SIGQUIT srs
[root@55233a151f96 trunk]# netstat -anp|grep srs
tcp        0      0 172.17.0.2:1935         172.17.0.1:36840        ESTABLISHED 5776/./objs/srs 

[2020-02-18 11:09:57.356][Trace][5776][516] cleanup for quit signal fast=0, grace=1
[2020-02-18 11:09:57.356][Warn][5776][516][11] main cycle terminated, system quit normally.
[2020-02-18 11:09:58.382][Trace][5776][516] wait for 1 conns to quit
[2020-02-18 11:10:00.459][Trace][5776][516] wait for 1 conns to quit

You can see that the listening sockets are closed, but the connection being served is still open. SRS will only exit after this streaming connection is finished.

Add a new configuration for the waiting time before exiting, with a default value of 3.2 seconds.

# for gracefully quit, final wait for cleanup in milliseconds.
# default: 3200
grace_final_wait 3200;

Note: For K8S, it is also necessary to enable the force_grace_quit configuration.

Winlin · Answer 5 · Tue Feb 18 2020 22:03:56 GMT+0800 (China Standard Time)

We also need a configuration because when K8S terminates a Pod, it sends a SIGTERM signal to SRS (after the preStop hook, if any). SIGTERM is a fast quit signal that causes SRS to exit quickly. Even during the Gracefully Quit period, SRS will handle this signal. Therefore, it is necessary to be able to configure SRS to treat SIGTERM as a gracefully quit signal.

# Whether force gracefully quit, never fast quit.
# By default, SIGTERM which means fast quit, is sent by K8S, so we need to
# force SRS to treat SIGTERM as gracefully quit for gray release or canary.
# default: off
force_grace_quit off;

By default, it is not enabled, which means that SRS exits quickly when it receives a SIGTERM signal. This is suitable for general scenarios, such as origin servers or situations where smooth upgrades are not required.
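
How the two signals combine with this option can be summarized as a small decision function. This is a minimal Python sketch rather than SRS's implementation (SRS is written in C++); the function name `quit_mode` and the returned strings are hypothetical.

```python
import signal


def quit_mode(signum, force_grace_quit=False):
    """Map a received signal to the kind of exit described in this thread."""
    if signum == signal.SIGQUIT:
        # SIGQUIT always means gracefully quit.
        return "graceful"
    if signum == signal.SIGTERM:
        # SIGTERM is a fast quit unless force_grace_quit is on.
        return "graceful" if force_grace_quit else "fast"
    return None  # other signals are outside this sketch


print(quit_mode(signal.SIGTERM))                         # default: off
print(quit_mode(signal.SIGTERM, force_grace_quit=True))  # K8S deployment
```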

Winlin · Answer 6 · Fri Feb 21 2020 12:55:54 GMT+0800 (China Standard Time)

SRS3 already supports graceful shutdown. It can also support smooth upgrades in the K8S and SLB architectures. Please refer to: https://github.com/ossrs/srs/wiki/v4_CN_K8s#srs-cluster-update-rollback-gray-release-with-zero-downtime

Winlin · Answer 7 · Tue Dec 01 2020 13:36:18 GMT+0800 (China Standard Time)

Just need to clean up one Source, as described in other Issues:

#413 added support for Source cleaning, but it was reverted due to multiple issues. It will be improved and resolved in the future.
In the meantime, you can use Gracefully Quit to exit smoothly (#1579 (comment)) and restart the service when there is no traffic, as a temporary workaround for this problem.

For more progress, please refer to: #413