ossrs / srs

SRS is a simple, high-efficiency, real-time video server supporting RTMP, WebRTC, HLS, HTTP-FLV, SRT, MPEG-DASH, and GB28181.

Home Page: https://ossrs.io


K8s: CloudNative: Support SLB health check.

winlinvip opened this issue

Description

  1. SRS version: 3.0.112
  2. The log of SRS is as follows:
[2020-02-12 11:08:02.825][Warn][1][430][107] accept client failed, err is code=1006 : fd2conn : ignore empty ip, fd=4749
thread [1][430]: accept_client() [src/app/srs_app_server.cpp:1165][errno=107]
thread [1][430]: fd2conn() [src/app/srs_app_server.cpp:1192][errno=107]

[2020-02-12 11:08:03.013][Warn][1][431][107] accept client failed, err is code=1006 : fd2conn : ignore empty ip, fd=4750
thread [1][431]: accept_client() [src/app/srs_app_server.cpp:1165][errno=107]
thread [1][431]: fd2conn() [src/app/srs_app_server.cpp:1192][errno=107]

Replay

Steps to reproduce the bug:

  1. Run SRS behind a cloud load balancing service (SLB).
  2. Configure the health check as TCP or HTTP.
  3. A large number of warning logs appear, roughly one every 100 milliseconds.

Expected behavior:

Support SLB health checks (TCP or HTTP); refer to Aliyun SLB Health Check.


SRS3 will be the main version in use for some time, so cloud-native support will be prioritized in SRS3, unless a change is too large and affects stability.


Currently, SLB's TCP keep-alive detection works correctly, but it produces a large number of useless warning logs in SRS.

[2020-02-16 14:00:24.542][Warn][1][471][107] accept client failed, err is code=1006 : fd2conn : ignore empty ip, fd=8288
thread [1][471]: accept_client() [src/app/srs_app_server.cpp:1165][errno=107]
thread [1][471]: fd2conn() [src/app/srs_app_server.cpp:1192][errno=107]

After filtering out these invalid logs, the log file shrinks from about 2.5 MB to about 240 KB:

-rw-r--r--   1 chengli.ycl  staff  2485201 Feb 16 22:00 t.log
-rw-r--r--   1 chengli.ycl  staff   242670 Feb 16 22:12 t2.log


Logs must be collected centrally. Otherwise, if there is a problem playing livestream.flv, you have to inspect each edge pod individually:

Mac:srs chengli.ycl$ kubectl get po |grep edge
srs-edge-deploy-5cfd4b5b74-7hwfh      1/1     Running   0          75m
srs-edge-deploy-5cfd4b5b74-crgtn      1/1     Running   0          75m
srs-edge-deploy-5cfd4b5b74-gbzsp      1/1     Running   0          75m
srs-edge-deploy-5cfd4b5b74-rx856      1/1     Running   0          75m

Mac:srs.wiki chengli.ycl$ kubectl exec srs-edge-deploy-5cfd4b5b74-7hwfh grep 'livestream.flv' objs/srs.log
[2020-02-16 13:38:12.800][Trace][1][552] http: mount flv stream for sid=/live/livestream, mount=/live/livestream.flv
Mac:srs.wiki chengli.ycl$ kubectl exec srs-edge-deploy-5cfd4b5b74-crgtn grep 'livestream.flv' objs/srs.log
[2020-02-16 14:33:35.624][Trace][1][780] http: mount flv stream for sid=/live/livestream, mount=/live/livestream.flv
Mac:srs.wiki chengli.ycl$ kubectl exec srs-edge-deploy-5cfd4b5b74-gbzsp grep 'livestream.flv' objs/srs.log
command terminated with exit code 1
Mac:srs.wiki chengli.ycl$ kubectl exec srs-edge-deploy-5cfd4b5b74-rx856 grep 'livestream.flv' objs/srs.log
[2020-02-16 13:42:44.325][Trace][1][369] HTTP GET http://r.ossrs.net:8080/live/livestream.flv, content-length=-1
[2020-02-16 13:42:44.325][Trace][1][369] http: mount flv stream for sid=/live/livestream, mount=/live/livestream.flv
[2020-02-16 13:42:44.325][Trace][1][369] FLV /live/livestream.flv, encoder=FastFLV, nodelay=0, mw_sleep=350ms, cache=0, msgs=128
Mac:srs.wiki chengli.ycl$ 

Then, if the logs are collected centrally with timestamps, for example in SLS, searching becomes easy: just type livestream.flv in SLS to find all the information about this stream across all nodes.


The TCP keep-alive detection connection from SLB fails when SRS tries to retrieve its peer information. It appears in lsof as:

COMMAND PID   USER   FD   TYPE  DEVICE SIZE/OFF    NODE NAME
srs     693 winlin   14u  sock     0,6      0t0 7163442 can't identify protocol

SRS fails to obtain the peer address of this file descriptor:

string srs_get_peer_ip(int fd)
{
    sockaddr_storage addr;
    socklen_t addrlen = sizeof(addr);
    if (getpeername(fd, (sockaddr*)&addr, &addrlen) == -1) {
        return "";
    }
    // ...

This will result in a large number of error messages.

Capture packets using tcpdump:

sudo tcpdump -i eth0 tcp port 2935 -w t.pcap

The SRS server IP is 172.17.1.57, and the SLB IP is 100.121.184.64:

(Screenshot: tcpdump capture of the SLB health-check heartbeats)

The capture shows that packets 1-2-3-4 form one heartbeat: the second packet is sent by the SRS server, after which the SLB immediately closes the connection. Packets 5-6-7-8 form the next heartbeat, only 0.3 seconds later. The detection interval configured on the SLB is 2 seconds, but a single SLB uses around 10 LVS nodes for detection, so the aggregate interval observed at the server is much shorter (roughly 2s / 10 probes ≈ 0.2s). For the health check mechanism, refer to: TCP Listening Health Check Mechanism


SRS now supports TCP-based health checks, enabled by default: connections whose peer IP cannot be obtained are silently ignored.

# Whether client empty IP is ok, for example, health checking by SLB.
# If ok(on), we will ignore this connection without warnings or errors.
# default: on
empty_ip_ok on;


Fixed