ossrs / srs

SRS is a simple, high-efficiency, real-time video server supporting RTMP, WebRTC, HLS, HTTP-FLV, SRT, MPEG-DASH, and GB28181.

Home Page: https://ossrs.io


K8s: CloudNative: Support SLB health check.

winlinvip opened this issue

Description

  1. SRS version: 3.0.112
  2. The log of SRS is as follows:
[2020-02-12 11:08:02.825][Warn][1][430][107] accept client failed, err is code=1006 : fd2conn : ignore empty ip, fd=4749
thread [1][430]: accept_client() [src/app/srs_app_server.cpp:1165][errno=107]
thread [1][430]: fd2conn() [src/app/srs_app_server.cpp:1192][errno=107]

[2020-02-12 11:08:03.013][Warn][1][431][107] accept client failed, err is code=1006 : fd2conn : ignore empty ip, fd=4750
thread [1][431]: accept_client() [src/app/srs_app_server.cpp:1165][errno=107]
thread [1][431]: fd2conn() [src/app/srs_app_server.cpp:1192][errno=107]

Replay

Steps to reproduce the bug:

  1. Run SRS behind a cloud load balancing service (SLB).
  2. Configure the health check as TCP or HTTP.
  3. A large number of warning logs appear, roughly one every 100 milliseconds.

Expected behavior:

Support SLB health checks (TCP or HTTP); refer to Aliyun SLB Health Check.


SRS3 will be the main version in use for some time, so cloud-native support will be prioritized in SRS3, unless a change is too large and affects stability.


Currently, SLB's TCP keep-alive detection works correctly, but it produces a large number of useless warning logs in SRS.

[2020-02-16 14:00:24.542][Warn][1][471][107] accept client failed, err is code=1006 : fd2conn : ignore empty ip, fd=8288
thread [1][471]: accept_client() [src/app/srs_app_server.cpp:1165][errno=107]
thread [1][471]: fd2conn() [src/app/srs_app_server.cpp:1192][errno=107]

After filtering out these invalid logs, the log file shrinks from about 2.5 MB to about 240 KB:

-rw-r--r--   1 chengli.ycl  staff  2485201 Feb 16 22:00 t.log
-rw-r--r--   1 chengli.ycl  staff   242670 Feb 16 22:12 t2.log


Logs must be collected centrally. Otherwise, if there is a problem playing livestream.flv, you have to inspect each edge pod individually:

Mac:srs chengli.ycl$ kubectl get po |grep edge
srs-edge-deploy-5cfd4b5b74-7hwfh      1/1     Running   0          75m
srs-edge-deploy-5cfd4b5b74-crgtn      1/1     Running   0          75m
srs-edge-deploy-5cfd4b5b74-gbzsp      1/1     Running   0          75m
srs-edge-deploy-5cfd4b5b74-rx856      1/1     Running   0          75m

Mac:srs.wiki chengli.ycl$ kubectl exec srs-edge-deploy-5cfd4b5b74-7hwfh grep 'livestream.flv' objs/srs.log
[2020-02-16 13:38:12.800][Trace][1][552] http: mount flv stream for sid=/live/livestream, mount=/live/livestream.flv
Mac:srs.wiki chengli.ycl$ kubectl exec srs-edge-deploy-5cfd4b5b74-crgtn grep 'livestream.flv' objs/srs.log
[2020-02-16 14:33:35.624][Trace][1][780] http: mount flv stream for sid=/live/livestream, mount=/live/livestream.flv
Mac:srs.wiki chengli.ycl$ kubectl exec srs-edge-deploy-5cfd4b5b74-gbzsp grep 'livestream.flv' objs/srs.log
command terminated with exit code 1
Mac:srs.wiki chengli.ycl$ kubectl exec srs-edge-deploy-5cfd4b5b74-rx856 grep 'livestream.flv' objs/srs.log
[2020-02-16 13:42:44.325][Trace][1][369] HTTP GET http://r.ossrs.net:8080/live/livestream.flv, content-length=-1
[2020-02-16 13:42:44.325][Trace][1][369] http: mount flv stream for sid=/live/livestream, mount=/live/livestream.flv
[2020-02-16 13:42:44.325][Trace][1][369] FLV /live/livestream.flv, encoder=FastFLV, nodelay=0, mw_sleep=350ms, cache=0, msgs=128
Mac:srs.wiki chengli.ycl$ 

Then, if the logs are collected centrally with timestamps, for example in SLS, searching becomes easy: just type livestream.flv in SLS to find all the information about this stream across all nodes.


The TCP keep-alive detection connection from SLB fails when SRS tries to retrieve its peer information. It appears in lsof as:

COMMAND PID   USER   FD   TYPE  DEVICE SIZE/OFF    NODE NAME
srs     693 winlin   14u  sock     0,6      0t0 7163442 can't identify protocol

SRS fails to obtain the peer address of this file descriptor:

string srs_get_peer_ip(int fd)
{
    sockaddr_storage addr;
    socklen_t addrlen = sizeof(addr);
    if (getpeername(fd, (sockaddr*)&addr, &addrlen) == -1) {
        return "";
    }
    // ...

This will result in a large number of error messages.

Capture packets using tcpdump:

sudo tcpdump -i eth0 tcp port 2935 -w t.pcap

The SRS server IP is 172.17.1.57, and the SLB IP is 100.121.184.64:

(Screenshot: tcpdump capture of the SLB health-check heartbeats)

The capture shows that packets 1-2-3-4 form one heartbeat: the second packet is sent by the SRS server, after which the SLB immediately closes the connection. Packets 5-6-7-8 form the next heartbeat, only 0.3 seconds later. The detection interval configured on the SLB is 2 seconds, but a single SLB uses around 10 LVS nodes for detection, so the aggregate interval observed at the server is much shorter (roughly 2s / 10 probes ≈ 0.2s). For the health check mechanism, refer to: TCP Listening Health Check Mechanism


SRS now supports TCP-based health checks, enabled by default: connections whose peer IP cannot be obtained are silently ignored.

# Whether client empty IP is ok, for example, health checking by SLB.
# If ok(on), we will ignore this connection without warnings or errors.
# default: on
empty_ip_ok on;


Fixed