K8s: CloudNative: Support SLB health check.
winlinvip opened this issue · comments
Description
- SRS version: 3.0.112
- The log of SRS is as follows:
[2020-02-12 11:08:02.825][Warn][1][430][107] accept client failed, err is code=1006 : fd2conn : ignore empty ip, fd=4749
thread [1][430]: accept_client() [src/app/srs_app_server.cpp:1165][errno=107]
thread [1][430]: fd2conn() [src/app/srs_app_server.cpp:1192][errno=107]
[2020-02-12 11:08:03.013][Warn][1][431][107] accept client failed, err is code=1006 : fd2conn : ignore empty ip, fd=4750
thread [1][431]: accept_client() [src/app/srs_app_server.cpp:1165][errno=107]
thread [1][431]: fd2conn() [src/app/srs_app_server.cpp:1192][errno=107]
Replay
Steps to reproduce the bug:
- Put SRS behind a cloud load balancer (SLB) as a backend server.
- Configure the SLB health check as TCP or HTTP.
- SRS then logs a flood of warnings, approximately one every 100 milliseconds.
Expected behavior:
SRS should support SLB health checks over TCP or HTTP; refer to Aliyun SLB Health Check.
SRS3 will be the main version in use for some time, so cloud-native support will be prioritized in SRS3, unless a change is large enough to affect stability.
Currently, the SLB's TCP keep-alive detection itself works, but it makes SRS write a large number of useless log entries:
[2020-02-16 14:00:24.542][Warn][1][471][107] accept client failed, err is code=1006 : fd2conn : ignore empty ip, fd=8288
thread [1][471]: accept_client() [src/app/srs_app_server.cpp:1165][errno=107]
thread [1][471]: fd2conn() [src/app/srs_app_server.cpp:1192][errno=107]
Filtering out these entries shrinks the log by about 90%, from 2,485,201 bytes (t.log) to 242,670 bytes (t2.log):
-rw-r--r-- 1 chengli.ycl staff 2485201 Feb 16 22:00 t.log
-rw-r--r-- 1 chengli.ycl staff 242670 Feb 16 22:12 t2.log
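The issue does not say how the entries were filtered. As a rough sketch of one way to do it, the function below drops every "ignore empty ip" warning together with the "thread [...]" stack lines that directly follow it. The function name and the matching rules are assumptions made for illustration, not part of SRS.

```cpp
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of SRS): copies `in` to `out`, dropping each
// "ignore empty ip" warning and the "thread [" lines that immediately follow.
// Returns the number of lines dropped.
inline std::size_t filter_empty_ip_warnings(std::istream& in, std::ostream& out) {
    std::size_t dropped = 0;
    bool in_warning = false;  // true while skipping the stack of a dropped warning
    std::string line;
    while (std::getline(in, line)) {
        if (line.find("fd2conn : ignore empty ip") != std::string::npos) {
            in_warning = true;
            ++dropped;
            continue;
        }
        // rfind(prefix, 0) == 0 means the line starts with the prefix.
        if (in_warning && line.rfind("thread [", 0) == 0) {
            ++dropped;
            continue;
        }
        in_warning = false;
        out << line << '\n';
    }
    return dropped;
}
```

Feeding it `objs/srs.log` through a `std::ifstream` and writing to a second file would give a before/after comparison like the one above, assuming the warning format stays as shown.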
Logs need to be collected centrally.
Otherwise, if there is a problem playing livestream.flv, you have to check each edge pod in turn:
Mac:srs chengli.ycl$ kubectl get po |grep edge
srs-edge-deploy-5cfd4b5b74-7hwfh 1/1 Running 0 75m
srs-edge-deploy-5cfd4b5b74-crgtn 1/1 Running 0 75m
srs-edge-deploy-5cfd4b5b74-gbzsp 1/1 Running 0 75m
srs-edge-deploy-5cfd4b5b74-rx856 1/1 Running 0 75m
Mac:srs.wiki chengli.ycl$ kubectl exec srs-edge-deploy-5cfd4b5b74-7hwfh grep 'livestream.flv' objs/srs.log
[2020-02-16 13:38:12.800][Trace][1][552] http: mount flv stream for sid=/live/livestream, mount=/live/livestream.flv
Mac:srs.wiki chengli.ycl$ kubectl exec srs-edge-deploy-5cfd4b5b74-crgtn grep 'livestream.flv' objs/srs.log
[2020-02-16 14:33:35.624][Trace][1][780] http: mount flv stream for sid=/live/livestream, mount=/live/livestream.flv
Mac:srs.wiki chengli.ycl$ kubectl exec srs-edge-deploy-5cfd4b5b74-gbzsp grep 'livestream.flv' objs/srs.log
command terminated with exit code 1
Mac:srs.wiki chengli.ycl$ kubectl exec srs-edge-deploy-5cfd4b5b74-rx856 grep 'livestream.flv' objs/srs.log
[2020-02-16 13:42:44.325][Trace][1][369] HTTP GET http://r.ossrs.net:8080/live/livestream.flv, content-length=-1
[2020-02-16 13:42:44.325][Trace][1][369] http: mount flv stream for sid=/live/livestream, mount=/live/livestream.flv
[2020-02-16 13:42:44.325][Trace][1][369] FLV /live/livestream.flv, encoder=FastFLV, nodelay=0, mw_sleep=350ms, cache=0, msgs=128
Mac:srs.wiki chengli.ycl$
If the logs were collected in SLS instead, searching would be easy: enter livestream.flv in SLS to find all the information about this stream on all nodes, ordered by timestamp.
SRS fails to retrieve peer information for the connections opened by the SLB's TCP keep-alive detection. In lsof, such a connection appears as follows:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
srs 693 winlin 14u sock 0,6 0t0 7163442 can't identify protocol
SRS fails to obtain the peer address of the file descriptor:
string srs_get_peer_ip(int fd)
{
    sockaddr_storage addr;
    socklen_t addrlen = sizeof(addr);
    // getpeername() fails for the health check connection, so the IP is empty.
    if (getpeername(fd, (sockaddr*)&addr, &addrlen) == -1) {
        return "";
    }
    // ...
}
Each failure is logged as a warning, which results in a large number of error messages.
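The errno=107 in the log is ENOTCONN on Linux ("Transport endpoint is not connected"), which is what getpeername() reports when the socket has no connected peer. The sketch below is a standalone variant of the lookup that also hands back the errno; the name `peer_ip_or_empty` and the `err` out-parameter are hypothetical and not SRS's API. It can be exercised with a socket that was never connected, which only approximates the SLB case where the peer has presumably already gone away.

```cpp
#include <arpa/inet.h>
#include <cerrno>
#include <netdb.h>
#include <netinet/in.h>
#include <string>
#include <sys/socket.h>

// Hypothetical helper (not SRS code): returns the numeric peer IP of fd,
// or "" on failure. *err receives the errno from getpeername(), or 0.
inline std::string peer_ip_or_empty(int fd, int* err) {
    *err = 0;
    sockaddr_storage addr;
    socklen_t addrlen = sizeof(addr);
    if (getpeername(fd, reinterpret_cast<sockaddr*>(&addr), &addrlen) == -1) {
        *err = errno;  // ENOTCONN when the socket has no connected peer
        return "";
    }
    // Convert the binary address to text without any DNS lookup.
    char host[INET6_ADDRSTRLEN];
    if (getnameinfo(reinterpret_cast<sockaddr*>(&addr), addrlen,
                    host, sizeof(host), nullptr, 0, NI_NUMERICHOST) != 0) {
        return "";
    }
    return host;
}
```

Calling it on a freshly created, unconnected TCP socket returns an empty string with `err` set to ENOTCONN, matching the empty IP and errno seen in the warnings.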
Capture packets using tcpdump:
sudo tcpdump -i eth0 tcp port 2935 -w t.pcap
The SRS server IP is 172.17.1.57, and the SLB IP is 100.121.184.64.
In the capture, packets 1-2-3-4 form one heartbeat: the second packet is sent by the SRS server, after which the SLB immediately closes the connection. Packets 5-6-7-8 are the second heartbeat, which starts only 0.3 seconds later even though the detection interval configured on the SLB is 2 seconds, because an SLB has around 10 LVS nodes performing detection. For the health check mechanism, refer to: TCP Listening Health Check Mechanism
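The four-packet heartbeat can be approximated without an SLB for local testing. The sketch below completes the TCP handshake and then closes with SO_LINGER set to a zero timeout, so the close is sent as a reset. It assumes the SLB probe ends with a reset, as the immediate close in the capture suggests; the function name `tcp_probe` is hypothetical.

```cpp
#include <arpa/inet.h>
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical stand-in for an SLB TCP health check: connect, then reset.
// Returns true if the TCP handshake succeeded.
inline bool tcp_probe(const char* ipv4, std::uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return false;

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1) {
        close(fd);
        return false;  // not a valid dotted-quad IPv4 address
    }

    bool ok = connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0;

    // A zero linger timeout makes close() abort the connection with RST
    // instead of the normal FIN handshake.
    linger lg{};
    lg.l_onoff = 1;
    lg.l_linger = 0;
    setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
    close(fd);
    return ok;
}
```

Calling `tcp_probe` in a loop against the SRS listen port (2935 in the tcpdump command above) should be a reasonable way to check whether the warnings appear, although the timing relative to accept() may differ from a real SLB.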
SRS now supports TCP-based health checks, enabled by default: connections whose peer IP cannot be obtained are ignored without warnings or errors.
# Whether client empty IP is ok, for example, health checking by SLB.
# If ok(on), we will ignore this connection without warnings or errors.
# default: on
empty_ip_ok on;
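The behavior described by the config comment can be pictured as a small decision on each accepted connection. This is only an illustrative sketch, not the actual SRS change: the enum and function names are hypothetical, and `empty_ip_ok` stands for the parsed value of the directive above.

```cpp
#include <string>

// Hypothetical outcome of accepting one client connection.
enum class AcceptAction { Serve, DropSilently, DropWithWarning };

// Decides what to do with an accepted connection, given the peer IP that was
// resolved for it ("" if resolution failed) and the empty_ip_ok setting.
inline AcceptAction classify_accepted_client(const std::string& peer_ip,
                                             bool empty_ip_ok) {
    if (!peer_ip.empty()) {
        return AcceptAction::Serve;  // normal client
    }
    // An empty IP is typical for a health check that already disconnected.
    return empty_ip_ok ? AcceptAction::DropSilently
                       : AcceptAction::DropWithWarning;
}
```

With the default `empty_ip_ok on;`, a health check connection would be dropped quietly; turning it off would bring back a warning for each such connection, which may still be useful when diagnosing unexpected empty-IP clients.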
Fixed