Origin cluster: coworker discovery occasionally fails to find the origin server nodes
limjoe opened this issue · comments
Description
After compiling and installing version 3.0 alpha4 (3.0.71), the log occasionally reports a connect error with code=3090. If this error occurs repeatedly, streaming becomes impossible.
Environment
- Operating System:
Ubuntu 16.04
- SRS Version:
3.0 alpha4 (3.0.71)
- Source Server A (192.100.20.20) Configuration File:
listen              1935;
max_connections     1000;
daemon              off;
srs_log_tank        console;
http_server {
    enabled         on;
    listen          18080;
    dir             ./objs/nginx/html;
}
http_api {
    enabled         on;
    listen          1985;
    crossdomain     on;
}
vhost push-pek-test.xxx.com {
    min_latency     on;
    tcp_nodelay     on;
    publish {
        mr          off;
    }
    cluster {
        mode            local;
        origin_cluster  on;
        coworkers       192.100.20.9:1985;
    }
    hls {
        enabled             on;
        hls_fragment        6;
        hls_window          30;
        hls_path            ./objs/nginx/html;
        hls_m3u8_file       [app]/[stream].m3u8;
        hls_ts_file         [app]/[stream]/[timestamp].ts;
        hls_cleanup         on;
        hls_nb_notify       64;
        hls_wait_keyframe   on;
    }
    http_hooks {
        enabled     on;
        on_hls      http://127.0.0.1:8086/v1/hls;
    }
}
- Source Server B (192.100.20.9) Configuration File:
listen              1935;
max_connections     1000;
daemon              off;
srs_log_tank        console;
http_server {
    enabled         on;
    listen          18080;
    dir             ./objs/nginx/html;
}
http_api {
    enabled         on;
    listen          1985;
    crossdomain     on;
}
vhost push-pek-test.xxx.com {
    min_latency     on;
    tcp_nodelay     on;
    publish {
        mr          off;
    }
    cluster {
        mode            local;
        origin_cluster  on;
        coworkers       192.100.20.20:1985;
    }
    hls {
        enabled             on;
        hls_fragment        6;
        hls_window          30;
        hls_path            ./objs/nginx/html;
        hls_m3u8_file       [app]/[stream].m3u8;
        hls_ts_file         [app]/[stream]/[timestamp].ts;
        hls_cleanup         on;
        hls_nb_notify       64;
        hls_wait_keyframe   on;
    }
    http_hooks {
        enabled     on;
        on_hls      http://127.0.0.1:8086/v1/hls;
    }
}
- The log of SRS is as follows:
[2019-12-17 17:56:19.580][Error][29866][3627][11] connect error code=3090 : service cycle : rtmp: stream service
: discover coworkers, url=http://192.100.20.9:1985/api/v1/clusters?vhost=push-pek-test.xxx.com&ip=push-pek-test.xxx.com&app=live&stream=IlYJs0kpFw7B&coworker=192.100.20.9:1985
: parse data {"code":0,"data":{"query":{"ip":"push-pek-test.xxx.com","vhost":"push-pek-test.xxx.com","app":"live","stream":"IlYJs0kpFw7B"},"origin":null}}
thread [3627]: do_cycle() [src/app/srs_app_rtmp_conn.cpp:210][errno=11]
thread [3627]: service_cycle() [src/app/srs_app_rtmp_conn.cpp:400][errno=11]
thread [3627]: playing() [src/app/srs_app_rtmp_conn.cpp:616][errno=11]
thread [3627]: discover_co_workers() [src/app/srs_app_http_hooks.cpp:453][errno=11](Resource temporarily unavailable)
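For reference, the response body in the error log above parses cleanly; the failure is that the origin field is null, meaning the coworker reported no such stream. A minimal check, using the exact JSON shown in the log:

```python
import json

# Response body copied from the discover_co_workers error log above.
body = ('{"code":0,"data":{"query":{"ip":"push-pek-test.xxx.com",'
        '"vhost":"push-pek-test.xxx.com","app":"live",'
        '"stream":"IlYJs0kpFw7B"},"origin":null}}')

data = json.loads(body)
print(data["code"])            # 0: the API request itself succeeded
print(data["data"]["origin"])  # None: the coworker has no such stream
```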
- Edge configuration file
listen              1936;
pid                 ./objs/srs.1936.pid;
max_connections     1000;
daemon              off;
srs_log_tank        console;
http_server {
    enabled         on;
    listen          18081;
    dir             ./objs/nginx/html;
}
http_api {
    enabled         on;
    listen          1986;
    crossdomain     on;
}
vhost play-pek-test.xxx.com {
    cluster {
        mode        remote;
        origin      192.100.20.9:1935 192.100.20.20:1935;
    }
    tcp_nodelay     on;
    min_latency     on;
    play {
        gop_cache       off;
        queue_length    10;
        mw_latency      100;
    }
    http_remux {
        enabled     on;
        mount       [vhost]/[app]/[stream].flv;
        hstrs       on;
    }
}
vhost push-pek-test.sensoro.com;
Reproduction
The steps to reproduce the bug are as follows:
- Start SRS and run
./objs/srs -c conf/A.conf
./objs/srs -c conf/B.conf
./objs/srs -c conf/edge.conf
- The bug reproduces; the key log output is as follows:
The error [Error][29866][3918][11] connect error code=3090 : service cycle : rtmp: stream service : discover coworkers occurs frequently.
Sometimes discovery succeeds and the following log is output:
http: cluster redirect 192.100.20.9:1935 ok, url=http://192.100.20.9:1985/api/v1/clusters?vhost=push-pek-test.xxx.com&ip=push-pek-test.xxx.com&app=live&stream=hJgeKBMPazyp&coworker=192.100.20.9:1985, response={"code":0,"data":{"query":{"ip":"push-pek-test.xxx.com","vhost":"push-pek-test.xxx.com","app":"live","stream":"hJgeKBMPazyp"},"origin":{"ip":"192.100.20.9","port":1935,"vhost":"push-pek-test.xxx.com","api":"192.100.20.9:1985","routers":["192.100.20.9:1985"]}}}
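The successful response can be parsed to extract the redirect target. A minimal sketch, using the JSON from the log above:

```python
import json

# Response body from the successful "cluster redirect" log above.
body = ('{"code":0,"data":{"query":{"ip":"push-pek-test.xxx.com",'
        '"vhost":"push-pek-test.xxx.com","app":"live",'
        '"stream":"hJgeKBMPazyp"},"origin":{"ip":"192.100.20.9",'
        '"port":1935,"vhost":"push-pek-test.xxx.com",'
        '"api":"192.100.20.9:1985","routers":["192.100.20.9:1985"]}}}')

origin = json.loads(body)["data"]["origin"]
print(f'{origin["ip"]}:{origin["port"]}')  # 192.100.20.9:1935
```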
Expected Behavior
The origin nodes should be able to discover each other, streaming should work normally, and latency should not grow with the number of origin nodes polled during discovery.
Currently, this error is reported continuously. I have deployed two environments, and in one of them the origin cluster does not report this error.
This problem may only occur when there are more than two servers in the origin cluster, for example:
- origin serverA: 19350/9090, with coworkers serverB/9091 and serverC/9092.
- origin serverB: 19351/9091, with coworkers serverA/9090 and serverC/9092.
- origin serverC: 19352/9092, with coworkers serverA/9090 and serverB/9091.
A third origin server configuration, origin.cluster.serverC.conf, has been added to the configuration files and can be used to reproduce this issue.
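Based on the port scheme above, a minimal sketch of what origin.cluster.serverC.conf might look like (the actual file ships with SRS; the ports, localhost addresses, and vhost here are assumptions following the example):

```
listen              19352;
max_connections     1000;
daemon              off;
srs_log_tank        console;
http_api {
    enabled         on;
    listen          9092;
}
vhost __defaultVhost__ {
    cluster {
        mode            local;
        origin_cluster  on;
        coworkers       127.0.0.1:9090 127.0.0.1:9091;
    }
}
```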
Start an edge server:
- Edge server: 1935, with origin serverB/19351/9091.
Reproduction steps:
- Push the stream to serverC/19352 and play it on the edge. The edge fetches the stream from serverB/9091 as its origin.
- ServerB first asks serverA/9090 whether it has the stream; at this point serverA returns origin: null and ServerB reports an error.
- To reproduce, debug serverB and observe this behavior.
When ServerB is in SrsRtmpConn::playing, that is, serving a play request from the edge, it first asks ServerA whether it has the stream, because ServerB does not have the stream itself:
http://127.0.0.1:9090/api/v1/clusters?vhost=__defaultVhost__&ip=127.0.0.1&app=live&stream=livestream&coworker=127.0.0.1:9090
On finding that ServerA has no stream, it immediately returns an error. However, origin:null clearly indicates that this coworker has no stream, so ServerB should continue to ask the next origin server.
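The suggested behavior can be sketched as follows. This is a hypothetical helper, not SRS's actual C++ code; coworker_responses stands in for the raw JSON bodies returned by each coworker's /api/v1/clusters endpoint:

```python
import json

def discover_origin(coworker_responses):
    """Walk the coworkers and return the first non-null origin.

    coworker_responses maps a coworker's "host:port" API address to the
    raw JSON body it returned. A body with "origin": null means that
    coworker has no such stream, so we keep polling instead of failing.
    """
    for coworker, body in coworker_responses.items():
        data = json.loads(body)
        origin = data.get("data", {}).get("origin")
        if origin is not None:
            return origin  # e.g. {"ip": ..., "port": ...}
    return None  # no coworker has the stream; only now report an error

# Example shaped like the responses in the logs above:
responses = {
    "192.100.20.9:1985": '{"code":0,"data":{"query":{},"origin":null}}',
    "192.100.20.20:1985": ('{"code":0,"data":{"query":{},'
                           '"origin":{"ip":"192.100.20.20","port":1935}}}'),
}
print(discover_origin(responses))  # the 192.100.20.20 origin, not an error
```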