Origin cluster: coworker discovery occasionally fails to find the origin server nodes
limjoe opened this issue · comments
Description
After compiling and installing version 3.0 alpha4 (3.0.71), the log occasionally reports a connect error with code=3090. If this error occurs repeatedly, streaming becomes impossible.
Environment
- Operating System:
Ubuntu 16.04
- SRS Version:
3.0 alpha4 (3.0.71)
- Source Server A (192.100.20.20) Configuration File:
listen              1935;
max_connections     1000;
daemon              off;
srs_log_tank        console;
http_server {
    enabled         on;
    listen          18080;
    dir             ./objs/nginx/html;
}
http_api {
    enabled         on;
    listen          1985;
    crossdomain     on;
}
vhost push-pek-test.xxx.com {
    min_latency     on;
    tcp_nodelay     on;
    publish {
        mr          off;
    }
    cluster {
        mode            local;
        origin_cluster  on;
        coworkers       192.100.20.9:1985;
    }
    hls {
        enabled             on;
        hls_fragment        6;
        hls_window          30;
        hls_path            ./objs/nginx/html;
        hls_m3u8_file       [app]/[stream].m3u8;
        hls_ts_file         [app]/[stream]/[timestamp].ts;
        hls_cleanup         on;
        hls_nb_notify       64;
        hls_wait_keyframe   on;
    }
    http_hooks {
        enabled     on;
        on_hls      http://127.0.0.1:8086/v1/hls;
    }
}
- Source Server B (192.100.20.9) Configuration File:
listen              1935;
max_connections     1000;
daemon              off;
srs_log_tank        console;
http_server {
    enabled         on;
    listen          18080;
    dir             ./objs/nginx/html;
}
http_api {
    enabled         on;
    listen          1985;
    crossdomain     on;
}
vhost push-pek-test.xxx.com {
    min_latency     on;
    tcp_nodelay     on;
    publish {
        mr          off;
    }
    cluster {
        mode            local;
        origin_cluster  on;
        coworkers       192.100.20.20:1985;
    }
    hls {
        enabled             on;
        hls_fragment        6;
        hls_window          30;
        hls_path            ./objs/nginx/html;
        hls_m3u8_file       [app]/[stream].m3u8;
        hls_ts_file         [app]/[stream]/[timestamp].ts;
        hls_cleanup         on;
        hls_nb_notify       64;
        hls_wait_keyframe   on;
    }
    http_hooks {
        enabled     on;
        on_hls      http://127.0.0.1:8086/v1/hls;
    }
}
- The log of SRS is as follows:
[2019-12-17 17:56:19.580][Error][29866][3627][11] connect error code=3090 : service cycle : rtmp: stream service
: discover coworkers, url=http://192.100.20.9:1985/api/v1/clusters?vhost=push-pek-test.xxx.com&ip=push-pek-test.xxx.com&app=live&stream=IlYJs0kpFw7B&coworker=192.100.20.9:1985
: parse data {"code":0,"data":{"query":{"ip":"push-pek-test.xxx.com","vhost":"push-pek-test.xxx.com","app":"live","stream":"IlYJs0kpFw7B"},"origin":null}}
thread [3627]: do_cycle() [src/app/srs_app_rtmp_conn.cpp:210][errno=11]
thread [3627]: service_cycle() [src/app/srs_app_rtmp_conn.cpp:400][errno=11]
thread [3627]: playing() [src/app/srs_app_rtmp_conn.cpp:616][errno=11]
thread [3627]: discover_co_workers() [src/app/srs_app_http_hooks.cpp:453][errno=11](Resource temporarily unavailable)
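For reference, the response body in the error log above parses cleanly; the failure is that the origin field is null, meaning the coworker reported no such stream. A minimal check, using the exact JSON shown in the log:

```python
import json

# Response body copied from the discover_co_workers error log above.
body = ('{"code":0,"data":{"query":{"ip":"push-pek-test.xxx.com",'
        '"vhost":"push-pek-test.xxx.com","app":"live",'
        '"stream":"IlYJs0kpFw7B"},"origin":null}}')

data = json.loads(body)
print(data["code"])            # 0: the API request itself succeeded
print(data["data"]["origin"])  # None: the coworker has no such stream
```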
- Edge configuration file
listen              1936;
pid                 ./objs/srs.1936.pid;
max_connections     1000;
daemon              off;
srs_log_tank        console;
http_server {
    enabled         on;
    listen          18081;
    dir             ./objs/nginx/html;
}
http_api {
    enabled         on;
    listen          1986;
    crossdomain     on;
}
vhost play-pek-test.xxx.com {
    cluster {
        mode        remote;
        origin      192.100.20.9:1935 192.100.20.20:1935;
    }
    tcp_nodelay     on;
    min_latency     on;
    play {
        gop_cache       off;
        queue_length    10;
        mw_latency      100;
    }
    http_remux {
        enabled     on;
        mount       [vhost]/[app]/[stream].flv;
        hstrs       on;
    }
}
vhost push-pek-test.sensoro.com;
Reproduction
The steps to reproduce the bug are as follows:
- Start SRS and run
./objs/srs -c conf/A.conf
./objs/srs -c conf/B.conf
./objs/srs -c conf/edge.conf
- The bug reproduces; the key log output is as follows:
The error [Error][29866][3918][11] connect error code=3090 : service cycle : rtmp: stream service : discover coworkers occurs frequently.
Sometimes discovery succeeds and the following log is output:
http: cluster redirect 192.100.20.9:1935 ok, url=http://192.100.20.9:1985/api/v1/clusters?vhost=push-pek-test.xxx.com&ip=push-pek-test.xxx.com&app=live&stream=hJgeKBMPazyp&coworker=192.100.20.9:1985, response={"code":0,"data":{"query":{"ip":"push-pek-test.xxx.com","vhost":"push-pek-test.xxx.com","app":"live","stream":"hJgeKBMPazyp"},"origin":{"ip":"192.100.20.9","port":1935,"vhost":"push-pek-test.xxx.com","api":"192.100.20.9:1985","routers":["192.100.20.9:1985"]}}}
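The successful response can be parsed to extract the redirect target. A minimal sketch, using the JSON from the log above:

```python
import json

# Response body from the successful "cluster redirect" log above.
body = ('{"code":0,"data":{"query":{"ip":"push-pek-test.xxx.com",'
        '"vhost":"push-pek-test.xxx.com","app":"live",'
        '"stream":"hJgeKBMPazyp"},"origin":{"ip":"192.100.20.9",'
        '"port":1935,"vhost":"push-pek-test.xxx.com",'
        '"api":"192.100.20.9:1985","routers":["192.100.20.9:1985"]}}}')

origin = json.loads(body)["data"]["origin"]
print(f'{origin["ip"]}:{origin["port"]}')  # 192.100.20.9:1935
```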
Expected Behavior
The origin nodes should be able to discover each other, streaming should work normally, and latency should not grow with the number of origin nodes polled during discovery.
Currently, this error is reported continuously. I have deployed two environments, and in one of them the origin cluster does not report this error.
This problem may only occur when there are more than two servers in the origin cluster, for example:
- origin serverA: 19350/9090, with coworkers serverB/9091 and serverC/9092.
- origin serverB: 19351/9091, with coworkers serverA/9090 and serverC/9092.
- origin serverC: 19352/9092, with coworkers serverA/9090 and serverB/9091.
A third origin server configuration, origin.cluster.serverC.conf, has been added to the configuration files and can be used to reproduce this issue.
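Based on the port scheme above, a minimal sketch of what origin.cluster.serverC.conf might look like (the actual file ships with SRS; the ports, localhost addresses, and vhost here are assumptions following the example):

```
listen              19352;
max_connections     1000;
daemon              off;
srs_log_tank        console;
http_api {
    enabled         on;
    listen          9092;
}
vhost __defaultVhost__ {
    cluster {
        mode            local;
        origin_cluster  on;
        coworkers       127.0.0.1:9090 127.0.0.1:9091;
    }
}
```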
Start an edge server:
- Edge server: 1935, with origin serverB/19351/9091.
Reproduction steps:
- Push the stream to serverC/19352 and play it on the edge. The edge fetches the stream from serverB/9091 as its origin.
- ServerB first asks serverA/9090 whether it has the stream; at this point serverA returns origin: null and ServerB reports an error.
- To reproduce, debug serverB and observe this behavior.
When ServerB is in SrsRtmpConn::playing, that is, serving a play request from the edge, it first asks ServerA whether it has the stream, because ServerB does not have the stream itself:
http://127.0.0.1:9090/api/v1/clusters?vhost=__defaultVhost__&ip=127.0.0.1&app=live&stream=livestream&coworker=127.0.0.1:9090
On finding that ServerA has no stream, it immediately returns an error. However, origin:null clearly indicates that this coworker has no stream, so ServerB should continue to ask the next origin server.
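The suggested behavior can be sketched as follows. This is a hypothetical helper, not SRS's actual C++ code; coworker_responses stands in for the raw JSON bodies returned by each coworker's /api/v1/clusters endpoint:

```python
import json

def discover_origin(coworker_responses):
    """Walk the coworkers and return the first non-null origin.

    coworker_responses maps a coworker's "host:port" API address to the
    raw JSON body it returned. A body with "origin": null means that
    coworker has no such stream, so we keep polling instead of failing.
    """
    for coworker, body in coworker_responses.items():
        data = json.loads(body)
        origin = data.get("data", {}).get("origin")
        if origin is not None:
            return origin  # e.g. {"ip": ..., "port": ...}
    return None  # no coworker has the stream; only now report an error

# Example shaped like the responses in the logs above:
responses = {
    "192.100.20.9:1985": '{"code":0,"data":{"query":{},"origin":null}}',
    "192.100.20.20:1985": ('{"code":0,"data":{"query":{},'
                           '"origin":{"ip":"192.100.20.20","port":1935}}}'),
}
print(discover_origin(responses))  # the 192.100.20.20 origin, not an error
```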