ossrs / srs

SRS is a simple, high-efficiency, real-time video server supporting RTMP, WebRTC, HLS, HTTP-FLV, SRT, MPEG-DASH, and GB28181.

Home Page: https://ossrs.io


During edge pull, the system often encounters the errors ret=1018 (Device or resource busy) and ret=1018 (No such file or directory).

gqf2008 opened this issue

Log

[2015-10-26 13:36:08.153][error][13948][3998][16] http post on_play uri failed. client_id=3998, url=http://127.0.0.1/lcs/api/rtmp/on_play/shnh-edge2-live1.evideocloud.net, request={"action":"on_play","client_id":3998,"ip":"222.88.95.177","vhost":"live1.evideocloud.net","app":"live","stream":"dxhdbh__gBML4E6B40Lv","pageUrl":""}, response=, code=0, ret=1018(Device or resource busy)
[2015-10-26 13:36:08.153][error][13948][3998][16] hook client on_play failed. url=http://127.0.0.1/lcs/api/rtmp/on_play/shnh-edge2-live1.evideocloud.net, ret=1018(Device or resource busy)
[2015-10-26 13:36:08.153][error][13948][3998][16] http hook on_play failed. ret=1018(Device or resource busy)
[2015-10-26 13:36:08.153][error][13948][3998][16] stream service cycle failed. ret=1018(Device or resource busy)
[2015-10-26 13:36:08.154][error][13948][3998][2] connect to server error. ip=127.0.0.1, port=80, ret=1018(No such file or directory)
[2015-10-26 13:36:08.342][error][13948][4000][104] rtmp handshake failed. ret=1008(Connection reset by peer)
[2015-10-26 13:36:09.032][error][13948][3990][2] connect to server error. ip=127.0.0.1, port=80, ret=1018(No such file or directory)
[2015-10-26 13:36:09.032][error][13948][3990][2] connect to server error. ip=127.0.0.1, port=80, ret=1018(No such file or directory)
[2015-10-26 13:36:09.251][error][13948][4002][2] connect to server error. ip=127.0.0.1, port=80, ret=1018(No such file or directory)
[2015-10-26 13:36:09.251][error][13948][4002][16] http post on_connect uri failed. client_id=4002, url=http://127.0.0.1/lcs/api/rtmp/on_connect/shnh-edge2-live1.evideocloud.net, request={"action":"on_connect","client_id":4002,"ip":"222.186.130.3","vhost":"live1.evideocloud.net","app":"live","tcUrl":"rtmp://live1.evideocloud.net:1935/live","pageUrl":""}, response=, code=0, ret=1018(Device or resource busy)
[2015-10-26 13:36:09.251][error][13948][4002][16] hook client on_connect failed. url=http://127.0.0.1/lcs/api/rtmp/on_connect/shnh-edge2-live1.evideocloud.net, ret=1018(Device or resource busy)
[2015-10-26 13:36:09.251][error][13948][4002][16] check vhost failed. ret=1018(Device or resource busy)
[2015-10-26 13:36:10.341][error][13948][3973][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:36:12.165][error][13948][4020][104] rtmp handshake failed. ret=1008(Connection reset by peer)
[2015-10-26 13:36:13.365][error][13948][4022][104] rtmp handshake failed. ret=1008(Connection reset by peer)
[2015-10-26 13:36:14.103][error][13948][4008][2] connect to server error. ip=127.0.0.1, port=80, ret=1018(No such file or directory)
[2015-10-26 13:36:14.103][error][13948][4008][2] connect to server error. ip=127.0.0.1, port=80, ret=1018(No such file or directory)
[2015-10-26 13:36:14.572][error][13948][4005][2] connect to server error. ip=127.0.0.1, port=80, ret=1018(No such file or directory)
[2015-10-26 13:36:14.572][error][13948][4005][2] connect to server error. ip=127.0.0.1, port=80, ret=1018(No such file or directory)
[2015-10-26 13:36:14.614][error][13948][4024][2] connect to server error. ip=127.0.0.1, port=80, ret=1018(No such file or directory)
[2015-10-26 13:36:14.614][error][13948][4024][16] http post on_connect uri failed. client_id=4024, url=http://127.0.0.1/lcs/api/rtmp/on_connect/shnh-edge2-live1.evideocloud.net, request={"action":"on_connect","client_id":4024,"ip":"222.186.130.3","vhost":"live1.evideocloud.net","app":"live","tcUrl":"rtmp://live1.evideocloud.net:1935/live","pageUrl":""}, response=, code=0, ret=1018(Device or resource busy)
[2015-10-26 13:36:14.614][error][13948][4024][16] hook client on_connect failed. url=http://127.0.0.1/lcs/api/rtmp/on_connect/shnh-edge2-live1.evideocloud.net, ret=1018(Device or resource busy)
[2015-10-26 13:36:14.614][error][13948][4024][16] check vhost failed. ret=1018(Device or resource busy)
[2015-10-26 13:36:15.211][error][13948][3914][62] rtmp handshake failed. ret=1011(Timer expired)
[2015-10-26 13:36:15.256][error][13948][4026][2] connect to server error. ip=127.0.0.1, port=80, ret=1018(No such file or directory)
[2015-10-26 13:36:15.256][error][13948][4026][16] http post on_connect uri failed. client_id=4026, url=http://127.0.0.1/lcs/api/rtmp/on_connect/shnh-edge2-live1.evideocloud.net, request={"action":"on_connect","client_id":4026,"ip":"222.186.130.3","vhost":"live1.evideocloud.net","app":"live","tcUrl":"rtmp://live1.evideocloud.net:1935/live","pageUrl":""}, response=, code=0, ret=1018(Device or resource busy)
[2015-10-26 13:36:15.256][error][13948][4026][16] hook client on_connect failed. url=http://127.0.0.1/lcs/api/rtmp/on_connect/shnh-edge2-live1.evideocloud.net, ret=1018(Device or resource busy)


[2015-10-26 13:38:13.494][error][13948][4448][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:13.565][error][13948][4452][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:13.767][error][13948][4393][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:13.783][error][13948][4374][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:13.855][error][13948][4385][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:14.494][error][13948][4448][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:14.565][error][13948][4452][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:14.767][error][13948][4393][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:14.783][error][13948][4374][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:14.855][error][13948][4385][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:15.494][error][13948][4448][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:15.565][error][13948][4452][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:15.767][error][13948][4393][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:15.783][error][13948][4374][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)
[2015-10-26 13:38:15.855][error][13948][4385][2] connect to server error. ip=192.168.190.34, port=1935, ret=1018(No such file or directory)


#define ERROR_ST_CONNECT                    1018

This error means "cannot connect to the server".
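Note that the parenthesized text is not part of the SRS error code itself: judging from the log format, the logger appends strerror(errno) after the ret value, and the bracketed [16]/[2] field in the log header matches errno (EBUSY=16, ENOENT=2). That is why the same ret=1018 shows up with different system messages. A minimal illustration (not SRS code):

#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    int ret = 1018; // ERROR_ST_CONNECT: SRS's own error code.
    // The same ret prints differently depending on the last syscall's errno.
    errno = EBUSY;
    printf("connect failed. ret=%d(%s)\n", ret, strerror(errno)); // ret=1018(Device or resource busy)
    errno = ENOENT;
    printf("connect failed. ret=%d(%s)\n", ret, strerror(errno)); // ret=1018(No such file or directory)
    return 0;
}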


Please specify the version, environment, and reproduction method.


Version: 2.0.195
Environment: CentOS 6.2 64 bit

Reproduction method:

  1. One publishing server handles publishing only and allows only two edge servers to pull streams. The publishing server hooks only the connect, publish, and close callbacks.
  2. When a playback request arrives, the two edge servers pull the stream from the publishing server. The edge servers hook only the connect, play, and close callbacks (see the config sketch after this list).
  3. At first we suspected a problem in our implementation of the hook API, so the hooks were disabled entirely; however, the logs showed the same problem when connecting to port 1935 of the publishing server.
  4. The two edge servers sit behind an F5 (triangle transmission) that performs port detection on port 1935.
  5. The number of concurrent connections on the edge servers does not exceed 100.
  6. The problem can be reproduced by repeatedly clicking play and stop in the VLC player; VLC then reports that it cannot connect to the backend.
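For reference, a sketch of the hook setup described in steps 1 and 2, using SRS's http_hooks directives; the vhost name is taken from the logs, and the callback URLs are elided:

# On the publishing (origin) server: hook connect/publish/close only.
vhost live1.evideocloud.net {
    http_hooks {
        enabled     on;
        on_connect  http://127.0.0.1/lcs/api/rtmp/on_connect/...;
        on_publish  http://127.0.0.1/lcs/api/rtmp/on_publish/...;
        on_close    http://127.0.0.1/lcs/api/rtmp/on_close/...;
    }
}

# On each edge server: hook connect/play/close only.
vhost live1.evideocloud.net {
    http_hooks {
        enabled     on;
        on_connect  http://127.0.0.1/lcs/api/rtmp/on_connect/...;
        on_play     http://127.0.0.1/lcs/api/rtmp/on_play/...;
        on_close    http://127.0.0.1/lcs/api/rtmp/on_close/...;
    }
}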


According to the technical documents I consulted, after a socket is set to non-blocking, calling recv before any data has arrived returns this error; the program needs to ignore the error and keep looping to read. I hope this information helps in fixing the issue. Thank you.
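For reference, the standard pattern being described: on a non-blocking socket, recv() fails with EAGAIN/EWOULDBLOCK until data arrives, and the caller retries rather than treating it as fatal. A minimal sketch (not SRS code; a real program would wait in poll() instead of spinning):

#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

// Hypothetical helper: read from a non-blocking socket,
// retrying on "no data yet" instead of failing.
ssize_t recv_retry(int fd, void* buf, size_t len) {
    for (;;) {
        ssize_t n = recv(fd, buf, len, 0);
        if (n >= 0) {
            return n; // data received, or 0 on orderly shutdown.
        }
        if (errno == EINTR) {
            continue; // interrupted by a signal: retry.
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            continue; // no data yet: retry (ideally after poll()).
        }
        return -1; // genuine error.
    }
}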


Well, let me see. It seems to be saying that it cannot connect to your HTTP callback server. It shouldn't be a recv problem at this point.


After disabling the hooks it is still the same: connecting to the publishing server also reports 1018 (No such file or directory).


Is the publishing server SRS?


Yes.


When a connection on the edge server hits the 1018 (No such file or directory) error, clients can still connect to the edge SRS and play, but with frame drops and freezes.


Addendum: I am using version 2.0.197 and have configured SRS as an edge node:

mode remote;

The origin stream itself can always be played normally. However, when accessing the SRS edge node, there is a more than 50% chance of an error if the player disconnects and reconnects immediately. If the disconnect lasts more than 10 seconds, there is no error.
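For reference, a minimal sketch of the edge vhost this presumably corresponds to (SRS 2.0 directives; the origin address is taken from the logs above):

vhost live1.evideocloud.net {
    mode    remote;
    origin  192.168.190.34:1935;
}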

Upon investigating the code, the error occurs in SrsEdgeIngester::connect_server -> srs_socket_connect -> st_connect -> st_netfd_poll -> st_poll, and the errno returned is ENOENT (No such file or directory). Packet capture shows that network communication is normal.

The problem seems to arise when the origin connection is closed during the few-second no-viewer window after the last client disconnects. If a new playback of the stream starts for the first time within this window, the origin connection is prone to failure, and the failure then repeats in a loop; different vhosts are affected as well. This problem does not occur in version 1.0.


Meanwhile, the same stream of the same app had two edge origin connections at the same time; the following log appeared twice:

edge pull connected

After that log, a core dump could occur in the origin connection's handshake, at:

if (hs_bytes->s0s1s2[0] != 0x03) {

with the call stack:

SrsComplexHandshake::handshake_with_server
SrsRtmpClient::handshake
SrsEdgeIngester::cycle()

Only the location was recorded at the time and the core file was not saved. This should be the same issue that causes the edge-to-origin connection failures described above.


Having two origin-fetch connections for the same stream will definitely cause a crash.


To be precise: if the edge node sees repeated quick disconnects and reconnects on the same stream, from the third attempt onward it enters a looping error while pulling the stream from origin.
If it is a different stream, there are several ENOENT (No such file or directory) origin-fetch failures, after which things return to normal.
Occasionally, when two origin fetches for the same stream happen at once, the program crashes.
It should be an issue with the origin-fetch control logic.


The Dragon God said this problem is a thread issue. I have assigned this bug to him.


fixed in 2.0.199

Version 2.0.199 indeed fixed the aforementioned issues, but a new, less frequent problem emerged: when multiple edge streams disconnect almost simultaneously, closing the SRS thread causes a core dump.
#0 0x00000000004b3eb0 in internal::SrsThread::thread_cycle (this=0x3d292d0) at src/app/srs_app_thread.cpp:239
#1 0x00000000004b3f0f in internal::SrsThread::thread_fun (arg=0x3d292d0) at src/app/srs_app_thread.cpp:247
#2 0x000000000053703e in _st_thread_main () at sched.c:327
#3 0x00000000005377d8 in st_thread_create (start=0x537f52 <st_usleep+202>, arg=0x7fb94f755b40, joinable=32697, stk_size=1285439024) at sched.c:591
#4 0x00000000004b3959 in internal::SrsThread::start (this=<error reading variable: Cannot access memory at address 0x7ffffffd9>) at src/app/srs_app_thread.cpp:109


@zhengfl Please summon the Dragon God.


May I ask: if I download the latest 2.0 release version now, will this problem still exist?

You can try the latest version, 2.0.209: https://github.com/ossrs/srs/tree/2.0release#history.


It seems like it was fixed in 2.0.203.


The ENOENT should come from a thread running away, with errno set by another thread.
Once this thread-synchronization issue is resolved, there should be no more problems.


This happens because close(stfd) does not actually close the fd correctly, which forced SRS to disable a feature: disconnecting all client connections when deleting a vhost. That feature triggers the ENOENT issue.

The fd must not be blocked in a read or write while it is being closed.
In other words, an fd may only be closed by one thread, and the thread using it must finish its cleanup before the fd is closed.


2.0.211 fixed

Fly FD

"Fly FD" refers to an fd escaping control because it was closed improperly. When an fd flies away, it can cause memory and fd leaks, or even fds being mysteriously closed. Therefore an fd must not fly: the return value of close(stfd) must be 0, which we can guarantee with an assert.

How can we ensure that close(stfd) succeeds? When stfd is closed, it must not be waiting in a read or write. Consider a single thread that reads and writes stfd:

int osfd = ...; // create and open osfd.
st_netfd_t stfd = st_netfd_open_socket(osfd);
st_read(stfd, ...);
st_write(stfd, ...);
assert(0 == st_netfd_close(stfd)); // safely close it.

A single thread cannot be blocked in a read or write at the moment it closes stfd, so this is safe. With multiple threads, however, for example one thread receiving data while another sends and processes, the situation changes when they need to exit. Suppose we create a separate receiving thread:

int osfd = ...;
st_netfd_t stfd = st_netfd_open_socket(osfd);

// Pseudocode: spawn a receiving thread that blocks in st_read.
st_thread_t tid = st_thread_create(function() {
    st_read(stfd, ...); // blocks here.
});

st_write(stfd, ...);
assert(0 == st_netfd_close(stfd)); // fails and crashes: stfd is being read (EBUSY).

If the receiving thread is still running, stfd is in the EBUSY state and cannot be closed. To close it safely, the thread must be interrupted first:

st_thread_interrupt(tid);
assert(0 == st_netfd_close(stfd)); // safely close stfd.

Therefore, in SRS, whenever a thread is reading or writing stfd, that thread must be stopped before stfd is closed; for example, in the forwarder:



void SrsForwarder::on_unpublish()
{
    // @remark we must stop the thread then safely close the fd.
    pthread->stop();
    sdk->close();
}

If the order is reversed, that is, stfd is closed first and the thread is stopped afterwards, it will crash.
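For contrast, a sketch of that reversed, unsafe order, assuming the same members as above:

void SrsForwarder::on_unpublish()
{
    // WRONG: closes the fd while the receive thread may still be
    // blocked in st_read/st_write on it; st_netfd_close() then
    // fails with EBUSY and the assert aborts the process.
    sdk->close();
    pthread->stop();
}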


st_thread_interrupt interrupts st_read and st_write.

ssize_t st_read(_st_netfd_t *fd, void *buf, size_t nbyte, st_utime_t timeout)
{
    ssize_t n;

    while ((n = read(fd->osfd, buf, nbyte)) < 0) {
        if (errno == EINTR) { // A system-call interruption; ignore it and retry.
            continue;
        }

        if (!_IO_NOT_READY_ERROR) {
            return -1;
        }

        /* Wait until the socket becomes readable */
        if (st_netfd_poll(fd, POLLIN, timeout) < 0) {
            return -1; // While the thread blocks here, st_thread_interrupt makes this return -1 (EINTR).
        }
    }

    return n;
}

If a system interrupt occurs during the read system call, ST retries it; that is not a problem.
If the thread is in a blocked state (waiting in poll for the fd to become readable), calling st_thread_interrupt makes the poll return -1 with errno=EINTR, so the blocked st_read/st_write exits and the fd can then be safely closed.
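Putting this together, a sketch of the resulting shutdown pattern (names hypothetical): the receiver treats a failed st_read after an interrupt as a stop request, and only afterwards does the owner close the fd.

// Receiver coroutine: exits when st_thread_interrupt() is called,
// because the blocked st_read() then returns -1 with errno=EINTR.
void* receive_cycle(void* arg) {
    st_netfd_t stfd = (st_netfd_t)arg;
    char buf[4096];
    for (;;) {
        ssize_t n = st_read(stfd, buf, sizeof(buf), ST_UTIME_NO_TIMEOUT);
        if (n <= 0) {
            break; // interrupted (EINTR) or socket closed: stop receiving.
        }
        // ... consume n bytes ...
    }
    return NULL;
}

// Owner thread: interrupt the receiver, join it, then close safely.
// st_thread_interrupt(tid);
// st_thread_join(tid, NULL);
// assert(0 == st_netfd_close(stfd));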



Brother Winlin:
I just looked at your modification of srs_close_stfd(st_netfd_t& stfd). After compiling and testing, the program crashed at srs_assert(err != -1), with err equal to -1.
The reason is that forwarder->on_unpublish() is called twice; on the second call, the underlying close() inside srs_close_stfd returns -1.

void SrsSource::destroy_forwarders()
{
    std::vector<SrsForwarder*>::iterator it;
    for (it = forwarders.begin(); it != forwarders.end(); ++it) {
        SrsForwarder* forwarder = *it;
        forwarder->on_unpublish();
        srs_freep(forwarder); // The SrsForwarder destructor also calls on_unpublish() again.
    }
    forwarders.clear();
}

The stack trace is as follows:
First time: err=0

#0  st_netfd_close (fd=0x928670) at io.c:183
#1  0x00000000004c02d6 in srs_close_stfd (stfd=@0x9081a0) at src/app/srs_app_st.cpp:247
#2  0x00000000004a811c in SrsForwarder::close_underlayer_socket (this=0x908180)
    at src/app/srs_app_forward.cpp:336
#3  0x00000000004a70ee in SrsForwarder::on_unpublish (this=0x908180) at src/app/srs_app_forward.cpp:156
#4  0x00000000004964fa in SrsSource::destroy_forwarders (this=0x904a40)
    at src/app/srs_app_source.cpp:2771
#5  0x00000000004947d3 in SrsSource::on_unpublish (this=0x904a40, is_edge=false)
    at src/app/srs_app_source.cpp:2373

Second time: err=-1

#0  st_netfd_close (fd=0x928670) at io.c:183
#1  0x00000000004c02d6 in srs_close_stfd (stfd=@0x9081a0) at src/app/srs_app_st.cpp:247
#2  0x00000000004a811c in SrsForwarder::close_underlayer_socket (this=0x908180)
    at src/app/srs_app_forward.cpp:336
#3  0x00000000004a70ee in SrsForwarder::on_unpublish (this=0x908180) at src/app/srs_app_forward.cpp:156
#4  0x00000000004a67bd in SrsForwarder::~SrsForwarder (this=0x908180, __in_chrg=<value optimized out>)
    at src/app/srs_app_forward.cpp:71
#5  0x00000000004a69ee in SrsForwarder::~SrsForwarder (this=0x908180, __in_chrg=<value optimized out>)
    at src/app/srs_app_forward.cpp:80
#6  0x000000000049651f in SrsSource::destroy_forwarders (this=0x904a40)
    at src/app/srs_app_source.cpp:2772
#7  0x00000000004947d3 in SrsSource::on_unpublish (this=0x904a40, is_edge=false)
    at src/app/srs_app_source.cpp:2373
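One way to avoid this double close, sketched purely as an assumption (the actual fix is in the commit referenced below): have srs_close_stfd null the caller's handle so that a second call becomes a no-op.

void srs_close_stfd(st_netfd_t& stfd)
{
    if (stfd) {
        // The close must succeed, otherwise the fd "flies".
        int err = st_netfd_close(stfd);
        srs_assert(err != -1);
        // Null the caller's handle so a repeated close is a no-op.
        stfd = NULL;
    }
}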


fixed in 49853d2