daurnimator / lua-http

HTTP Library for Lua. Supports HTTP(S) 1.0, 1.1 and 2.0; client and server.

Home Page: https://daurnimator.github.io/lua-http/


Websocket internally errors "/usr/local/share/lua/5.3/http/websocket.lua:282: read: Connection timed out"

Mehgugs opened this issue · comments

This happens when receiving payloads from the websocket.

Shard-0 has stopped receiving:
/usr/local/share/lua/5.3/http/websocket.lua:282: read: Connection timed out
stack traceback:
	[C]: in function 'assert'
	/usr/local/share/lua/5.3/http/websocket.lua:282: in function 'http.websocket.read_frame'
	/usr/local/share/lua/5.3/http/websocket.lua:394: in function </usr/local/share/lua/5.3/http/websocket.lua:386>
	[C]: in function 'xpcall'
	...s/marcus/documents/github/novus/lua/novus/shard/init.lua:183: in function 'novus.shard.messages'
	[C]: in function 'xpcall'
	.../marcus/documents/github/novus/lua/novus/client/init.lua:42: in function <.../marcus/documents/github/novus/lua/novus/client/init.lua:41>

This happens regardless of whether a timeout is passed to the socket.

The weird thing is that when I request deflated messages (per-message inflate/deflate inside my handling loop), I don't get this issue at all. It only happens when I request raw JSON payloads. I'm using scm-0 from master (8ab5c30) to get the fix for the timeout issues in the release rock.

This might be due to the payload size: heuristically, I know the error is triggered by the biggest payload I'm likely to receive. However, this is not an issue with other clients for the API I'm connecting to.

Could you run with the environment variable SOCKET_DEBUG=1 set?
I'm wondering if your server is actually sending the data.

With SOCKET_DEBUG=1 it works correctly and logs the payloads I expect to see; it only errors without the debug variable set.

it only errors without the debug variable set.

That's extremely odd. The debug variable should only add logging, not change any behaviour.
Maybe there is a race condition somewhere, and the extra time spent logging is changing the winner?

That seems likely. Is there a way for me to collect the debug data into a file?

is there a way for me to collect the debug data into a file?

Just pipe/redirect it?

https://ghostbin.com/paste/a6czk You can see it cut off reading here. Upon further inspection, it seems to be crashing consistently; I think I jumped the gun a little when it managed to reconnect previously.

I added an xpcall handler to inspect the internals; here is an example of a frame which errors:

{
    "RSV2": false,
    "RSV3": false,
    "RSV1": false,
    "MASK": false,
    "length": 12964,
    "FIN": true,
    "opcode": 1
}

Any idea what could cause this? When this happens my receive loop breaks down and I don't know if I can handle it without reconnecting.

Your pastebin link has now expired. Could you repost it? (consider using a gist?)

I'll try to get some more data (I've lost the old log). What seems to happen is that the websocket cannot read the frame with a 0 timeout (https://github.com/daurnimator/lua-http/blob/master/http/websocket.lua#L282). I've made this line retry after a cqueues.sleep(), and it can take up to 5 attempts to read.

I should also note that I've not found any further clues on how to reproduce it aside from connecting without requesting compression. I think it's possibly some kind of race condition?

Why should these reads have a 0 timeout, exactly? In practice the data isn't readable without waiting.

if frame.MASK then
	local key = assert(sock:xread(4, "b", 0))
	frame.key = { key:byte(1, 4) }
end

do
	local data = assert(sock:xread(frame.length, "b", 0))
	if frame.MASK then
		frame.data = apply_mask(data, frame.key)
	else
		frame.data = data
	end
end

EDIT:

This is a very naive solution I have been using to try and get a handle on what happens:

-- upvalues assumed by the snippet; requires added here so it is self-contained
local cqueues = require "cqueues"
local promise = require "cqueues.promise"
local ce = require "cqueues.errno"
local monotime, sleep = cqueues.monotime, cqueues.sleep
local max = math.max
local used_tries = 0

-- retry while the deadline timer is still pending (or indefinitely if no deadline was given)
local function should_retry(timer) if timer then return timer:status() == "pending" else return true end end

local function read_again(deadline, socket, ...)
    local tries = 0
    local data, msg, code = socket:xread(...)
    local timeout = deadline and deadline - monotime()
    local timer = timeout and promise.new(sleep, timeout)
    while should_retry(timer) and (not data or code == ce.ETIMEDOUT) do
        tries = tries + 1
        sleep() -- yield to the cqueues scheduler before retrying
        data, msg, code = socket:xread(...)
    end
    used_tries = max(used_tries, tries)
    return data, msg, code
end

Calling this function in place of the xreads and logging used_tries:

websocket.read_again used 5324 retries.

This makes me think read_frame is not honouring the deadline somewhere?

Why should these reads have a 0 timeout, exactly? In practice the data isn't readable without waiting.

There is a preceding :fill that will ensure that there are enough bytes waiting in the socket buffer.

When the :fill fails, we poll for readability, and then tail call ourselves to try again:

return read_frame(sock, deadline)
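
To spell out that pattern, here is a rough sketch (not the library's actual code; read_exactly and its shape are made up for illustration) of how a fill-then-read is intended to behave:

local cqueues = require "cqueues"
local ce = require "cqueues.errno"
local monotime = cqueues.monotime

local function read_exactly(sock, n, deadline)
	-- ask the socket to buffer n bytes without blocking
	local ok, err, errno = sock:fill(n, 0)
	if not ok then
		local timeout = deadline and (deadline - monotime())
		if errno == ce.ETIMEDOUT and (timeout == nil or timeout > 0) then
			cqueues.poll(sock, timeout) -- wait until readable (or the deadline passes)
			return read_exactly(sock, n, deadline) -- tail call and retry, as read_frame does
		end
		return nil, err, errno
	end
	-- the buffer now holds at least n bytes, so a 0-timeout read does not block
	return sock:xread(n, "b", 0)
end

Under that scheme the 0-timeout xread should never have to wait, because :fill() has already buffered the required number of bytes.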

Are you still having problems?
If I'm unable to replicate your issue I'll have to close it.

Hello, I'm so sorry I've not been more proactive with this. I'll try to get a reproducible example working, and as far as I'm aware, yes, there's still something strange happening (it is mostly mitigated right now by requesting compressed messages from the websocket, but it is by no means fixed).

I think being sent a large enough message over websocket is enough to trigger this.
EDIT:

Mon May 13 14:32:22 2019 INF  uncompressed payload size 6675863 bytes

It is usually this payload which causes this error when it is received by my client without compression.

Could you please come up with a test case that I can run?

I also have this issue.
With litcord, run the sample.
It will say * BOT NAME ready. and then error with /usr/local/share/lua/5.3/http/websocket.lua:282: read: Connection timed out

With litcord, run the sample.
It will say * BOT NAME ready. and then error with /usr/local/share/lua/5.3/http/websocket.lua:282: read: Connection timed out

Could you write a reduced test case?
Trying to run that example I just get "Authentication failed".

You need to get a bot token by making a Discord application.

Very odd.
I tried it many times today and it usually times out but sometimes works.
Seems to be Discord having a very low timeout.

Someone more proficient with lua-http websockets might try setting up a simple server that sends a payload as large as the one the server (Discord, in both cases here) sends.
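
For anyone who wants to try that, here is a minimal sketch of such a server built on lua-http's http.server and http.websocket modules (the address, port, and 100 KiB filler payload are arbitrary stand-ins, not values taken from any setup above):

local http_server = require "http.server"
local websocket = require "http.websocket"

local server = assert(http_server.listen {
	host = "127.0.0.1";
	port = 8000;
	onstream = function(_, stream)
		local headers = assert(stream:get_headers())
		local ws = websocket.new_from_stream(stream, headers)
		if ws then
			assert(ws:accept())
			-- one large text frame; 100 KiB of filler as a stand-in payload
			assert(ws:send(("x"):rep(100 * 1024), "text"))
			assert(ws:close())
		end
	end;
})

assert(server:loop())

-- a matching client would be e.g.:
--   local ws = websocket.new_from_uri("ws://127.0.0.1:8000")
--   assert(ws:connect())
--   print(#assert(ws:receive()))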


I'm getting the same error. It seems to happen when there's a relatively large amount of data being received (in this case, frame.length was around 17 KB).
Changing the timeout from 0 to 1 on line 282 (local data = assert(sock:xread(frame.length, "b", 0)) -> local data = assert(sock:xread(frame.length, "b", 1))) seems to fix it, but I'm not sure whether that works all the time, and maybe it causes other problems.

I got the same error.
frame => { FIN = true, MASK = true, RSV1 = false, RSV2 = false, RSV3 = false, length = 1265, opcode = 1 }
The error occurred at: assert(sock:xread(frame.length, "b", 0))

My client follows RFC 7692 (permessage-deflate); could that be relevant here?

I have the same problem with websocket.lua:282: read: Connection timed out. I run the receive loop inside a cqueues container.

Hi there, I ran into this issue; here's a small test case to reproduce it. I just use a public echo server to send a large-ish string of data (about 22 KB) and check that I receive it back.

local http_websocket = require'http.websocket'

-- build a ~22 KB string by repeatedly doubling an 11-byte seed
local large_str = 'some text! '
for i = 1, 11 do
  large_str = large_str .. large_str
end

-- connect to a public echo server, send the string, and expect it echoed back
local client = http_websocket.new_from_uri('wss://echo.websocket.org:443')
assert(client:connect())
assert(client:send(large_str, 'text'))
local data = assert(client:receive())
assert(string.len(data) == string.len(large_str))

I have gained an interest in fixing this.


I've also been getting that exact same error. I haven't looked into it closely yet, but average response sizes vary around 70 KB and sometimes get larger.
I'm not sure if it's just size here, as I haven't yet tested whether I can consistently reproduce the error with a specific response, but I thought I'd leave this here as another data point.

Any updates on this? I get the same issue. The server is in fact sending the packets, but the websocket interface simply times out. However, if I step through this slowly while debugging, I actually get the expected, correct result which leads me to think that this is some sort of race condition.

This issue, like the others above, only happens with large-ish payloads.

Thanks

Would you be able to write some code that launches a websocket server and reproduces this issue using a sample packet? It's really unreliable for me -- sometimes it happens, sometimes it doesn't.

Any updates on this? I get the same issue. The server is in fact sending the packets, but the websocket interface simply times out. However, if I step through this slowly while debugging, I actually get the expected, correct result which leads me to think that this is some sort of race condition.

This issue, like the others above, only happens with large-ish payloads.

Thanks

Sorry, just got back to this -- I've been pretty busy.
Yeah, I can try -- I think I can make something pretty simple (a C# console application) to demonstrate this.

Isn't there a + frame.length missing from here?

fill_length = fill_length - 2

The 8-byte length case below does have it, and indeed, "big messages" over 64k are received properly.
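
In other words, the 2-byte extended-length branch forgets to add the decoded payload length when it adjusts fill_length. A hedged paraphrase of that area with the suggested change applied (the variable names follow the snippets quoted in this thread; the real websocket.lua may differ in detail):

if frame.length == 126 then
	-- a 2-byte extended payload length follows the basic header
	local extra = assert(sock:xread(2, "b", 0))
	frame.length = string.unpack(">I2", extra)
	-- was: fill_length = fill_length - 2
	-- suggested fix: also account for the decoded payload length,
	-- exactly as the 8-byte branch below already does
	fill_length = fill_length - 2 + frame.length
elseif frame.length == 127 then
	-- an 8-byte extended payload length follows the basic header
	local extra = assert(sock:xread(8, "b", 0))
	frame.length = string.unpack(">I8", extra)
	fill_length = fill_length - 8 + frame.length
end

That would line up with the failing frame sizes explicitly logged in this thread (1265 bytes, 12964 bytes, ~17 KB, ~22 KB), all of which fall in the 126-to-65535-byte range covered by the 2-byte extended length, while payloads over 64 KB take the 8-byte path and work.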

Edit: fwiw, adding the suggested bit of code makes the issue go away for me.

@LBPHacker
I think that fixed it for me.
Thank you so much!

Can we make a new build/luarocks package with this change?
@daurnimator