foxglove / ws-protocol

Foxglove Studio WebSocket protocol specification and libraries

Error on closing websocket connection

KerstinKeller opened this issue

Description
Sometimes, when closing studio, I get the following error on the websocket server:

Task exception was never retrieved
future: <Task finished name='Task-4' coro=<handle_messages() done, defined at D:\python\ecal_live.py:200> exception=InvalidState('Cannot write to a WebSocket in the CLOSING state')>
Traceback (most recent call last):
  File "D:\python\ecal_live.py", line 203, in handle_messages
    await server.send_message(
  File "D:\.venv-3.10\lib\site-packages\foxglove_websocket\server\__init__.py", line 168, in send_message
    await self._send_message_data(
  File "D:\.venv-3.10\lib\site-packages\foxglove_websocket\server\__init__.py", line 193, in _send_message_data
    await connection.send([header, payload])
  File "D:\.venv-3.10\lib\site-packages\websockets\legacy\protocol.py", line 666, in send
    await self.write_frame(True, OP_CONT, b"")
  File "D:\.venv-3.10\lib\site-packages\websockets\legacy\protocol.py", line 1185, in write_frame
    raise InvalidState(
websockets.exceptions.InvalidState: Cannot write to a WebSocket in the CLOSING state
  • Version: python implementation, v0.0.7
  • Platform: Windows
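
The "Task exception was never retrieved" line is asyncio's report for a task that failed without anything awaiting it or reading its exception. Below is a minimal sketch of one way to retrieve and log such an exception with a done callback. The handle_messages body here is a hypothetical stand-in that simply raises; it is not the code from ecal_live.py, and RuntimeError stands in for the real InvalidState exception.

```python
import asyncio

logged = []  # collected error reports, for illustration only


def log_task_exception(task):
    """Done callback: read the task's exception so asyncio treats it as retrieved."""
    if task.cancelled():
        return
    exc = task.exception()
    if exc is not None:
        logged.append(repr(exc))


async def handle_messages():
    # Hypothetical stand-in for the real handler: fail the way the traceback does.
    raise RuntimeError("Cannot write to a WebSocket in the CLOSING state")


async def main():
    task = asyncio.create_task(handle_messages())
    task.add_done_callback(log_task_exception)
    # Wait for the task to finish without re-raising its exception here.
    await asyncio.wait([task])


asyncio.run(main())
```

This does not fix the underlying failure; it only makes sure the exception is seen and handled in one place instead of surfacing when the task is garbage collected.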

Steps To Reproduce
Close the websocket connection while sending data

Expected Behavior
The send_message operation should not throw an exception.
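
Until the cause is known, a caller can guard the send itself. The sketch below shows the idea with stand-in classes so it runs without any third-party package: InvalidState here is a local placeholder for websockets.exceptions.InvalidState, and ClosingConnection and send_ignoring_close are hypothetical names, not part of foxglove_websocket.

```python
import asyncio


class InvalidState(Exception):
    """Local placeholder for websockets.exceptions.InvalidState (assumption)."""


class ClosingConnection:
    """Hypothetical connection that behaves like a socket in the CLOSING state."""

    async def send(self, data):
        raise InvalidState("Cannot write to a WebSocket in the CLOSING state")


async def send_ignoring_close(connection, data):
    """Send data; return False instead of raising if the connection is closing."""
    try:
        await connection.send(data)
    except InvalidState:
        # The peer started closing between our last check and this write.
        return False
    return True


result = asyncio.run(send_ignoring_close(ClosingConnection(), b"payload"))
```

In real code the except clause would name the exception class imported from the websockets package, and the caller would typically stop sending to that client once False is returned.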

@KerstinKeller I'm unable to reproduce this with the example server in https://github.com/foxglove/ws-protocol/blob/main/python/src/foxglove_websocket/examples/json_server.py. I tried modifying it to send larger messages and to send them at a higher rate. However, when I repeatedly refresh Studio to close and reestablish the connection, I never see this error.

Are you sure it is a foxglove_websocket bug or is it possible there's a threading bug in ecal_live.py? The InvalidState exception makes me think some asyncio internal assumptions are being violated by multithreaded code.
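
For reference, if data ever does come from another thread, the standard library's asyncio.run_coroutine_threadsafe is the supported way to hand a coroutine to the loop's thread. A minimal sketch follows, where send_message is a hypothetical stand-in for the server's send coroutine and producer is an illustrative worker function:

```python
import asyncio
import threading

calls = []  # (payload, id of the thread the send ran on)


async def send_message(payload):
    # Hypothetical stand-in for the server's send coroutine.
    calls.append((payload, threading.get_ident()))


def producer(loop):
    # Runs in a worker thread. Do not call or await send_message here;
    # submit it to the event loop's thread instead.
    future = asyncio.run_coroutine_threadsafe(send_message(b"data"), loop)
    future.result(timeout=5)  # block the worker until the loop has run it


async def main():
    loop = asyncio.get_running_loop()
    worker = threading.Thread(target=producer, args=(loop,))
    worker.start()
    # Join in an executor so the event loop stays free to run send_message.
    await loop.run_in_executor(None, worker.join)
    return threading.get_ident()


loop_thread_id = asyncio.run(main())
```

The recorded thread id matches the loop's thread, which is the property that keeps asyncio's internal state consistent.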

@jtbandes assigning so you can keep tabs on this and determine the path forward based on the response.

Hi @jtbandes it's something that happens sporadically, and I am not able to reproduce it at will.

I doubt threading is the culprit here. In ecal_live.py, data is posted only from the main thread.

Roughly how often does it happen? Any tips on how to reproduce it?

It's really hard to tell. Maybe 1 in 20 times? Probably even less.
It's not particularly bothering me since it only happens on close, but I wanted to file the issue for completeness.

Maybe I can attach the debugger in the future and get more info?
I am really sorry that I don't have any more clues on this 😞

I noticed that the path in the stack trace contains the word legacy, which made me wonder whether there are other code paths in the websockets module (https://github.com/aaugustin/websockets/tree/main/src/websockets), and it looks like there are. I don't know anything about how this module is organized, so this may be a misleading path to explore, but it would be interesting to figure out why something in the legacy code paths is being run. Maybe this is expected behavior.

Technically the legacy module is a private API but I know that it will be imported directly so I'll be careful if I ever decide to remove it. I don't have specific plans at this time. Keeping it "forever" is an option if there are good reasons to do so.

(see python-websockets/websockets#975)

And from the call stack it seems that the legacy functions are called directly, even though they are not imported directly 🤔

The only information I found is that the "cheat sheet" is outdated and that the tutorial is the place to go.

But I still agree: without a proper way to reproduce it, it might be very hard to locate.

@KerstinKeller Have you had any more success reproducing this issue? It could also be something platform-specific (Windows) that surfaces with the websockets library. If there are no steps to reproduce or a way to catch the error, we'll move this over to discussions, since nothing here is actionable without a way to reproduce it.

I tried the same tests in a Windows 11 VM (python 3.10.6) and am still unable to reproduce. I'm going to close this issue since it's not actionable for us. If you continue experiencing problems or find a way to make it happen more reliably, please let us know!