raiden-network / raiden

Sync-call to matrix potentially stuck

ezdac opened this issue · comments

Problem

Since very recently, the sync-worker sometimes gets stuck in an HTTP request.

Initial encounter

We saw that the sync-worker got stuck during testing when @istankovic was trying to initiate a direct transfer around 2021-06-08 09:45:

"2021-06-08 08:52:54.845368 api.sync returned"
"2021-06-08 08:52:54.848562 Calling api.sync"
"2021-06-08 08:52:55.134066 api.sync returned"
"2021-06-08 08:52:55.137238 Calling api.sync"

The expected following "api.sync returned" entry was never logged. The relevant code around the sync call is:

log.debug(
    "Calling api.sync",
    node=node_address_from_userid(self.user_id),
    user_id=self.user_id,
    sync_iteration=self.sync_iteration,
    time_since_last_sync_in_seconds=time_since_last_sync_in_seconds,
)
self.last_sync = time_before_sync
response = self.api.sync(
    since=self.sync_token, timeout_ms=timeout_ms, filter=self._sync_filter_id
)
time_after_sync = time.monotonic()
log.debug(
    "api.sync returned",
    node=node_address_from_userid(self.user_id),
    user_id=self.user_id,
    sync_iteration=self.sync_iteration,
    time_after_sync=time_after_sync,
    time_taken=time_after_sync - time_before_sync,
)
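
For context, the timeout_ms passed to api.sync above is the long-poll timeout the Matrix server is asked to honour for /sync; by itself it does not put any timeout on the client-side socket read. A minimal, purely illustrative sketch of a client-side guard with gevent.Timeout (blocking_sync and the constants are hypothetical stand-ins, not the Raiden code):

import gevent

# Illustrative sketch only: bound a long-poll style call on the client side so
# that a response that never arrives raises gevent.Timeout instead of leaving
# the worker stuck forever.  blocking_sync stands in for self.api.sync(...).
LONG_POLL_TIMEOUT_S = 60   # matches timeout_ms=60000 above
GUARD_SLACK_S = 30         # extra slack on top of the server-side timeout


def blocking_sync():
    gevent.sleep(2)  # placeholder for the HTTP long-poll request
    return {"next_batch": "s1"}


def guarded_sync():
    # gevent.Timeout is raised inside the current greenlet once it expires,
    # even while the greenlet is blocked waiting on a socket.
    with gevent.Timeout(LONG_POLL_TIMEOUT_S + GUARD_SLACK_S):
        return blocking_sync()


if __name__ == "__main__":
    print(guarded_sync())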

In the PFS we saw the address switch to offline at 2021-06-08 08:54:15.281491 UTC - the last Synapse log entry for that address before 09:00 was:

synapse_1            | 2021-06-08 08:53:42,787 synapse.access.http.8008(387) [INFO    ]: GET-562351 46.90.91.92 - 8008 - {@0x8c31290358f6855baf060e6fa7b11517444641e3:transport.raiden.overdoze.se} Processed request: 60.004sec/-0.000sec (0.004sec, 0.000sec) (0.000sec/0.000sec/0) 343B 200 "GET /_matrix/client/r0/sync?timeout=60000&since=s1_129573_0_1_1_104_74888_46_1&access_token=<redacted> HTTP/1.1" "Raiden 2.0.0rc2" [0 dbevts]

And the next occurrence was:

synapse_1            | 2021-06-08 09:45:24,692 synapse.access.http.8008(387) [INFO    ]: PUT-563611 46.90.91.92 - 8008 - {@0x8c31290358f6855baf060e6fa7b11517444641e3:transport.raiden.overdoze.se} Processed request: 0.011sec/-0.000sec (0.003sec, 0.000sec) (0.002sec/0.003sec/3) 2B 200 "PUT /_matrix/client/r0/sendToDevice/m.room.message/30201623145536848?access_token=<redacted> HTTP/1.1" "Raiden 2.0.0rc2" [0 dbevts]

Investigating the greenlets, we found that the sync-worker was stuck while waiting for bytes on the socket during the request to the matrix server's sync endpoint:

+--- <Greenlet "GMatrixClient.sync_worker user_id:@0x8c31290358f6855baf060e6fa7b11517444641e3:transport.raiden.overdoze.se" at 0x7f9b5575ed00: <bound method GMatrixClient.listen_forever of <raiden.network.transport.matrix.client.GMatrixClient object at 0x7f9b5608d760>>(60000, 15000, None)>
 :          Parent: <Hub '' at 0x7f9b591a1640 epoll default pending=1 ref=2 fileno=4 resolver=<gevent.resolver.thread.Resolver at 0x7f9b56bf9e80 pool=<ThreadPool at 0x7f9b56b9f880 tasks=0 size=3 maxsize=10 hub=<Hub at 0x7f9b591a1640 thread_ident=0x7f9b6db60740>>> threadpool=<ThreadPool at 0x7f9b56b9f880 tasks=0 size=3 maxsize=10 hub=<Hub at 0x7f9b591a1640 thread_ident=0x7f9b6db60740>> thread_ident=0x7f9b6db60740>
 :          Running:
 :            File "/home/ivan/src/raiden/raiden/network/transport/matrix/client.py", line 410, in listen_forever
 :              self._sync(timeout_ms, latency_ms)
 :            File "/home/ivan/src/raiden/raiden/network/transport/matrix/client.py", line 650, in _sync
 :              response = self.api.sync(
 :            File "/home/ivan/venv-3.9/lib/python3.9/site-packages/matrix_client/api.py", line 104, in sync
 :              return self._send("GET", "/sync", query_params=request,
 :            File "/home/ivan/src/raiden/raiden/network/transport/matrix/client.py", line 178, in _send
 :              return super()._send(method, path, *args, **kwargs)
 :            File "/home/ivan/venv-3.9/lib/python3.9/site-packages/matrix_client/api.py", line 665, in _send
 :              response = self.session.request(
 :            File "/home/ivan/venv-3.9/lib/python3.9/site-packages/requests/sessions.py", line 542, in request
 :              resp = self.send(prep, **send_kwargs)
 :            File "/home/ivan/venv-3.9/lib/python3.9/site-packages/requests/sessions.py", line 655, in send
 :              r = adapter.send(request, **kwargs)
 :            File "/home/ivan/venv-3.9/lib/python3.9/site-packages/requests/adapters.py", line 439, in send
 :              resp = conn.urlopen(
 :            File "/home/ivan/venv-3.9/lib/python3.9/site-packages/urllib3/connectionpool.py", line 665, in urlopen
 :              httplib_response = self._make_request(
 :            File "/home/ivan/venv-3.9/lib/python3.9/site-packages/urllib3/connectionpool.py", line 416, in _make_request
 :              httplib_response = conn.getresponse()
 :            File "/usr/lib/python3.9/http/client.py", line 1345, in getresponse
 :              response.begin()
 :            File "/usr/lib/python3.9/http/client.py", line 307, in begin
 :              version, status, reason = self._read_status()
 :            File "/usr/lib/python3.9/http/client.py", line 268, in _read_status
 :              line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
 :            File "/usr/lib/python3.9/socket.py", line 704, in readinto
 :              return self._sock.recv_into(b)
 :            File "/home/ivan/venv-3.9/lib/python3.9/site-packages/gevent/_ssl3.py", line 567, in recv_into
 :              return self.read(nbytes, buffer)
 :            File "/home/ivan/venv-3.9/lib/python3.9/site-packages/gevent/_ssl3.py", line 390, in read
 :              self._wait(self._read_event, timeout_exc=_SSLErrorReadTimeout)
 :          Spawned at:
 :            File "/home/ivan/venv-3.9/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
 :              return f(get_current_context(), *args, **kwargs)
 :            File "/home/ivan/src/raiden/raiden/ui/cli.py", line 575, in run
 :              return _run(ctx=ctx, **kwargs)
 :            File "/home/ivan/src/raiden/raiden/ui/cli.py", line 721, in _run
 :              run_services(kwargs)
 :            File "/home/ivan/src/raiden/raiden/ui/runners.py", line 18, in run_services
 :              raiden_service = run_raiden_service(**options)
 :            File "/home/ivan/src/raiden/raiden/ui/app.py", line 457, in run_raiden_service
 :              raiden_service.start()
 :            File "/home/ivan/src/raiden/raiden/raiden_service.py", line 480, in start
 :              self._start_transport()
 :            File "/home/ivan/src/raiden/raiden/raiden_service.py", line 598, in _start_transport
 :              self.transport.start(raiden_service=self, prev_auth_data=None)
 :            File "/home/ivan/src/raiden/raiden/network/transport/matrix/transport.py", line 557, in start
 :              self._initialize_sync()
 :            File "/home/ivan/src/raiden/raiden/network/transport/matrix/transport.py", line 856, in _initialize_sync
 :              self._client.start_listener_thread(
 :            File "/home/ivan/src/raiden/raiden/network/transport/matrix/client.py", line 464, in start_listener_thread
 :              self.sync_worker = gevent.spawn(
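
For reference, this kind of greenlet tree with stack traces can be produced with gevent's built-in helper; a minimal sketch, assuming gevent >= 1.3:

import sys

import gevent.util

# Print a tree of all greenlets (and threads) together with their current
# stacks -- the same kind of output as shown above -- to stderr.
gevent.util.print_run_info(file=sys.stderr)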

We believed this to be a rare occurrence.

Confirmation

Now this has also occurred in a BF1 scenario (on the Rinkeby testnet), which makes us believe that this is a systematic bug on our side.

root@yzma:/data/scenario-player/scenarios/bf1_basic_functionality/node_4910_003# cat run-4910.log | grep "Calling api.sync\|api.sync returned" |jq '[.event, .timestamp] | join(" ")'
...
"api.sync returned 2021-06-08 20:46:33.928846"
"Calling api.sync 2021-06-08 20:46:33.929727"
"api.sync returned 2021-06-08 20:47:33.940098"
"Calling api.sync 2021-06-08 20:47:33.941045"
"api.sync returned 2021-06-08 20:47:33.992759"
"Calling api.sync 2021-06-08 20:47:33.993626"

And after that only the blockchain event-fetching loop was running (no logs shown here), until we stopped the node:

"Copied state before applying state changes 2021-06-08 20:48:21.406133"
"Synchronizing blockchain events 2021-06-08 20:48:21.419843"
"State changes 2021-06-08 20:48:21.420674"
"Raiden events 2021-06-08 20:48:21.434734"
"Synchronized to a new confirmed block 2021-06-08 20:48:21.435058"
"Signal received. Shutting down. 2021-06-08 20:48:26.291081"
"REST API stopping 2021-06-08 20:48:26.291850"
"Idle 2021-06-08 20:48:26.689693"
"REST API stopped 2021-06-08 20:48:27.294251"
"Matrix stopping 2021-06-08 20:48:27.294788"
"Handling worker exiting, stop is set 2021-06-08 20:48:27.295311"
"Waiting on sync greenlet 2021-06-08 20:48:27.296350"
"Waiting on handle greenlet 2021-06-08 20:48:27.296712"
"Listener greenlet exited 2021-06-08 20:48:27.297105"
"Waiting on own greenlets 2021-06-08 20:48:27.298859"
"Transport performance report 2021-06-08 20:48:27.378184"
"Matrix stopped 2021-06-08 20:48:27.380301"
"Alarm task stopped 2021-06-08 20:48:27.381118"
"Raiden Service stopped 2021-06-08 20:48:27.382575"

So it seems that the api.sync call in the sync-worker also never returned.

Plan of action

  • Check the SP environment's PFS logs
  • Try to replicate the problem deterministically
  • Investigate possible deadlocks or async problems (see the read-timeout sketch below)
  • Check for known bugs in async / gevent libraries concerned with networking
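
On the read-timeout sketch referenced above: the stuck stack ends in a socket read, so one thing to check is whether the sync request carries any client-side read timeout at all. A hedged sketch of what such a timeout looks like with requests (the URL and numbers are made up for illustration; how such a timeout would be threaded through matrix_client's _send is a separate question):

import requests

# Illustrative only: requests accepts a (connect, read) timeout tuple; the
# read timeout bounds each individual socket read, so a half-dead connection
# eventually raises requests.exceptions.ReadTimeout instead of blocking.
CONNECT_TIMEOUT_S = 10
READ_TIMEOUT_S = 75  # kept above the 60s long-poll timeout used for /sync

session = requests.Session()
response = session.get(
    "https://transport.example.org/_matrix/client/r0/sync",  # hypothetical URL
    params={"timeout": 60_000},
    timeout=(CONNECT_TIMEOUT_S, READ_TIMEOUT_S),
)
response.raise_for_status()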

Did you get the traceback including all greenlets? Can we confirm that the sync worker also got stuck waiting on the same call?

The node was properly stopped and shut down, so I don't believe this would get us a traceback with the greenlet.
If I'm reading this correctly, the sync-worker would be silently killed and no exception would be raised:

self.sync_worker.kill()
log.debug(
    "Waiting on sync greenlet",
    node=node_address_from_userid(self.user_id),
    user_id=self.user_id,
)
exited = gevent.joinall({self.sync_worker}, timeout=SHUTDOWN_TIMEOUT, raise_error=True)
if not exited:
    raise RuntimeError("Timeout waiting on sync greenlet during transport shutdown.")
self.sync_worker.get()
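
For context, gevent treats a greenlet that dies with the default GreenletExit as having exited successfully, so neither joinall(raise_error=True) nor get() re-raises anything here. A small standalone demonstration (worker is a made-up example function):

import gevent


def worker():
    while True:
        gevent.sleep(1)


g = gevent.spawn(worker)
gevent.sleep(0)  # let the worker start

g.kill()  # raises GreenletExit inside the worker; blocks until it is dead
exited = gevent.joinall({g}, timeout=5, raise_error=True)  # does not raise

assert g in exited
assert g.successful()  # GreenletExit counts as a successful exit
assert isinstance(g.get(), gevent.GreenletExit)  # get() returns it, does not raise
print("killed greenlet exited silently:", repr(g.get()))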

Maybe we should enable gevent (+asyncio) debugging output for the scenario player nodes. This would lead to the debugging info being printed to run-XXXX.stderr in the scenario player logs.

(See also [1])
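
A rough sketch of what enabling such debug output could look like (an assumption for illustration only; the concrete suggestions are the ones in the linked comment):

import asyncio

import gevent

# Assumed illustration, not the actual change: turn on gevent's periodic
# monitor thread (reports greenlets that block the hub for too long on the
# hub's exception stream, stderr by default) and asyncio's debug mode.
# Both must be configured before the hub / event loop start running.
gevent.config.monitor_thread = True
gevent.config.max_blocking_time = 5.0  # seconds

loop = asyncio.new_event_loop()
loop.set_debug(True)  # slow-callback warnings, unretrieved task exceptions, etc.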

What do you think?

Unassigning myself for the time being -- I implemented the suggestions from #7141 (comment), which should give us more debugging information if this reappears in scenarios.

Let's close this and see if it happens again during manual testing.