ssbc / patchlite

[WIP] A browser lite client for the Scuttlebutt network

Websocket seems to be very slow

ezdiy opened this issue · comments

commented

Just to verify whether it's me or others too: the client loads very slowly, but this is mainly because the server process is pegged at 100% CPU. I see a lot of small ws frames on both ends, most likely individual RPC calls.

Not sure if this is the underlying cause, but node copes very badly with small network writes (it ping-pongs between threads for reasons known only to node developers).

commented

hey @ezdiy, yeah it's not just you, it's slow as. 🐢

i'm not sure this is necessarily because of the WebSocket transport. another possibility is the crypto we do in the browser (attempting to decrypt private messages): that uses the pure JavaScript implementation, which is 10-100x slower.

could be interesting to do some profiling. see 0x or devtool for profiling sbot, or any in-browser profiler for that side.

commented

Devtools weren't revealing whatsoever. 0x spat out 100MB of JSON and the flamegraph took forever to render, but it was somewhat more revealing.

If I read the graph correctly, it's related to how pull-reader and the rest of the protocol stack nuke the performance of sbot in general - this is probably why sbot hogs 100% CPU when syncing while not doing very much. These packages appear to be based on a misunderstanding of why Node streams are done the way they're done (those are designed to stay fast even with a high watermark in the queue). pull-* is faster with shallow queues, but atrocious with deep queues: JS has no TCO by default, and the deep call stacks plus setImmediate()s are very slow.

WebSocket pronounces this dramatically because the web browser is a more sluggish ws frame consumer, so it triggers deeper queues, and thus greater nested-call overhead in pull-stream.

A band-aid would be trying to do TCO as much as possible - 'use strict', node --harmony_tailcalls, and being careful to never introduce a try {} deopt on the queue call stack. It's a tough call. In the long term, dropping the NIH junk and going with regular Node-style queues backed by arrays would be preferable.
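To make the deep-queue cost concrete, here is a hypothetical micro-benchmark (not code from this repo) contrasting one setImmediate per item against draining an array queue in a single loop:

```js
// Drain a 100k-item queue: one macrotask per item vs. one tight loop.
const N = 1e5

function drainPerTick (queue, done) {
  if (queue.length === 0) return done()
  queue.pop() // consume one item
  setImmediate(() => drainPerTick(queue, done)) // one macrotask per item
}

function drainLoop (queue, done) {
  while (queue.length) queue.pop() // consume everything in one tick
  done()
}

const a = Array.from({ length: N }, (_, i) => i)
console.time('setImmediate per item')
drainPerTick(a, () => {
  console.timeEnd('setImmediate per item')
  const b = Array.from({ length: N }, (_, i) => i)
  console.time('array loop')
  drainLoop(b, () => console.timeEnd('array loop'))
})
```

On any recent node the loop version finishes orders of magnitude faster, which is the overhead pattern described above.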

cc @dominictarr

commented

The above is relevant if I read the flamegraph correctly:

First big column: flume
Second big column: json parser
Third column: direct websocket stream stuff
Fourth column: pull-stuff (most time seems to be spent dealing with ws)

Here is the flamegraph itself: http://pub.lua.cz/fg/flamegraph.html

The many small ws packets could be fixed with something that applies Nagle's algorithm (or similar) after going through encryption; that would make fewer, larger packets.
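A rough sketch of that idea (hypothetical, not sbot code): buffer small writes and flush them as a single frame once a size threshold is hit or the current tick ends.

```js
// Nagle-ish batching: coalesce small buffers into one larger frame.
// `send` is whatever actually writes a frame; maxBytes is a guess.
function makeBatcher (send, maxBytes = 16 * 1024) {
  let chunks = []
  let size = 0
  let scheduled = false

  function flush () {
    scheduled = false
    if (size === 0) return
    send(Buffer.concat(chunks, size)) // one large frame, not many small ones
    chunks = []
    size = 0
  }

  return function write (buf) {
    chunks.push(buf)
    size += buf.length
    if (size >= maxBytes) return flush()
    if (!scheduled) {
      scheduled = true
      setImmediate(flush) // flush at the end of the current tick
    }
  }
}
```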

The other thing is that currently replication means 2k createHistoryStream calls going either way. That will be fixed by rolling out EBT.

In the short term, there is a bug where the pub tries to replicate with the client. We can fix that on the master branch by having the server not call createHistoryStream unless the client calls first. This code is already in the flume branch, it just needs to be cherry-picked.

@ezdiy btw, I made pull-stream after a lot of experience with node streams.

commented

@dominictarr This runs on flume d05b0a6b310964c2e5fd32ad53278ff8db0603eb

Flume is better than master, but not magnitudes better, as one would expect it to be.

This runs with global disabled and no seeds. Just connecting patchlite to it sends it into a 100% CPU frenzy, and it takes a long while for node to even notice if patchlite disconnects right after. It's 100% busy doing something in pull-* for a long while, suggesting that many ws objects are queued somewhere.

With a single ebt seed instead of patchlite and few logs to sync, the pattern is the same but it settles faster, once the log entries are processed (at a glacial pace of 5-10/s, guessing from sbot replicate.changes). The flame graph looks similar otherwise, just with replicate.js at the stack root instead of ssb-ws.

thanks, I will try and reproduce this...

Okay, I have reproduced this!

This issue surprised me, because I had been using patchbay with JS crypto, and it was much more usable than this. It's mostly the same code, so i tried using websockets in electron 'bay by setting localStorage.remote="ws://localhost:8989~shs:EMovhfIrFk4NihAKnRNhrfRaqIhBv1Wj8pTxJNgvCCY=" (that's for my sbot - use your pub key, obviously). that was still about the same.

Then, my guess was that it's trying to replicate with the client, which is sending heaps of (useless) requests back up the websocket. normally, it checks if the client is ourselves (https://github.com/ssbc/scuttlebot/blob/master/plugins/replicate.js#L242) and doesn't replicate. but if you use a different private key in the browser (as per the liteclient instructions!) then it fails that check, and tries to replicate.

I confirmed this by setting localStorage['browser/.ssb/secret'] in electron to the value you have in the browser. This made stock patchbay as slow as patchlite.

okay, so i tried using my regular key from patchlite (in firefox) and found it was still really slow, so websockets may still be a factor here (and/or ff)

okay, I wrote a small script to connect to my server via websockets... and over tcp, i see 3k+ messages per second, even with JS crypto enabled.

but over websockets... I see 10 times less. This is using websockets from node....
this could be because of pull-ws
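Something along these lines (a hypothetical reconstruction - the actual script isn't in the thread; it assumes ssb-client and pull-stream, with the transport chosen by the remote address in your config):

```js
// Count messages streamed from a running sbot and report throughput.
const ssbClient = require('ssb-client')
const pull = require('pull-stream')

ssbClient((err, sbot) => {
  if (err) throw err
  let n = 0
  const start = Date.now()
  pull(
    sbot.createLogStream({ live: false }),
    pull.drain(() => { n++ }, (err) => {
      if (err) throw err
      const secs = (Date.now() - start) / 1000
      console.log(Math.round(n / secs) + ' messages/second (' + n + ' total)')
      sbot.close()
    })
  )
})
```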

Ahem, so it does use setImmediate https://github.com/pull-stream/pull-ws/blob/master/sink.js#L46-L48
we can replace that with https://www.npmjs.com/package/looper

oh, I tried commenting that out... but it seemed to make it slower. obviously some more work to do here...
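For reference, the looper pattern looks roughly like this (a sketch against looper's documented API, not the actual pull-ws patch):

```js
// looper turns a synchronous re-call of next() into another iteration
// of an internal while loop: no stack growth, no macrotask per item.
const looper = require('looper')

function drain (queue, onItem, done) {
  const next = looper(() => {
    if (queue.length === 0) return done()
    onItem(queue.pop())
    next() // loops instead of recursing or scheduling a setImmediate
  })
  next()
}
```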

for comparison, 6k messages a second is about 5mb a second,
so if websockets are doing 200 messages a second, that is still 180k per second.
so probably we can also fix this by making the client-server chatter less noisy.

other possible areas for improvement: shs adds a 34-byte header (2x 16-byte MACs, plus a 2-byte length) to each packet, so lots of short packets are a bad idea there, and also in websockets...
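To put a number on that framing cost (illustrative arithmetic only):

```js
// Fixed 34-byte per-packet header (2x 16-byte MAC + 2-byte length)
// as a share of total wire bytes, at a few example payload sizes.
const header = 16 + 16 + 2 // 34 bytes
for (const payload of [32, 256, 4096]) {
  const pct = (100 * header / (header + payload)).toFixed(1)
  console.log(payload + 'B payload: ' + pct + '% of bytes are framing')
}
// 32B -> ~51.5%, 256B -> ~11.7%, 4096B -> ~0.8%
```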

I did a bit of investigation and a benchmark, and then I looked at the ws library docs:

ws supports the permessage-deflate extension, which enables the client and server to negotiate a compression algorithm and its parameters, and then selectively apply it to the data payloads of each WebSocket message.

uh, well that sounds like a bad idea!

The extension is enabled by default but adds a significant overhead in terms of performance and memory consumption. We suggest to use WebSocket compression only if it is really needed.

emphasis mine 🤦‍♂️

I disabled that and could dump my sbot log in 44 seconds.
Now it's only twice as slow as tcp, not ten times as slow.
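For anyone wanting the same fix by hand, the relevant ws server option is documented as perMessageDeflate (the port here is just an example):

```js
// Create a ws server with the permessage-deflate extension disabled,
// so no per-message compression is negotiated with clients.
const WebSocket = require('ws')

const wss = new WebSocket.Server({
  port: 8989, // example port
  perMessageDeflate: false
})
```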

update to pull-ws@3.2.9 (you should get this on a fresh install)

commented

while talking about WebSocket performance, have y'all heard of µWS? apparently legit.

I had a quick look at that, to see how it would do in the benchmark branch of pull-ws, but it wasn't a drop-in replacement, and I have other stuff I need to work on right now.

commented

@dominictarr

Now it's only twice as slow as tcp, not ten times as slow

Nice, this looks promising!

commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.