automerge / automerge-repo


Sync server scalability for live collaboration

mweidner037 opened this issue · comments

I've been running benchmarks for rich-text editors with live collaboration. Using automerge with a modified automerge-repo-sync-server, I noticed performance issues starting at ~4 simultaneous users.

In each benchmark, the users connect to the server using BrowserWebSocketClientAdapter, wait for the server to share the document, and then start typing at 6 chars/sec (plus occasional formatting ops). They also record "remote latency": the time from when one user types some text until it shows up for the other users.
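One way to implement that remote-latency measurement is for the typing client to log the wall-clock time of each local edit and for every other client to log the time the edit becomes visible; the difference is that edit's remote latency. A minimal sketch of this scheme, assuming synchronized clocks across clients (`recordLocalEdit` and `remoteLatencyMs` are hypothetical names, not the benchmark's actual code):

```javascript
// Map from edit id -> time (ms) the originating client applied it locally.
function recordLocalEdit(log, editId, nowMs) {
  log.set(editId, nowMs);
}

// Called on a remote client when the edit shows up; returns latency in ms,
// or null if the edit's send time was never recorded.
function remoteLatencyMs(log, editId, nowMs) {
  const sentAt = log.get(editId);
  if (sentAt === undefined) return null;
  return nowMs - sentAt;
}
```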

With 4 users, after about 90 sec of (continuous) typing, the server's CPU usage hits 100% and the remote latencies spike to ~20 seconds. With 8 users, the spike occurs within 30 seconds. I believe it depends on (# users) * (size of doc). Full time-series data: allActive-4-automergeRepo.csv, allActive-8-automergeRepo.csv
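A back-of-envelope model (an assumption, not something measured) is consistent with that shape: if handling one edit costs work proportional to the current document size, and the document grows at (# users) × (rate) chars/sec, then server load grows linearly with elapsed time, and doubling the users quadruples the load at any fixed time:

```javascript
// Hypothetical cost model: per-edit cost proportional to document size.
// docSize(t) = users * rate * t, and edits arrive at users * rate per sec,
// so work/sec ~ (users * rate)^2 * t * costPerChar.
function serverWorkPerSecond(users, charsPerSecPerUser, elapsedSec, costPerChar) {
  const docSize = users * charsPerSecPerUser * elapsedSec; // chars typed so far
  const editsPerSec = users * charsPerSecPerUser;          // edits arriving now
  return editsPerSec * docSize * costPerChar;
}
```

Under this model, 8 users reach the same load as 4 users in a quarter of the time, roughly matching the 90 sec vs. 30 sec spikes above.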

Variations:

  • If I use a basic WebSocket echo server instead of automerge-repo-sync-server, it scales to at least 16 simultaneous users.
  • When I ran similar benchmarks in August (v1.0.2), even 3 users would see a spike in remote latency, but now 3 users seem okay. So there has been recent improvement.
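For comparison, the echo server's hot path is essentially a broadcast loop that never parses or merges document state, which is why it stays cheap. A stripped-down sketch (a hypothetical `broadcast` helper; the real server would wire this to `ws` connection events):

```javascript
// Forward each message to every connected client except the sender.
// No document state is touched, so per-message cost is O(#clients)
// regardless of document size.
function broadcast(clients, sender, message) {
  for (const client of clients) {
    if (client !== sender) client.send(message);
  }
}
```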

Reproduce

More details about the benchmarks are in Section 7.1 of this preprint. Server code we're using is here, client code is here.

You can easily run the benchmarks locally, although the data above is from running it on AWS (with clients + server in different regions).

  1. Clone https://github.com/composablesys/collabs-rich-text-benchmarks
  2. In server/package.json, update the @automerge/... dependencies to the latest versions. (The existing versions are the ones used in the preprint.)
  3. Install and build:

npm ci && npm run build

  4. Run a local experiment (7 minutes):

cd client
bash local_exp.sh ../data allActive 4 automergeRepo trial0

"4" is the number of users; replace "automergeRepo" with "automerge" to use the basic WebSocket echo server.

  5. Analyze the data:

cd ../analysis
npm start ../results ../data/allActive-004-automergeRepo/

CSVs will be in results/. There are also client CPU profiles but not server CPU profiles (sorry).
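For reference, a statistic like the remote-latency P95 in those CSVs can be computed from raw latency samples with a standard nearest-rank percentile (a hypothetical helper; the analysis scripts may use a different interpolation):

```javascript
// Nearest-rank percentile: sort the samples and take the value at
// rank ceil(p/100 * n). Returns NaN for an empty sample set.
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```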

Versions:

  • "@automerge/automerge": "2.1.5"
  • "@automerge/automerge-repo": "1.0.12"
  • "@automerge/automerge-repo-network-websocket": "1.0.12"
  • Node v18.17.1
  • Ubuntu 22.04.3

Thanks for the report, @mweidner037, this is really helpful. Your results seem vaguely plausible given the performance traces I've seen lately in the browser. (As an aside, the Chrome DevTools have really excellent performance analysis tools.)

We've been looking at similar problems and have a few patches in the works, but getting independent performance testing is always very welcome!

I've been looking into this. There are two issues related to marks that cause you to call Automerge.marks() when you shouldn't need to; I will have a PR soon that fixes this. As for automerge-repo falling down: I saw the same behavior, but it seemed to go away just by updating to a more recent version. I was hoping you could confirm with:

     "server": {
       "dependencies": {
-        "@automerge/automerge": "2.1.2-alpha.0",
-        "@automerge/automerge-repo": "1.0.2",
-        "@automerge/automerge-repo-network-websocket": "1.0.2",
+        "@automerge/automerge": "2.1.5",
+        "@automerge/automerge-repo": "1.0.12",
+        "@automerge/automerge-repo-network-websocket": "1.0.12",
         "@collabs/collabs": "0.13.4",

Also I made an even more lightweight version of the automerge-server.js and haven't seen it go over 1% CPU usage. Not very useful as it's just an echo server but happy to share if you like.

> Also I made an even more lightweight version of the automerge-server.js and haven't seen it go over 1% CPU usage. Not very useful as it's just an echo server but happy to share if you like.

Sure, I can change automerge mode to use that.

> As for the automerge-repo falling down - I saw the same behavior but just by updating to a more recent version this seemed to go away.

I'll try this out when I can.


Aside: When you are running locally, you can make it more stressful (as if there were more users) by increasing the rate of edits. Here is the relevant constant: https://github.com/composablesys/collabs-nsdi/blob/master/client/src/scenarios/all_active.ts#L16

Probably 30 edits/second (value 33 ms) with 3-4 users will make things interesting. Remote latency P95 should be the most sensitive indicator.
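Since that constant is an inter-edit delay in milliseconds, the conversion from a target edit rate is just (a trivial hypothetical helper, shown for clarity):

```javascript
// Convert a target edit rate (edits/sec) into the inter-edit delay (ms)
// used by the benchmark's typing loop.
function editIntervalMs(editsPerSec) {
  return Math.round(1000 / editsPerSec);
}
```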

Made a PR that adds a fast marksAt() method, so you don't need to walk the whole text field on every insert:

automerge/automerge#785
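For context on why this helps: a full marks() call materializes every mark span in the text, while a point query only needs the marks covering a single index. A simplified illustration of the point query (hypothetical span tuples, not Automerge's internal representation):

```javascript
// spans: array of [start, end, name, value] mark ranges over the text.
// Returns the marks active at a single index, without materializing
// a marks map for the entire document.
function marksAtIndex(spans, index) {
  const active = {};
  for (const [start, end, name, value] of spans) {
    if (start <= index && index < end) active[name] = value;
  }
  return active;
}
```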

Something you could add to the benchmark that would be hugely valuable (and that I was having trouble doing myself while reading your harness code) would be a server flamegraph in the gathered stats, maybe via the 0x package.

I updated the code to use Automerge.marksAt and also upgraded to the latest versions (automerge 2.1.6, automerge-repo-* 1.0.15). However, I'm seeing similar performance: e.g., with 8 users on AWS, latency spikes around 30 sec and does not recover.

(The updates are in a dev copy of the repo - I can invite you if you like.)

> Something you could add to the benchmark that would be hugely valuable (and that I was having trouble doing myself while reading your harness code) would be a server flamegraph in the gathered stats, maybe via the 0x package.

You should be able to change "node" to "0x" on this line. I tried it a few times, but it was flaky: it would give me a flamegraph if I killed the process after a few seconds, but not after running a whole experiment (even one shortened to 30 sec).