radian-software / riju

⚡ Extremely fast online playground for every programming language.

Home Page: https://riju.codes

Some containers are still not cleaned up

raxod502 opened this issue:

Got paged for this at 2am -_-

Aside from someone opening a number of tabs at once, the main cause of the memory usage seems to be containers not being cleaned up:

2021-09-15T08:23:18.403833694Z Sep 15 08:23:18 [85de97f2dcc74cacae79f3ac3e04bd2a] Creating session, language false
2021-09-15T08:23:19.905523390Z Sep 15 08:23:19 [85de97f2dcc74cacae79f3ac3e04bd2a] Tearing down session
2021-09-15T08:23:24.079569326Z Sep 15 08:23:24 [522c2a8d7d8b44239e4de2ee9689b75c] Creating session, language fish
2021-09-15T08:23:25.591740114Z Sep 15 08:23:25 [522c2a8d7d8b44239e4de2ee9689b75c] Tearing down session
2021-09-15T08:23:30.114268035Z Sep 15 08:23:30 [83a3f602aad94f4399b3b9ab08579a75] Creating session, language groovy
2021-09-15T08:23:31.588653754Z Sep 15 08:23:31 [83a3f602aad94f4399b3b9ab08579a75] Tearing down session
2021-09-15T08:23:34.474230653Z Sep 15 08:23:34 [378ee89bb6a042c691d5f89b120bfcc7] Creating session, language gap
2021-09-15T08:23:38.504119850Z Sep 15 08:23:38 [6e75aaee8d9e47e8b6bd13af09dd8580] Creating session, language fortran
2021-09-15T08:23:41.905103286Z Sep 15 08:23:41 [6e75aaee8d9e47e8b6bd13af09dd8580] Tearing down session
2021-09-15T08:23:55.709381625Z Sep 15 08:23:55 [d40738ba57094c37a8ca4640762ab7bd] Creating session, language cobol
2021-09-15T08:24:01.871654385Z Sep 15 08:24:01 [d40738ba57094c37a8ca4640762ab7bd] Tearing down session
2021-09-15T08:24:12.020033345Z Sep 15 08:24:12 [26b49ad05ba54b15a69c8c32a95d9eee] Creating session, language antecards
2021-09-15T08:24:15.227850670Z Sep 15 08:24:15 [26b49ad05ba54b15a69c8c32a95d9eee] Tearing down session
2021-09-15T08:24:20.060560376Z Sep 15 08:24:20 [9b4a8e0902a446899249a2f2fca10740] Creating session, language algol
2021-09-15T08:24:23.348434054Z Sep 15 08:24:23 [9b4a8e0902a446899249a2f2fca10740] Tearing down session
2021-09-15T08:24:26.802293359Z Sep 15 08:24:26 [e565af45b4dc40aeb6f1dd85aa668da0] Creating session, language afnix
2021-09-15T08:24:28.020948526Z Sep 15 08:24:28 [e565af45b4dc40aeb6f1dd85aa668da0] Tearing down session
2021-09-15T08:24:32.426413903Z Sep 15 08:24:32 [984364aefae544d4abeb020799808932] Creating session, language ante
2021-09-15T08:24:47.257043958Z Sep 15 08:24:47 [39d1a60d584842b3bb822df95d8b475d] Creating session, language gel
2021-09-15T08:24:51.492574248Z Sep 15 08:24:51 [9accf17faf7946c2b855ce60c29079f4] Creating session, language gdb
2021-09-15T08:24:52.557775904Z Sep 15 08:24:52 [9accf17faf7946c2b855ce60c29079f4] Tearing down session
2021-09-15T08:24:54.659807132Z Sep 15 08:24:54 [6ac1b0492baa4b7cb4f053c91b289b05] Creating session, language gap
2021-09-15T08:24:57.705151112Z Sep 15 08:24:57 [39105a7f0cd84cdbbe7ba5ab714601fd] Creating session, language gambas
2021-09-15T08:25:01.579222292Z Sep 15 08:25:01 [a54935d1326f45a09aaeb15f61e716c8] Creating session, language hack
2021-09-15T08:25:10.613626576Z Sep 15 08:25:10 [a54935d1326f45a09aaeb15f61e716c8] Tearing down session
2021-09-15T08:25:18.770395630Z Sep 15 08:25:18 [4ba129adc8dd4296ba725574fe626be9] Creating session, language haxe
2021-09-15T08:25:20.374993434Z Sep 15 08:25:20 Error: WebSocket is not open: readyState 2 (CLOSING)
2021-09-15T08:25:20.376462210Z Sep 15 08:25:20     at WebSocket.send (/src/node_modules/ws/lib/websocket.js:314:19)
2021-09-15T08:25:20.376477081Z Sep 15 08:25:20     at Session.send (file:///src/backend/api.js:166:15)
2021-09-15T08:25:20.376481463Z Sep 15 08:25:20     at ChildProcess.<anonymous> (file:///src/backend/api.js:330:16)
2021-09-15T08:25:20.376485493Z Sep 15 08:25:20     at ChildProcess.emit (node:events:394:28)
2021-09-15T08:25:20.376493228Z Sep 15 08:25:20     at ChildProcess.emit (node:domain:470:12)
2021-09-15T08:25:20.376496687Z Sep 15 08:25:20     at maybeClose (node:internal/child_process:1067:16)
2021-09-15T08:25:20.376500135Z Sep 15 08:25:20     at Socket.<anonymous> (node:internal/child_process:453:11)
2021-09-15T08:25:20.376510693Z Sep 15 08:25:20     at Socket.emit (node:events:394:28)
2021-09-15T08:25:20.376513729Z Sep 15 08:25:20     at Socket.emit (node:domain:470:12)
2021-09-15T08:25:20.376516444Z Sep 15 08:25:20     at Pipe.<anonymous> (node:net:662:12)
2021-09-15T08:25:20.417840106Z Sep 15 08:25:20 [4ba129adc8dd4296ba725574fe626be9] Tearing down session
2021-09-15T08:25:33.627040524Z Sep 15 08:25:33 [c8fb31e756754dcdaa210053a4937c3b] Creating session, language io
2021-09-15T08:29:10.851786621Z Sep 15 08:29:10 [f2b3f2e7954d46ec81c6a0ae3ed57517] Creating session, language less
2021-09-15T08:29:21.531177179Z Sep 15 08:29:21 [bb881a11c94043ec8079a09bf72b49c9] Creating session, language nelua
2021-09-15T08:31:03.384472898Z Sep 15 08:31:03 [bb881a11c94043ec8079a09bf72b49c9] Tearing down session
2021-09-15T08:31:04.464174473Z Sep 15 08:31:04 [00a485a8a2574d53b79f057d10646e0e] Creating session, language nelua
2021-09-15T08:32:47.187915024Z Sep 15 08:32:47 [00a485a8a2574d53b79f057d10646e0e] Tearing down session
2021-09-15T08:32:48.487120762Z Sep 15 08:32:48 [755e3a6f8ca443a2bc0ba9323e38cf52] Creating session, language nelua
2021-09-15T08:37:42.242107789Z Sep 15 08:37:42 [755e3a6f8ca443a2bc0ba9323e38cf52] Tearing down session
admin@ip-172-31-5-254:~$ sudo docker ps
CONTAINER ID   IMAGE                                                           COMMAND                  CREATED          STATUS          PORTS                                NAMES
831c5494b0a3   riju:lang-less-020ed29996e38a16cd5a29bd7aa0d444a29cb1d9         "/usr/local/sbin/my_…"   25 minutes ago   Up 25 minutes                                        riju-session-f2b3f2e7954d46ec81c6a0ae3ed57517
4aad599413df   riju:lang-io-986826bb00ad502b44d97e88292096e22ee5b4c5           "/usr/local/sbin/my_…"   28 minutes ago   Up 28 minutes                                        riju-session-c8fb31e756754dcdaa210053a4937c3b
080021d0c007   riju:lang-gambas-722b47c0304dd37b6d3433830b3c04220ca73eab       "/usr/local/sbin/my_…"   29 minutes ago   Up 29 minutes                                        riju-session-39105a7f0cd84cdbbe7ba5ab714601fd
b7b038346383   riju:lang-gap-4975f73761f8fa881390194f4f5086601cb6909c          "/usr/local/sbin/my_…"   29 minutes ago   Up 29 minutes                                        riju-session-6ac1b0492baa4b7cb4f053c91b289b05
648fef306d44   riju:lang-gel-0488e207128b9549196226a8d161b0945947e4f5          "/usr/local/sbin/my_…"   29 minutes ago   Up 29 minutes                                        riju-session-39d1a60d584842b3bb822df95d8b475d
060f124e1a5a   riju:lang-ante-d3131f5d8969dee5b1bc1387f896699cb291db4c         "/usr/local/sbin/my_…"   29 minutes ago   Up 29 minutes                                        riju-session-984364aefae544d4abeb020799808932
da0a4f91d0e6   riju:lang-gap-4975f73761f8fa881390194f4f5086601cb6909c          "/usr/local/sbin/my_…"   30 minutes ago   Up 30 minutes                                        riju-session-378ee89bb6a042c691d5f89b120bfcc7
4b29d77a3adf   riju:lang-carp-d4a04d4eba9611f9cbd84b1c50e363b6a7a8a1cc         "/usr/local/sbin/my_…"   33 minutes ago   Up 33 minutes                                        riju-session-7b63c6092c164ea0bae3c2f8e4c78098
95ee4cc847b0   riju:lang-j-7c38ef5c7f44c3c561b7e96c870c6242683dad50            "/usr/local/sbin/my_…"   34 minutes ago   Up 34 minutes                                        riju-session-98b0714858824b58b254bc0880efb6b2
bfc2b8fc6ff5   riju:lang-kotlin-d1b5c52d0197281ef7e08f388cef7b62da501ed6       "/usr/local/sbin/my_…"   23 hours ago     Up 23 hours                                          riju-session-9fb83bfc31744e0eac7a27781b16b359
cd08aa6fcb24   riju:lang-javascript-9bb992b670ea091d03e52d1f758a8f76c90059b5   "/usr/local/sbin/my_…"   23 hours ago     Up 23 hours                                          riju-session-8ca8a814872540f29c25f6029b7d4c4e
365f21b105ce   riju:lang-javascript-9bb992b670ea091d03e52d1f758a8f76c90059b5   "/usr/local/sbin/my_…"   23 hours ago     Up 23 hours                                          riju-session-041a796ad8f24d3aa57a0fd8d46bedb4
7e86af5eae15   riju:lang-javascript-9bb992b670ea091d03e52d1f758a8f76c90059b5   "/usr/local/sbin/my_…"   23 hours ago     Up 23 hours                                          riju-session-7004628ec6484ea19040173b38307737
8c870ecf43e1   riju:lang-lua-435c8d7589bb82b7e5fbaa73df24282ed952c5c9          "/usr/local/sbin/my_…"   23 hours ago     Up 23 hours                                          riju-session-104a89fadaeb4f839f5be4fbdf9c5a2b
893c835fc3e1   riju:lang-kotlin-d1b5c52d0197281ef7e08f388cef7b62da501ed6       "/usr/local/sbin/my_…"   23 hours ago     Up 23 hours                                          riju-session-c8fcd1b63e5d4275accccfcdd7304fa4
9928c65edee3   riju:lang-javascript-9bb992b670ea091d03e52d1f758a8f76c90059b5   "/usr/local/sbin/my_…"   23 hours ago     Up 23 hours                                          riju-session-50defd6571e140fa943d53c1802baaf1
7ce4d6222100   riju:lang-javascript-9bb992b670ea091d03e52d1f758a8f76c90059b5   "/usr/local/sbin/my_…"   23 hours ago     Up 23 hours                                          riju-session-a9ee98a290af44eab810852bcbfd98b9
14d705421978   riju:lang-lua-435c8d7589bb82b7e5fbaa73df24282ed952c5c9          "/usr/local/sbin/my_…"   23 hours ago     Up 23 hours                                          riju-session-f70aa4104990452ba39207b1028597ce
dacc30edd141   riju:lang-javascript-9bb992b670ea091d03e52d1f758a8f76c90059b5   "/usr/local/sbin/my_…"   23 hours ago     Up 23 hours                                          riju-session-b9389a4518844562a2680b81de93575c
8a6319cf8558   riju:app-deeff5850afa3abb61363d6627af99312035129c               "/usr/local/sbin/my_…"   2 weeks ago      Up 2 weeks      6120/tcp, 127.0.0.1:6230->6119/tcp   riju-app-green
475ac022fe74   riju:lang-python-9791918a7b8fa220cdc395fded5935ba7095932f       "/usr/local/sbin/my_…"   2 weeks ago      Up 2 weeks                                           riju-session-4a31366839194646b0833d865488e905
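
The leaked sessions can be picked out of a log excerpt like the one above by pairing each "Creating session" line with its "Tearing down session" line. Below is a minimal sketch of that pairing, assuming session IDs are the 32-character hex strings in square brackets seen in the logs; `findOrphanedSessions` is a hypothetical helper written for illustration, not part of Riju.

```javascript
// Hypothetical helper: report sessions that were created but never torn down.
// Assumes the "[<32 hex chars>] Creating session, language X" and
// "[<32 hex chars>] Tearing down session" line shapes from the excerpt above.
function findOrphanedSessions(logText) {
  const open = new Map(); // session ID -> language
  const pattern =
    /\[([0-9a-f]{32})\] (Creating session, language (\S+)|Tearing down session)/;
  for (const line of logText.split("\n")) {
    const match = pattern.exec(line);
    if (!match) continue; // skip stack traces and other unrelated lines
    const [, id, event, language] = match;
    if (event.startsWith("Creating")) {
      open.set(id, language);
    } else {
      open.delete(id);
    }
  }
  return [...open].map(([id, language]) => ({ id, language }));
}

// Three lines copied from the excerpt: one complete session, one orphan.
const sample = [
  "2021-09-15T08:23:24.079569326Z Sep 15 08:23:24 [522c2a8d7d8b44239e4de2ee9689b75c] Creating session, language fish",
  "2021-09-15T08:23:25.591740114Z Sep 15 08:23:25 [522c2a8d7d8b44239e4de2ee9689b75c] Tearing down session",
  "2021-09-15T08:23:34.474230653Z Sep 15 08:23:34 [378ee89bb6a042c691d5f89b120bfcc7] Creating session, language gap",
].join("\n");

for (const { id, language } of findOrphanedSessions(sample)) {
  console.log(`${language} riju-session-${id}`);
}
```

In the excerpt above, seven sessions have a "Creating session" line but no teardown (gap, ante, gel, gap, gambas, io, less), and their IDs match the names of the seven containers listed as 25 to 30 minutes old.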

I captured the full container logs for later analysis, since they don't appear to be getting shipped to Loki (#98), killed the rogue containers, and failed over the server process using the supervisor API.

Failing over the server process at least killed the old containers, so that part is working:

admin@ip-172-31-5-254:~$ sudo docker ps
CONTAINER ID   IMAGE                                                       COMMAND                  CREATED          STATUS          PORTS                                NAMES
e31da105e27d   riju:lang-java-010d1cd1b7b87faa411e8eb4f2f3a6131b9a177c     "/usr/local/sbin/my_…"   6 seconds ago    Up 5 seconds                                         riju-session-0bdca862493848f68723cf3cd7c60581
52dce1dcb55b   riju:app-deeff5850afa3abb61363d6627af99312035129c           "/usr/local/sbin/my_…"   14 seconds ago   Up 13 seconds   6120/tcp, 127.0.0.1:6229->6119/tcp   riju-app-blue
475ac022fe74   riju:lang-python-9791918a7b8fa220cdc395fded5935ba7095932f   "/usr/local/sbin/my_…"   2 weeks ago      Up 2 weeks                                           riju-session-4a31366839194646b0833d865488e905

We also have significantly more donations now to fund the AWS spend (thanks @Salakar!), so I'll bump the instance size from t3.small to t3.medium, which should help make things less fragile in general.

Will roll that out in the morning since I don't think things are on fire right now and the rollout takes ~30 minutes.

All in all... this actually went reasonably well, in that I successfully got paged before things fell over, and there was a clear remediation that wasn't going to break again in a few hours.

But it's become crystal clear to me why at Plaid we have our oncall set up across timezones so nobody gets paged in the middle of the night local time.

Once I understand the systems a bit more I'd be happy to take over EST...

Looks like this is happening again.

Capturing more diagnostic information, including a ps aux dump in case that helps with debugging. Mitigating this time by upsizing the instance as discussed above.

I was able to reproduce this locally: the session doesn't get torn down if you close the tab right after opening it. That was a race condition in the session management code, which should be fixed by the above commit.
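
For illustration, here is a minimal sketch of how a close-right-after-open race of this kind can leak a container, and one way to guard against it. This is not the code from the commit: `Session`, `startContainer`, `stopContainer`, and the in-memory `running` set are all hypothetical stand-ins, and the guard shown (teardown waits for any in-flight setup) is just one possible approach.

```javascript
// Hypothetical stand-in for the container runtime: tracks "running" containers.
const running = new Set();
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function startContainer(name) {
  await sleep(20); // simulate slow container startup
  running.add(name);
}

async function stopContainer(name) {
  running.delete(name);
}

class Session {
  constructor(id) {
    this.name = `riju-session-${id}`;
    this.setupPromise = null;
    this.tornDown = false;
  }

  setup() {
    // Remember the in-flight setup so teardown can wait for it.
    this.setupPromise = startContainer(this.name);
    return this.setupPromise;
  }

  async teardown() {
    if (this.tornDown) return; // make teardown idempotent
    this.tornDown = true;
    // Without this wait, teardown could finish before the container exists,
    // and the container would then start with nothing left to remove it.
    if (this.setupPromise) {
      await this.setupPromise.catch(() => {});
    }
    await stopContainer(this.name);
  }
}

async function demo(id = "example") {
  const session = new Session(id);
  const setup = session.setup(); // tab opened
  await session.teardown(); // tab closed immediately afterwards
  await setup;
  return running.has(session.name); // true would mean a leaked container
}

demo().then((leaked) => console.log("container leaked:", leaked));
```

Removing the wait on `setupPromise` in `teardown` makes the demo report a leak, which mirrors the symptom described above: the teardown runs, but the container outlives it.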

Also, I updated PagerDuty to use the "support hours" feature, where pages during the night will wait until the morning to escalate to high priority.