Live stage downtime
chrisandrewcl opened this issue · comments
Sometimes I notice my lambdas are invoked but no invocation is logged in the local `sst dev` log. It is not the logging that fails; it is the connection between the lambda and the local server that silently fails. Restarting the `sst dev` process fixes the issue, but it is not always clear when it is happening.
Any ideas why this happens?
Also, given usage reports and observed behavior, it seems that the lambdas that failed to call home keep running, doing nothing, until their timeout expires, which in some cases is very wasteful. Even if the server is restarted, already-running lambdas wait out their full timeout before trying to connect again.
Maybe this can be improved somehow?
Some ideas:
- Short bridge timeout, aborting earlier with clear feedback
- When the local server restarts, if the bridge is still running it should connect right away
- Make the bridge return ok to avoid stuck retry loops
- Sum of the above, but configurable
* This has been happening with several versions, but I have yet to test 0.0.403.
** Not sure if it is the correct term, but by "bridge" I mean the code that the lambda executes in live mode to connect with the local server.
Yeah I guess let's first figure out why the connection is getting dropped. Can you share some logs for when that happens? Both on the Cloudwatch side and locally?
Or if you have any clues as to when this happens. Are you leaving the CLI running overnight?
> Can you share some logs for when that happens? Both on the Cloudwatch side and locally?
Ok, I'll look for it next time it happens.
> Or if you have any clues as to when this happens. Are you leaving the CLI running overnight?
Not overnight, just normal usage for a few hours.
* But some of the suggestions above came from an unfortunate incident during my initial experiments: I left in a hurry and didn't notice that `sst remove` had failed, so a single stray SQS message, whose consumer had no proper redrive policy, kept eating through my dev account's free tier and gave me an unexpected bill. It would be nice to have more budget-friendly behavior when the dev server for a live stage is not running. If you prefer to keep this issue focused on the disconnection problem, I can open another one for this part instead. Please let me know.
It disconnects quite often for me too, not sure why.
✓ No changes
api: https://******.lambda-url.us-east-1.on.aws/
time=2024-06-06T15:31:54.429+02:00 level=INFO msg="INFO unlocking app=*** stage=***"
time=2024-06-06T15:31:54.850+02:00 level=INFO msg="file event" path=***/sst-env.d.ts op=CHMOD
time=2024-06-06T15:31:54.850+02:00 level=INFO msg=publishing type=*watcher.FileChangedEvent
time=2024-06-06T15:31:54.850+02:00 level=INFO msg="checking if code needs to be rebuilt" file=***/sst-env.d.ts
time=2024-06-06T15:31:54.971+02:00 level=INFO msg="waiting for file changes"
time=2024-06-06T15:32:39.140+02:00 level=INFO msg="mqtt connection lost" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time=2024-06-06T15:32:39.140+02:00 level=INFO msg="mqtt reconnecting"
time=2024-06-06T15:32:40.111+02:00 level=INFO msg="mqtt connected"
time=2024-06-06T15:33:40.115+02:00 level=INFO msg="mqtt connection lost" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time=2024-06-06T15:33:40.117+02:00 level=INFO msg="mqtt reconnecting"
time=2024-06-06T15:33:41.021+02:00 level=INFO msg="mqtt connected"
time=2024-06-06T15:34:41.022+02:00 level=INFO msg="mqtt connection lost" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time=2024-06-06T15:34:41.022+02:00 level=INFO msg="mqtt reconnecting"
time=2024-06-06T15:34:42.023+02:00 level=INFO msg="mqtt connected"
time=2024-06-06T15:35:42.029+02:00 level=INFO msg="mqtt connection lost" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time=2024-06-06T15:35:42.030+02:00 level=INFO msg="mqtt reconnecting"
time=2024-06-06T15:35:42.942+02:00 level=INFO msg="mqtt connected"
time=2024-06-06T15:36:42.981+02:00 level=INFO msg="mqtt reconnecting"
time=2024-06-06T15:36:42.981+02:00 level=INFO msg="mqtt connection lost" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time=2024-06-06T15:36:44.820+02:00 level=INFO msg="mqtt reconnecting"
time=2024-06-06T15:36:46.499+02:00 level=INFO msg="mqtt reconnecting"
EDIT: It might have been related to the fact that I forgot to turn off my VPN 🙃 . Checking now if it works as expected without it.
hey was this resolved?
@thdxr Not sure, it looks like it was intermittent, but I have yet to experience it again since opening this issue. Was it addressed somehow in the later versions?
Might be, I'll close this for now. Feel free to reopen.
I am still having problems with lambdas not invoking the local live server. Here are the logs:
sst dev
time=2024-06-27T16:33:08.247-03:00 level=INFO msg=iot topic=ion/project/candrew/02ff98b3f7ed49efaa08b8f966b1667f/init
time=2024-06-27T16:33:08.247-03:00 level=INFO msg="running function" runtime=nodejs20.x functionID=CronEveryHourHandler
time=2024-06-27T16:33:08.247-03:00 level=INFO msg="starting worker" env="...redacted..."
time=2024-06-27T16:33:08.248-03:00 level=INFO msg="worker died" workerID=02ff98b3f7ed49efaa08b8f966b1667f
lambda cloudwatch
INIT_REPORT Init Duration: 10009.14 ms Phase: init Status: timeout
2024/06/27 19:33:06 INFO getting endpoint
2024/06/27 19:33:07 INFO found endpoint endpoint url="...redacted..."
2024/06/27 19:33:07 INFO signed request url="...redacted..."
2024/06/27 19:33:07 INFO connecting to iot clientID=02ff98b3f7ed49efaa08b8f966b1667f
2024/06/27 19:33:07 INFO mqtt connected
2024/06/27 19:33:07 INFO prefix prefix=ion/project/candrew/02ff98b3f7ed49efaa08b8f966b1667f
2024/06/27 19:33:07 INFO get lambda runtime api url=127.0.0.1:9001
2024/06/27 19:33:07 INFO connecting to lambda runtime api
2024/06/27 19:33:07 INFO waiting for response
INIT_REPORT Init Duration: 300022.10 ms Phase: invoke Status: timeout
START RequestId: 5a0a218a-1b42-49a9-bf7b-1004e9447011 Version: $LATEST
2024-06-27T19:38:06.127Z 5a0a218a-1b42-49a9-bf7b-1004e9447011 Task timed out after 300.04 seconds
END RequestId: 5a0a218a-1b42-49a9-bf7b-1004e9447011
REPORT RequestId: 5a0a218a-1b42-49a9-bf7b-1004e9447011 Duration: 300042.37 ms Billed Duration: 300000 ms Memory Size: 128 MB Max Memory Used: 18 MB