Live stage downtime
chrisandrewcl opened this issue · comments
Sometimes I notice my lambdas are invoked but no invocation is logged in the local `sst dev` log. It is not the logging that fails; it is the connection between the lambda and the local server that silently fails. Restarting the `sst dev` process fixes the issue, but it is not always clear when it is happening.
Any ideas why this happens?
Also, given usage reports and observed behavior, it seems that the lambdas that failed to call home keep running, doing nothing, until their timeout expires, which in some cases is very wasteful. Even if the server is restarted, already-running lambdas wait out their full timeout before trying to connect again.
Maybe this can be improved somehow?
Some ideas:
- Short bridge timeout, aborting earlier with clear feedback
- When the local server restarts, if the bridge is still running it should connect right away
- Make the bridge return ok to avoid stuck retry loops
- Sum of the above, but configurable
* This has been happening with several versions, but I have yet to test 0.0.403.
** Not sure if it is the correct term, but by "bridge" I mean the code that the lambda executes in live mode to connect with the local server.
Yeah I guess let's first figure out why the connection is getting dropped. Can you share some logs for when that happens? Both on the Cloudwatch side and locally?
Or if you have any clues as to when this happens. Are you leaving the CLI running overnight?
> Can you share some logs for when that happens? Both on the Cloudwatch side and locally?
Ok, I'll look for it next time it happens.
> Or if you have any clues as to when this happens. Are you leaving the CLI running overnight?
Not overnight, just normal usage for a few hours.
* But some of the suggestions above came from an unfortunate incident during my initial experiments: I left in a hurry and didn't notice that `sst remove` had failed, so a single stray SQS message, whose consumer had no proper redrive policy, kept eating through my dev account's free tier and gave me an unexpected bill. It would be nice to have more budget-friendly behavior when the dev server for a live stage is not running. If you prefer to keep this issue focused on the disconnection problem, I can open another one for this part instead. Please let me know.
It disconnects quite often for me too, not sure why.
✓ No changes
api: https://******.lambda-url.us-east-1.on.aws/
time=2024-06-06T15:31:54.429+02:00 level=INFO msg="INFO unlocking app=*** stage=***"
time=2024-06-06T15:31:54.850+02:00 level=INFO msg="file event" path=***/sst-env.d.ts op=CHMOD
time=2024-06-06T15:31:54.850+02:00 level=INFO msg=publishing type=*watcher.FileChangedEvent
time=2024-06-06T15:31:54.850+02:00 level=INFO msg="checking if code needs to be rebuilt" file=***/sst-env.d.ts
time=2024-06-06T15:31:54.971+02:00 level=INFO msg="waiting for file changes"
time=2024-06-06T15:32:39.140+02:00 level=INFO msg="mqtt connection lost" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time=2024-06-06T15:32:39.140+02:00 level=INFO msg="mqtt reconnecting"
time=2024-06-06T15:32:40.111+02:00 level=INFO msg="mqtt connected"
time=2024-06-06T15:33:40.115+02:00 level=INFO msg="mqtt connection lost" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time=2024-06-06T15:33:40.117+02:00 level=INFO msg="mqtt reconnecting"
time=2024-06-06T15:33:41.021+02:00 level=INFO msg="mqtt connected"
time=2024-06-06T15:34:41.022+02:00 level=INFO msg="mqtt connection lost" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time=2024-06-06T15:34:41.022+02:00 level=INFO msg="mqtt reconnecting"
time=2024-06-06T15:34:42.023+02:00 level=INFO msg="mqtt connected"
time=2024-06-06T15:35:42.029+02:00 level=INFO msg="mqtt connection lost" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time=2024-06-06T15:35:42.030+02:00 level=INFO msg="mqtt reconnecting"
time=2024-06-06T15:35:42.942+02:00 level=INFO msg="mqtt connected"
time=2024-06-06T15:36:42.981+02:00 level=INFO msg="mqtt reconnecting"
time=2024-06-06T15:36:42.981+02:00 level=INFO msg="mqtt connection lost" error="websocket: close 1006 (abnormal closure): unexpected EOF"
time=2024-06-06T15:36:44.820+02:00 level=INFO msg="mqtt reconnecting"
time=2024-06-06T15:36:46.499+02:00 level=INFO msg="mqtt reconnecting"
EDIT: It might have been related to the fact that I forgot to turn off my VPN 🙃 . Checking now if it works as expected without it.
hey was this resolved?
@thdxr Not sure, it looks like it was intermittent, but I have yet to experience it again since opening this issue. Was it addressed somehow in the later versions?
Might be, I'll close this for now. Feel free to reopen.
I am still having problems with lambdas not invoking the local live server. Here are the logs:
sst dev
time=2024-06-27T16:33:08.247-03:00 level=INFO msg=iot topic=ion/project/candrew/02ff98b3f7ed49efaa08b8f966b1667f/init
time=2024-06-27T16:33:08.247-03:00 level=INFO msg="running function" runtime=nodejs20.x functionID=CronEveryHourHandler
time=2024-06-27T16:33:08.247-03:00 level=INFO msg="starting worker" env="...redacted..."
time=2024-06-27T16:33:08.248-03:00 level=INFO msg="worker died" workerID=02ff98b3f7ed49efaa08b8f966b1667f
lambda cloudwatch
INIT_REPORT Init Duration: 10009.14 ms Phase: init Status: timeout
2024/06/27 19:33:06 INFO getting endpoint
2024/06/27 19:33:07 INFO found endpoint endpoint url="...redacted..."
2024/06/27 19:33:07 INFO signed request url="...redacted..."
2024/06/27 19:33:07 INFO connecting to iot clientID=02ff98b3f7ed49efaa08b8f966b1667f
2024/06/27 19:33:07 INFO mqtt connected
2024/06/27 19:33:07 INFO prefix prefix=ion/project/candrew/02ff98b3f7ed49efaa08b8f966b1667f
2024/06/27 19:33:07 INFO get lambda runtime api url=127.0.0.1:9001
2024/06/27 19:33:07 INFO connecting to lambda runtime api
2024/06/27 19:33:07 INFO waiting for response
INIT_REPORT Init Duration: 300022.10 ms Phase: invoke Status: timeout
START RequestId: 5a0a218a-1b42-49a9-bf7b-1004e9447011 Version: $LATEST
2024-06-27T19:38:06.127Z 5a0a218a-1b42-49a9-bf7b-1004e9447011 Task timed out after 300.04 seconds
END RequestId: 5a0a218a-1b42-49a9-bf7b-1004e9447011
REPORT RequestId: 5a0a218a-1b42-49a9-bf7b-1004e9447011 Duration: 300042.37 ms Billed Duration: 300000 ms Memory Size: 128 MB Max Memory Used: 18 MB