OOM Crashes on Juno Pod After Restart During Heavy Load

wojciechos opened this issue 2 months ago

Increased traffic to the starknet_call method pushed CPU usage on our k8s pod to 100%, causing request failures and block sync issues. Every subsequent restart of the pod ended in an OOM kill at startup. After we replaced the database with a fresh one, the pod started and synced normally with no OOM errors, which suggests the old DB may have been corrupted.

k8s Logs:

terminated
Reason: OOMKilled - exit code: 137
Started at: 2024-04-19T15:14:04+05:30
Finished at: 2024-04-19T15:14:51+05:30

Possible Causes:

Potential database corruption during restarts combined with high CPU load.
Recent Pebble updates

//UPDATE - 06.05.2024
The pod reached its CPU limit and could not keep up with syncing, which resulted in failed requests.
Actions taken: added more pods and restarted the pod, with no improvement.
Resolution: Removing and replacing the DB resolved the issue.
Next steps: Prioritize investigating and fixing the underlying cause.

06-05-2024-incident.pdf