openHPI / poseidon

Scalable task execution orchestrator for CodeOcean


Sentry Integration causes OOM kill

MrSerth opened this issue · comments

Today, we had another occurrence of Poseidon being OOM killed.

Stack Trace

Jan 24 08:27:45 poseidon-terraform poseidon[624]: heap profile: 194: 751991664 [9538813: 1723733127104] @ heap/1048576
Jan 24 08:27:45 poseidon-terraform poseidon[624]: 1: 83886080 [41: 3439329280] @ 0x4a9d4e 0x4a977d 0x4a9959 0x719b1d 0x71aade 0x71c3b6 0x71beef 0x71aade 0x71c70f 0x71aade 0x71c70f 0x717fd3 0x717b99 0x717490 0x78e7d9 0x78ea3c 0x78f889 0x7784d2 0x77bb0d 0x77baf7 0x78b2fd 0x47b8df 0x78a5a5 0x78a56f 0xe46015 0x6e6a69 0x6e974e 0x6e5634 0x470781
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x4a9d4d#011bytes.growSlice+0x8d#011#011#011#011#011#011/usr/lib/go-1.21/src/bytes/buffer.go:249
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x4a977c#011bytes.(*Buffer).grow+0x13c#011#011#011#011#011/usr/lib/go-1.21/src/bytes/buffer.go:151
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x4a9958#011bytes.(*Buffer).Write+0x58#011#011#011#011#011/usr/lib/go-1.21/src/bytes/buffer.go:179
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x719b1c#011encoding/json.uintEncoder+0x13c#011#011#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:525
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71aadd#011encoding/json.structEncoder.encode+0x21d#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:706
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71c3b5#011encoding/json.arrayEncoder.encode+0xd5#011#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:849
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71beee#011encoding/json.sliceEncoder.encode+0x30e#011#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:822
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71aadd#011encoding/json.structEncoder.encode+0x21d#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:706
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71c70e#011encoding/json.ptrEncoder.encode+0x20e#011#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:878
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71aadd#011encoding/json.structEncoder.encode+0x21d#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:706
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71c70e#011encoding/json.ptrEncoder.encode+0x20e#011#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:878
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x717fd2#011encoding/json.(*encodeState).reflectValue+0x72#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:323
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x717b98#011encoding/json.(*encodeState).marshal+0xb8#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:295
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x71748f#011encoding/json.Marshal+0xcf#011#011#011#011#011/usr/lib/go-1.21/src/encoding/json/encode.go:162
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x78e7d8#011github.com/getsentry/sentry-go.envelopeFromBody+0x518#011#011/home/poseidon/go/pkg/mod/github.com/getsentry/sentry-go@v0.25.0/transport.go:196
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x78ea3b#011github.com/getsentry/sentry-go.getRequestFromEvent+0xfb#011#011/home/poseidon/go/pkg/mod/github.com/getsentry/sentry-go@v0.25.0/transport.go:232
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x78f888#011github.com/getsentry/sentry-go.(*HTTPTransport).SendEvent+0xc8#011/home/poseidon/go/pkg/mod/github.com/getsentry/sentry-go@v0.25.0/transport.go:359
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x7784d1#011github.com/getsentry/sentry-go.(*Client).processEvent+0x2b1#011/home/poseidon/go/pkg/mod/github.com/getsentry/sentry-go@v0.25.0/client.go:637
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x77bb0c#011github.com/getsentry/sentry-go.(*Client).CaptureEvent+0x6c#011/home/poseidon/go/pkg/mod/github.com/getsentry/sentry-go@v0.25.0/client.go:447
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x77baf6#011github.com/getsentry/sentry-go.(*Hub).CaptureEvent+0x56#011#011/home/poseidon/go/pkg/mod/github.com/getsentry/sentry-go@v0.25.0/hub.go:224
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x78b2fc#011github.com/getsentry/sentry-go.(*Span).doFinish+0x15c#011#011/home/poseidon/go/pkg/mod/github.com/getsentry/sentry-go@v0.25.0/tracing.go:356
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x47b8de#011sync.(*Once).doSlow+0xbe#011#011#011#011#011/usr/lib/go-1.21/src/sync/once.go:74
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x78a5a4#011sync.(*Once).Do+0x44#011#011#011#011#011#011/usr/lib/go-1.21/src/sync/once.go:65
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x78a56e#011github.com/getsentry/sentry-go.(*Span).Finish+0xe#011#011/home/poseidon/go/pkg/mod/github.com/getsentry/sentry-go@v0.25.0/tracing.go:205
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110xe46014#011main.initServer.(*Handler).Handle.(*Handler).handle.func1+0x314#011/home/poseidon/go/pkg/mod/github.com/getsentry/sentry-go@v0.25.0/http/sentryhttp.go:119
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x6e6a68#011net/http.HandlerFunc.ServeHTTP+0x28#011#011#011#011/usr/lib/go-1.21/src/net/http/server.go:2136
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x6e974d#011net/http.serverHandler.ServeHTTP+0x8d#011#011#011#011/usr/lib/go-1.21/src/net/http/server.go:2938
Jan 24 08:27:45 poseidon-terraform poseidon[624]: ##0110x6e5633#011net/http.(*conn).serve+0x5f3#011#011#011#011#011/usr/lib/go-1.21/src/net/http/server.go:2009


At the same time, InfluxDB also showed erroneous behaviour:

Log Excerpt

Jan 24 08:27:25 poseidon-terraform poseidon[624]: 2024/01/24 08:27:25 influxdb2client E! Write error: internal error: unexpected error writing points to database: timeout, batch kept for retrying
Jan 24 08:27:25 poseidon-terraform poseidon[624]: time="2024-01-24T08:27:25.603586Z" level=debug code=200 duration=15.765001853s method=GET path=/api/v1/runners/29-9a31d753-ba91-11ee-8c54-fa163e351aea/files user_agent="Faraday v2.9.0"
Jan 24 08:27:25 poseidon-terraform poseidon[624]: time="2024-01-24T08:27:25.604447Z" level=debug code=200 duration=16.191098103s method=GET path=/api/v1/runners/29-9a31d753-ba91-11ee-8c54-fa163e351aea/websocket user_agent=
Jan 24 08:27:37 poseidon-terraform telegraf[632]: 2024-01-24T08:27:37Z E! [outputs.influxdb_v2] When writing to [http://monitoring-terraform.compute.internal:8086]: Post "http://monitoring-terraform.compute.internal:8086/api/v2/write?bucket=telegraf&org=codeocean": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Jan 24 08:27:37 poseidon-terraform telegraf[632]: 2024-01-24T08:27:37Z E! [agent] Error writing to outputs.influxdb_v2: failed to send metrics to any configured server(s)
Jan 24 08:27:38 poseidon-terraform poseidon[624]: 2024/01/24 08:27:38 influxdb2client E! Write error: internal error: unexpected error writing points to database: timeout, batch kept for retrying

We decided to focus on the Sentry part first (postponing InfluxDB). Maybe the OOM kill is related to the buffer of Sentry events awaiting transmission to the relay. If so, it could well be connected to the high memory usage we experienced during network outages.
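As far as the sentry-go docs describe it, the HTTPTransport queues pending events in a buffered channel of BufferSize entries (default 30) and drops new events once it is full. Note that this bound is on event count, not bytes: if single events reach ~80 MB (as the 83886080-byte allocation in the profile above suggests), even a modest queue could account for gigabytes. A minimal stdlib sketch of this drop-when-full behavior (the string event type is a placeholder, not sentry-go's actual event struct):

```go
package main

import "fmt"

// sendOrDrop mimics the enqueue step of a bounded transport buffer:
// a non-blocking send into a buffered channel, dropping the event
// when the buffer is full so the queue length stays capped even
// while the relay is unreachable.
func sendOrDrop(buffer chan string, event string) bool {
	select {
	case buffer <- event:
		return true // enqueued for transmission
	default:
		return false // buffer full: drop instead of growing memory
	}
}

func main() {
	buffer := make(chan string, 2) // sentry-go's default BufferSize is 30
	fmt.Println(sendOrDrop(buffer, "event-1")) // true
	fmt.Println(sendOrDrop(buffer, "event-2")) // true
	fmt.Println(sendOrDrop(buffer, "event-3")) // false: dropped
}
```

The capped count still permits high byte usage, which is why a bounded buffer alone would not rule out the buffering hypothesis.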

WIP Questions to investigate:

  • Are we using the wrong metric in watchMemoryAndAlert? HeapAlloc vs. HeapInuse. [1], [2]
  • Only 751,991,664 bytes (~752 MB) were in use by Poseidon (HeapInuse). Why was it OOM killed?
  • Is the Sentry issue unrelated to the OOM kill and the Influx connection issues, given that Sentry is just marshaling some JSON with an ok-ish heap usage?
  • If not the heap usage, what else led Poseidon to reach its 4 GB RAM limit? Did it actually reach it, or is this issue just based on the Sentry warning? (No data due to the retention policy.)

We agreed to postpone this issue until the next occurrence.