hatoo / oha

Ohayou (おはよう), HTTP load generator, inspired by rakyll/hey with tui animation.

Frequent OOM kills - Isolated to `http2` - 50 GB of RAM usage on MacBook, causing system OOM

huntharo opened this issue

Intro

First off - awesome program! This solves the problems I have with hey where it just gets slower when the RTT times increase even though the remote service can support the throughput. The animated UI really helps understand what's happening without having to wait for the whole thing to finish, which I love. Thanks for this!

Pairing

I can pair on this with you if you want. Google Meet or similar is fine. My email is on my profile.

Problem

  • I've encountered numerous OOM kills when using the program
  • The problems seem to happen more frequently with http2 and/or may only happen with http2
  • Had a case where my MacBook said it was out of memory (32 GB of RAM)
    • oha was using 50 GB of RAM!
    • This was not an unusual test; it just started failing in some way that causes a memory leak or accumulation
  • The case below, on AWS CloudShell, runs for ~40 seconds each time before getting OOM killed
    • Everything is initially fine with CPU usage around 30% and memory usage around 0.5%
    • The program appears to freeze around 40 seconds
    • Memory usage shoots up to 25%, 35%, and more after the freeze
    • CPU usage shoots up to 100% after the freeze starts
    • The program exits reliably with exit code 137 (pretty sure this is an OOM kill)
    • Runs to completion if --http2 is removed and -c is adjusted to match the total of -c * -p used with --http2 (example below)
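
For reference, the equivalent run without --http2 (an illustrative command, matching -c 10 * -p 20 = 200 concurrent requests against the same endpoint as the CloudShell command below) would look something like:

./oha-linux-amd64 -t 15s -z 5m -c 200 https://lambdadispatch.ghpublic.pwrdrvr.com/read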

Killed Terminations on CloudShell

Note: this endpoint is not open to arbitrary IPs, but it is located in us-east-2 and I can add an IP to the security group for you if you'd like to test against it.

./oha-linux-amd64 -t 15s -z 5m --http2 -c 10 -p 20 https://lambdadispatch.ghpublic.pwrdrvr.com/read

Killed at 43 seconds

[screenshot]

Happens with --no-tui Too

[screenshot]

Does NOT Happen without --http2 with Same Worker Count

[screenshot]

Nearly Final Memory

[screenshot]

Memory When the Problem Starts - ~30 seconds of operation

[screenshot]

Initial Memory - 2,000 RPS and Stable (0.4% memory usage)

[screenshot]

Thanks for your report!

I did some investigation and found some weirdly huge memory consumption in oha against a specific HTTP server backend (such as node18's https module), although that isn't exactly the same phenomenon as this issue.

Could you tell me about your server's technology? (language, library, etc..)

> Could you tell me about your server's technology? (language, library, etc..)

The pwrdrvr.com domain name above points at an AWS ALB which supports both HTTP2 and HTTP1.1. The ALB, in turn, points at a dotnet 8 Kestrel web server using HTTP1.1, which points at a proxy inside Lambda functions using HTTP2, which finally proxies the request to a Node.js app. The Node.js app reads from DynamoDB and returns an average of 1.6 KB of payload through the layers above.

But essentially, the problem appears to happen when speaking to an AWS ALB with HTTP2.

The local problem where oha used 50 GB of RAM was potentially pointed at the Node.js app using HTTP2, or at the dotnet 8 Kestrel server using HTTP1.1 (I don't have HTTP2 enabled for the Kestrel server, but I can enable it). The details are fuzzy because it has only happened once so far.

I could probably set up an AWS ALB route for you that just returns a constant response string, and I bet the issue will happen with that.

I've got a solid lead now. I was running the code under the debugger and looking at heap profiles using pprof.

What I noticed from normal operations is that memory accumulates steadily. The primary usage, I think, is coming from a vector that holds all the results, so that makes sense. Maybe that can be reduced if the detailed stats are not needed until the end, but maybe it cannot.

Then I was thinking that Node.js, locally, never sends a GOAWAY frame on an http2 socket, while the ALB likely does send one after some number of requests per socket, say 10,000 or 100,000.

I realized that the problem likely occurs when the sockets are gracefully closed by the server. To simulate that case I just ctrl-c'd my node.js process and, sure enough, oha started racing at 800% CPU, and RAM usage in the dev container went from 3% of total to 20%, 40%, 80%, and then an OOM kill.
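
Instead of ctrl-c'ing the server, a closer simulation of the ALB would be to have the local server itself close each HTTP/2 session after a fixed number of requests, which sends GOAWAY. Here is a rough sketch of a helper for that (my own illustration, not part of oha or the real test app; the limit is made up, and it assumes a Node http2 server like the /ping sketch after the repro steps below):

```ts
import * as http2 from "node:http2";

// Illustrative helper: gracefully close each HTTP/2 session after `limit` streams.
// Http2Session.close() sends a GOAWAY frame and lets in-flight streams finish,
// which is roughly what the ALB seems to do per connection.
function capSessionStreams(server: http2.Http2SecureServer, limit = 10_000): void {
  server.on("session", (session) => {
    let streams = 0;
    session.on("stream", () => {
      streams += 1;
      if (streams >= limit) {
        session.close();
      }
    });
  });
}

// Usage: call capSessionStreams(server) right after creating the server,
// e.g. the /ping server sketched after the repro steps below.
```

With this in place I'd expect oha to hit the same runaway CPU/memory once the first sessions are closed, without having to kill the whole server.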

tl;dr - To Reproduce

  1. For tests with server: Start node app locally with TLS cert and http2 support (probably any other stack would be fine too, just have http2; a minimal sketch of such a server is included after this list)
  2. Start server, start http2 test: cargo run --release -- -c 20 -p 20 -z 5m --http2 --insecure https://host.docker.internal:3001/ping
    a. Ctrl-c the http2 server
    b. Observe memory usage of oha with top - it will start jumping rapidly until the process is OOM killed within a few seconds to tens of seconds
    c. CPU usage will jump to 800%, if available
    d. UI becomes unresponsive and prints no further info
    e. UI cannot be ctrl-c'd
  3. Do NOT restart server, start http2 test: cargo run --release -- -c 20 -p 20 -z 5m --http2 --insecure https://host.docker.internal:3001/ping
    a. If oha is started when the server is not running, it will report 20 refused connections and immediately exit. This is not the same as what happens if the connections are established but then lost.
  4. Start server, start http1.1 test: cargo run --release -- -c 200 -z 10m --insecure https://host.docker.internal:3001/ping
    a. Ctrl-C the http server
    b. Observe that oha reports refused connections and remains responsive for http1.1 when the server goes away
    c. Observe that the UI remains responsive and can be ctrl-c'd
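
For step 1, here is a minimal sketch of the kind of local server I mean (illustrative only; the cert paths are placeholders, and the port matches the host.docker.internal:3001 URL used in the commands above):

```ts
import * as http2 from "node:http2";
import * as fs from "node:fs";

// Minimal HTTP/2 + TLS test server for the repro above. Any http2-capable stack
// should work; this one answers /ping with a tiny plaintext body on port 3001.
const server = http2.createSecureServer({
  key: fs.readFileSync("server.key"),   // self-signed key/cert (hence --insecure on the oha side)
  cert: fs.readFileSync("server.crt"),
});

server.on("stream", (stream, headers) => {
  if (headers[":path"] === "/ping") {
    stream.respond({ ":status": 200, "content-type": "text/plain" });
    stream.end("pong");
  } else {
    stream.respond({ ":status": 404 });
    stream.end();
  }
});

// Ctrl-C'ing this process (step 2a) is what triggers the runaway CPU/memory in oha.
server.listen(3001, () => console.log("listening on https://localhost:3001"));
```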

Thank you. It's very helpful!

I've succeeded in reproducing it. I will work on this Saturday.

I have submitted a partial PR that handles this in mostly the same way it is handled for HTTP1.1: #363

I have a couple of to-dos in the PR description.