brettwooldridge / NuProcess

Low-overhead, non-blocking I/O, external Process implementation for Java

Stdout and stderr handling can be extremely slow on Windows

bturner opened this issue

When used to pump processes that produce lots of output in small chunks (which happens a lot with git commands--especially git http-backend and git upload-pack, which are used to serve clones and fetches), the existing I/O completion handler code in ProcessCompletions runs afoul of how IOCP decides when to signal ready reads. When some data is available to read, IOCP waits briefly to see whether more data arrives before signaling the completion. For short-lived processes, or processes that produce their output in big chunks, that works fine. But if the process produces its output in many tiny pieces, that extra delay adds up to a huge performance hit.
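For context, NuProcess delivers child output through handler callbacks rather than a blocking InputStream, so every read completion turns into a callback. This is a minimal sketch of that pumping pattern against the public NuProcess API; the command, class name, and byte counting are illustrative only, not the Bitbucket Server code:

```java
import com.zaxxer.nuprocess.NuAbstractProcessHandler;
import com.zaxxer.nuprocess.NuProcess;
import com.zaxxer.nuprocess.NuProcessBuilder;

import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;

public class PumpStdoutExample {
    public static void main(String[] args) throws InterruptedException {
        // Illustrative command only; the real workload is git http-backend / git upload-pack
        NuProcessBuilder builder = new NuProcessBuilder("git", "upload-pack", "--stateless-rpc", ".");
        builder.setProcessListener(new NuAbstractProcessHandler() {
            private long bytes;

            @Override
            public void onStdout(ByteBuffer buffer, boolean closed) {
                // Each read completion hands over whatever was sitting in the pipe buffer.
                // When the child emits many tiny chunks, these callbacks fire very often,
                // so any per-read delay on the Windows side multiplies quickly.
                bytes += buffer.remaining();
                buffer.position(buffer.limit()); // mark everything as consumed
            }

            @Override
            public void onExit(int statusCode) {
                System.out.println("exit=" + statusCode + ", stdout bytes=" + bytes);
            }
        });

        NuProcess process = builder.start();
        process.waitFor(0, TimeUnit.SECONDS); // a timeout of 0 means wait indefinitely
    }
}
```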

This came up for Bitbucket Server in BSERV-12599, where it was reported that, since we switched over to NuProcess to run git http-backend and git upload-pack, hosting operations on Windows are an order of magnitude slower than they were using ProcessBuilder and blocking I/O.

To give a sense of scale, cloning a 500MB repository (Bitbucket Server's own source) via Bitbucket Server using ProcessBuilder looks like this:

C:\Temp>git clone --no-checkout http://localhost:7990/bitbucket/scm/at/bitbucket-server.git
Cloning into 'bitbucket-server'...
remote: Enumerating objects: 2304995, done.
remote: Counting objects: 100% (2304995/2304995), done.
remote: Compressing objects: 100% (779774/779774), done.
remote: Total 2304995 (delta 1058323), reused 2304995 (delta 1058323), pack-reused 0
Receiving objects: 100% (2304995/2304995), 480.51 MiB | 33.96 MiB/s, done.
Resolving deltas: 100% (1058323/1058323), done.

480MB at 34MB/s, with the entire operation taking about 25 seconds (the 34MB/s transfer is only part of the overall time).

Switching over to NuProcess completely tanks performance:

C:\Temp>git clone --no-checkout http://localhost:7990/bitbucket/scm/at/bitbucket-server.git
Cloning into 'bitbucket-server'...
remote: Enumerating objects: 2304995, done.
remote: Counting objects: 100% (2304995/2304995), done.
remote: Compressing objects: 100% (779774/779774), done.
remote: Total 2304995 (delta 1058323), reused 2304995 (delta 1058323), pack-reused 0
Receiving objects: 100% (2304995/2304995), 480.51 MiB | 2.08 MiB/s, done.
Resolving deltas: 100% (1058323/1058323), done.

We've dropped from 34MB/s to 2MB/s, and the overall operation now takes over 3 minutes. For larger repositories the difference is even more painful, taking clones that previously ran in 30-60 seconds and blowing them out to 10-15 minutes. That causes load to stack up on Bitbucket Server, which eventually leads to requests being rejected due to excessive queuing.

I stripped out all of Bitbucket Server's code and wrote a test in NuProcess that runs git http-backend directly, with the right stdin and environment to produce the same effective operation. (Unfortunately this test isn't really shareable because it relies on some canned stdin I captured, as well as access to a specific Git repository.) With that test, I'm able to reproduce the performance issue without any Bitbucket Server code at all. (It's worth noting that the same test on Linux or macOS performs fine, with NuProcess throughput essentially identical to ProcessBuilder's.)
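For comparison, the ProcessBuilder baseline in that kind of test is just a blocking drain of stdout with timing around it. A rough sketch of that baseline follows; the command and buffer size are placeholders, not the captured stdin or repository from the actual test:

```java
import java.io.InputStream;
import java.util.concurrent.TimeUnit;

public class ProcessBuilderBaseline {
    public static void main(String[] args) throws Exception {
        // Placeholder command; the real test drives git http-backend with captured stdin
        // and a specific repository, which isn't shareable.
        ProcessBuilder builder = new ProcessBuilder("git", "upload-pack", "--stateless-rpc", ".");
        builder.redirectErrorStream(false);

        long start = System.nanoTime();
        long bytes = 0;
        Process process = builder.start();
        try (InputStream stdout = process.getInputStream()) {
            byte[] chunk = new byte[64 * 1024];
            int read;
            while ((read = stdout.read(chunk)) != -1) {
                bytes += read; // blocking reads; the pipe itself provides backpressure
            }
        }
        process.waitFor();

        double seconds = (System.nanoTime() - start) / 1_000_000_000.0;
        System.out.printf("%,d bytes in %.1fs (%.1f MiB/s)%n",
                bytes, seconds, bytes / (1024.0 * 1024.0) / seconds);
    }
}
```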

In trying to track down the issue, I looked through the JDK's source and found it uses 4096 + 24 byte buffers for its pipes. Changing WindowsProcess.BUFFER_SIZE from 64K to 4096 + 24 fixes the issue and brings NuProcess's throughput in line with ProcessBuilder's.
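Sketched out, the experiment above amounts to a one-line change to the pipe buffer constant. The field name is as given in this issue; the surrounding pipe-creation code in com.zaxxer.nuprocess.windows.WindowsProcess is not reproduced here:

```java
// Sketch only: the real field lives in com.zaxxer.nuprocess.windows.WindowsProcess.
final class WindowsProcessBufferSketch {
    // private static final int BUFFER_SIZE = 65536;   // original 64K pipe buffer
    private static final int BUFFER_SIZE = 4096 + 24;  // match the JDK's pipe buffer sizing
}
```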

A colleague helping me dig into this found some other cases where IOCP's Nagle-like behavior has caused problems:

I believe reducing the pipe buffer size in WindowsProcess fixes the performance problem we're seeing because, once the pipe buffer is full, there's no reason to wait for any more input--there's no room for it anyway. With that (small) delay avoided, the overall operation ends up being much, much faster despite needing to move more, smaller buffers.