Specifying Folders In `buf generate --path` Arguments Causes Massive Performance Hit
JesseObrien opened this issue
Hi, I've been chatting through this problem on the buf slack in a thread here. I've been discussing it with @jhump for the most part.
The problem arises when we call `buf generate` with a folder in the `--path` arguments versus calling it with files in the `--path` arguments.
In simple terms:
A) `buf generate --config=buf.yaml --path=/foo/bar/directory` takes ~1 minute to generate `.ts` files for 11 `.proto` files nested in that directory.
B) `buf generate --config=buf.yaml --path=/foo/bar/directory/file1.proto,/foo/bar/directory/file2.proto,...` takes <1 second to generate `.ts` files for 11 `.proto` files nested in the same directory.
The root folder we're calling `buf generate` from is a very large monorepo with hundreds of thousands of files. If we do not recursively expand all `.proto` files and inject them into that one `--path` argument (or specify them as 11 separate `--path` arguments), `buf generate` becomes 60+ times slower.
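For anyone who needs the same workaround, the expansion step can be sketched roughly as follows. This is a minimal sketch, not an official recipe: `proto_paths` is a hypothetical helper name, and it assumes file names contain no commas or newlines. It only builds the comma-separated list in the form used in B) above; the `buf generate` call itself is left as a comment.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of buf): recursively collect every .proto
# file under a directory and join the paths with commas.
# Assumption: no file name contains a comma or a newline.
proto_paths() {
  local dir="$1"
  find "$dir" -type f -name '*.proto' | LC_ALL=C sort | paste -sd, -
}

# Intended usage, mirroring command B) from this issue:
#   buf generate --config=buf.yaml --path="$(proto_paths /foo/bar/directory)"
```

Sorting is only there to keep the argument list stable between runs; it is not required for the workaround itself.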
If I can provide any more context, let me know. I verified this by running `buf generate` several times with the folder specified instead of the expanded file list, to make sure it's the folder argument that's causing it.
That doesn't seem that unexpected - if `buf` has to search `/foo/bar/directory` for all relevant `.proto` files, that's going to take some time (and I'm certain that however `buf` searches for it is not as optimized as some typical bash tools are) - we can look into optimizing that path a bit, but searching a directory with 100,000+ files for 11 specific `.proto` files is going to take some time.
@bufdev, IIRC, the `foo/bar/directory` folder does not have that many files. The issue is that the "input" to `buf` was unspecified, and thus defaulted to the current working directory. The current working directory is the root of the repo and is huge. When `--path` indicates a file, it is fast. But it seems like `--path` with a directory name isn't actually looking only at that one directory but is instead collecting everything in the "input" module (so scanning the huge repo root directory) and then filtering the result based on prefix match. (The above is my suspicion based on the observed behavior; I haven't gone through the implementation code yet to confirm what it's doing.)
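To make the two behaviors concrete, here is a rough analogy using `find`. It is not `buf`'s implementation, only an illustration of the suspicion above; the function names `scan_root_then_filter` and `scan_subdir_only` are made up for this sketch. Both return the same files, but the first one has to visit every file under the root before filtering.

```shell
#!/usr/bin/env bash

# Suspected behavior: walk the entire input root, then keep only the
# entries whose path starts with the requested prefix.
scan_root_then_filter() {
  local root="$1" prefix="$2"
  (cd "$root" && find . -type f -name '*.proto' \
    | sed 's|^\./||' \
    | awk -v p="$prefix/" 'index($0, p) == 1' \
    | LC_ALL=C sort)
}

# Expected behavior: walk only the requested subdirectory.
scan_subdir_only() {
  local root="$1" prefix="$2"
  (cd "$root" && find "./$prefix" -type f -name '*.proto' \
    | sed 's|^\./||' \
    | LC_ALL=C sort)
}
```

In a repo root with hundreds of thousands of files, the cost of the first form scales with the size of the whole tree, while the second scales only with the size of `foo/bar/directory`, which would be consistent with the gap between A) and B).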
That shouldn't be the case - we have optimized for that scenario, so it should only do the search on the directory specified in `--path`. There may be a regression - we have to play with this locally.