Kodiak missing from checks
styfle opened this issue
Kodiak is missing from the checks on this PR: vercel/vercel#10133
Even though the automerge label is applied, it seems like it will never merge https://github.com/vercel/vercel/blob/main/.kodiak.toml#L4C1-L4C1
Is kodiak down right now?
I've been looking at this for a while and I think the issue is the number of webhook events in the installation's queue.
Kodiak has a single queue per installation to process all webhooks. Whenever a PR is merged, Kodiak enqueues all open PRs (up to 100 per repo) to recheck merge status. So if you have repos with a large number of open PRs, there's going to be a large number of PRs added to the queue.
There are currently 321 PRs in the queue for evaluation, so Kodiak will take some time to work through them given GitHub API rate limits.
I wonder if we can filter out more PRs here or increase concurrency: lines 214 to 215 in d6f1720.
I think the move going forward is to update Kodiak to have more tasks working on the webhook queue. If we have rate limit issues, we can figure something else out.
> up to 100 per repo to recheck merge status. So if you have repos with a large number of open PRs, there's going to be a large number of PRs added to the queue.
Is it per repo or per org? The repo only has 56 open PRs, which is less than the "100 per repo" limit.
I tested it and confirmed that all 56 return in a single API response:
curl -sSL 'https://api.github.com/repos/vercel/vercel/pulls?per_page=100' | jq '. | length'
It’s per organization, so it can add up across repos
Since PRs are technically Issues, you could change the API call from /pulls to /issues and filter by the label.
For example, this returns the 1 open PR that has the pr: automerge label:
https://api.github.com/repos/vercel/vercel/issues?per_page=100&labels=pr%3A+automerge
That only works when the repo is configured with automerge_label, but it could reduce the number of PRs in the queue quite a bit.
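A minimal sketch of that suggestion (the helper names are my own; only the public GitHub REST endpoints above are assumed): build the /issues URL with a labels filter, then keep only the results that are actually PRs, since /issues returns both.

```python
import json
import urllib.parse
import urllib.request


def issues_url(repo: str, label: str, per_page: int = 100) -> str:
    """Build the /issues URL with a label filter, as suggested above."""
    query = urllib.parse.urlencode({"per_page": per_page, "labels": label})
    return f"https://api.github.com/repos/{repo}/issues?{query}"


def only_pull_requests(issues: list) -> list:
    """The /issues endpoint returns issues and PRs mixed together;
    PRs are distinguished by the presence of a "pull_request" key."""
    return [item for item in issues if "pull_request" in item]


def fetch_automerge_prs(repo: str, label: str = "pr: automerge") -> list:
    """Fetch open PRs carrying the automerge label (hypothetical helper)."""
    with urllib.request.urlopen(issues_url(repo, label)) as resp:
        return only_pull_requests(json.load(resp))
```

Note that `urlencode` produces exactly the query string in the example URL above (`labels=pr%3A+automerge`).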
My plan is to increase the concurrency for webhook ingestion.
I've deployed a change to increase the concurrency for your organization to 5 ingest workers. I'm going to monitor the GitHub API rate limits and increase the worker count if rate limiting isn't an issue.
I realized yesterday that Kodiak throttles API requests to 5000/hour by default. Since your organization is on GitHub Enterprise, Kodiak has a 15,000/hour rate limit.
I've deployed a change that enables your organization to utilize the 15k rate limit
Wonderful thanks!
I'll close this issue for now and I can re-open if it happens again 👍
Re-opening since the problem is happening again now.
Kodiak shows nothing in the queue, but there is a PR that says it's 3rd in the queue and has been for about 12 minutes now.
It might be the rate limit. The logging around rate limits is lacking, so I'll turn down the internal Kodiak limit for now.
I added some better logging; it seems like we're hitting a different rate limit.
I'm seeing errors for GraphQL queries (https://api.github.com/graphql) and check-runs (https://api.github.com/repos/vercel/api/check-runs):
{
"documentation_url": "https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#secondary-rate-limits",
"message": "You have exceeded a secondary rate limit. Please wait a few minutes before you try again."
}
I think we can update our throttling logic to wait for the "retry-after" time when we hit a secondary rate limit.
I'm actually unsure here; looking at the /rate_limit API, we have plenty of room. The only header we're receiving on the secondary rate limit error is retry-after: 60.
I think it may be because we're making many requests concurrently, which the GitHub docs explicitly call out as a no-no 😬 : https://docs.github.com/en/rest/guides/best-practices-for-integrators?apiVersion=2022-11-28#dealing-with-secondary-rate-limits
> Make requests for a single user or client ID serially. Do not make requests for a single user or client ID concurrently.
We could wait 60 seconds after hitting this limit, but that's going to slow things down.
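The retry-after idea could look roughly like this (a hypothetical helper, not Kodiak's actual throttling code): on a 403/429 response, sleep for whatever the retry-after header says, defaulting to the 60 seconds we've been seeing when the header is absent.

```python
import time


def secondary_limit_backoff(status: int, headers: dict) -> int:
    """Seconds to wait before retrying a request that hit a rate limit.
    Returns 0 for responses that aren't rate-limit errors."""
    if status in (403, 429):
        return int(headers.get("retry-after", 60))
    return 0


def request_with_backoff(do_request):
    """Retry a request after sleeping out the secondary-limit backoff.
    `do_request` returns (status, headers, body). Note: real code should
    also inspect the error body, since 403 can mean things other than a
    secondary rate limit."""
    while True:
        status, headers, body = do_request()
        if status not in (403, 429):
            return body
        time.sleep(secondary_limit_backoff(status, headers))
```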
{
"resources": {
"core": {
"limit": 15000,
"used": 3,
"remaining": 14997,
"reset": 1688135146
},
"search": {
"limit": 30,
"used": 0,
"remaining": 30,
"reset": 1688131627
},
"graphql": {
"limit": 12500,
"used": 120,
"remaining": 12380,
"reset": 1688135139
},
"integration_manifest": {
"limit": 5000,
"used": 0,
"remaining": 5000,
"reset": 1688135167
},
"source_import": {
"limit": 100,
"used": 0,
"remaining": 100,
"reset": 1688131627
},
"code_scanning_upload": {
"limit": 1000,
"used": 0,
"remaining": 1000,
"reset": 1688135167
},
"actions_runner_registration": {
"limit": 10000,
"used": 0,
"remaining": 10000,
"reset": 1688135167
},
"scim": {
"limit": 15000,
"used": 0,
"remaining": 15000,
"reset": 1688135167
},
"dependency_snapshots": {
"limit": 100,
"used": 0,
"remaining": 100,
"reset": 1688131627
},
"code_search": {
"limit": 10,
"used": 0,
"remaining": 10,
"reset": 1688131627
}
},
"rate": {
"limit": 15000,
"used": 3,
"remaining": 14997,
"reset": 1688135146
}
}
Why 60 seconds? The linked docs say 1 second.
> If you're making a large number of POST, PATCH, PUT, or DELETE requests for a single user or client ID, wait at least one second between each request.
Oh, I was just saying that because that’s what the retry-after header said
I suppose if the rate limit is hit, then yes, it will have to wait for the duration of Retry-After.
But maybe setting a delay of 1 second between requests would ensure the rate limit is never hit?
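That one-second spacing could be a simple pacer (illustrative sketch; the class name and interval parameter are my own):

```python
import time


class RequestPacer:
    """Ensure at least `min_interval` seconds between successive requests,
    per GitHub's guidance on mutating requests for a single client ID."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = float("-inf")  # first call never waits

    def wait(self) -> None:
        """Block until at least `min_interval` has passed since the last call."""
        now = time.monotonic()
        delay = self._last + self.min_interval - now
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()
```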
This one has been ready to merge for about 20 minutes: vercel/vercel#10164
Thanks for the ping here. I think I need to synchronize the graphql calls so we only make one per installation at a time. It seems like the REST API isn't returning the secondary rate limit errors.
I triggered a refresh of that pull request and it merged (you can do this by editing the pull request title, description, labels, or commits).
Kodiak wasn't actively looking at this PR. I think Kodiak hit the secondary rate limit and dropped the event, because we don't retry on those errors.
My TODOs are:
- add some sort of lock so we only make one concurrent graphql request per installation
- when we hit a secondary rate limit, enqueue the pull request again for evaluation instead of dropping it
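Those two TODOs could look roughly like this (a sketch under my own naming, not Kodiak's actual implementation): an asyncio.Lock per installation serializes GraphQL calls, and a secondary-limit response puts the PR back on the queue rather than dropping it.

```python
import asyncio
from collections import defaultdict

# One lock per installation, so GraphQL calls for the same installation
# are never made concurrently (names here are hypothetical).
graphql_locks: defaultdict = defaultdict(asyncio.Lock)


async def evaluate_pr(installation, pr, queue, call_graphql):
    """Evaluate one PR; requeue it instead of dropping it on a
    secondary rate limit. Returns True if evaluation completed."""
    async with graphql_locks[installation]:
        status = await call_graphql(installation, pr)
    if status == 403:
        # Secondary rate limit: enqueue the PR again for a later attempt.
        await queue.put((installation, pr))
        return False
    return True
```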
I think that's because of the secondary rate limit. I'm going to try to temporarily reduce the GraphQL requests by limiting the ingest workers to 1.
Hey @styfle,
Sorry for the delayed reply. Last week, after my message above, I made a change to requeue pull requests on API rate limits, which I think ensures we no longer drop pull requests due to these limits.
I'll take a look at the PR you linked.
This error is unrelated to the original issue: https://github.com/vercel/next.js/runs/14992559502
Here’s the issue for this error: #853
I talked to GitHub a while back and the product team there said this was expected behavior