Kodiak missing from checks
styfle opened this issue
Kodiak is missing from the checks on this PR: vercel/vercel#10133
Even though the automerge label is applied, it seems like it will never merge https://github.com/vercel/vercel/blob/main/.kodiak.toml#L4C1-L4C1
Is kodiak down right now?
I've been looking at this for a while and I think the issue is the number of webhook events in the installation's queue.
Kodiak has a single queue per installation to process all webhooks. Whenever a PR is merged, Kodiak enqueues all open PRs (up to 100 per repo) to recheck merge status. So if you have repos with a large number of open PRs, there's going to be a large number of PRs added to the queue.
There are currently 321 PRs in the queue for evaluation, so Kodiak will take some time to work through them given GitHub API rate limits.
I wonder if we can filter out more PRs here or increase concurrency: lines 214 to 215 in d6f1720.
I think the move going forward is to update Kodiak to have more tasks working on the webhook queue. If we have rate limit issues, we can figure something else out.
> up to 100 per repo to recheck merge status. So if you have repos with a large number of open PRs, there's going to be a large number of PRs added to the queue.
Is it per repo or per org? The repo only has 56 open PRs, which is less than the "100 per repo" limit.
I tested it and confirmed that all 56 return in a single API response:
curl -sSL 'https://api.github.com/repos/vercel/vercel/pulls?per_page=100' | jq '. | length'
It’s per organization, so it can add up across repos
Since PRs are technically Issues, you could change the API call from /pulls to /issues and filter by the label.
For example, this returns the 1 open PR that has the pr: automerge label:
https://api.github.com/repos/vercel/vercel/issues?per_page=100&labels=pr%3A+automerge
That only works when the repo is configured with automerge_label, but it could reduce the number of PRs in the queue quite a bit.
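A minimal sketch of that suggestion (the helper names are my own; only the public GitHub REST endpoints above are assumed): build the /issues URL with a labels filter, then keep only the results that are actually PRs, since /issues returns both.

```python
import json
import urllib.parse
import urllib.request


def issues_url(repo: str, label: str, per_page: int = 100) -> str:
    """Build the /issues URL with a label filter, as suggested above."""
    query = urllib.parse.urlencode({"per_page": per_page, "labels": label})
    return f"https://api.github.com/repos/{repo}/issues?{query}"


def only_pull_requests(issues: list) -> list:
    """The /issues endpoint returns issues and PRs mixed together;
    PRs are distinguished by the presence of a "pull_request" key."""
    return [item for item in issues if "pull_request" in item]


def fetch_automerge_prs(repo: str, label: str = "pr: automerge") -> list:
    """Fetch open PRs carrying the automerge label (hypothetical helper)."""
    with urllib.request.urlopen(issues_url(repo, label)) as resp:
        return only_pull_requests(json.load(resp))
```

Note that `urlencode` produces exactly the query string in the example URL above (`labels=pr%3A+automerge`).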
My plan is to increase the concurrency for webhook ingestion.
I've deployed a change to increase the concurrency for your organization to 5 ingest workers. I'm going to monitor the GitHub API rate limits and increase the worker count if rate limiting isn't an issue.
I realized yesterday that Kodiak throttles API requests to 5000/hour by default. Since your organization is on GitHub Enterprise, Kodiak has a 15,000/hour rate limit.
I've deployed a change that enables your organization to utilize the 15k rate limit
Wonderful thanks!
I'll close this issue for now and I can re-open if it happens again 👍
Re-opening since the problem is happening again now.
Kodiak shows nothing in the queue, but there is a PR that says it's 3rd in the queue and has been for about 12 minutes now.
It might be the rate limit. The logging around rate limits is lacking, so I'll turn down the internal Kodiak limit for now.
I added some better logging; it seems like we're hitting a different rate limit.
I'm seeing errors for GraphQL queries (https://api.github.com/graphql) and check-runs (https://api.github.com/repos/vercel/api/check-runs):
{
"documentation_url": "https://docs.github.com/en/free-pro-team@latest/rest/overview/resources-in-the-rest-api#secondary-rate-limits",
"message": "You have exceeded a secondary rate limit. Please wait a few minutes before you try again."
}
I think we can update our throttling logic to wait for the "retry-after" time when we hit a secondary rate limit.
I'm actually unsure here; looking at the /rate_limit API, we have plenty of room. The only header we're receiving on the secondary rate limit error is retry-after: 60.
I think it may be because we're making many requests concurrently, which the GitHub docs explicitly call out as a no-no 😬 : https://docs.github.com/en/rest/guides/best-practices-for-integrators?apiVersion=2022-11-28#dealing-with-secondary-rate-limits
> Make requests for a single user or client ID serially. Do not make requests for a single user or client ID concurrently.
We could wait 60 seconds after hitting this limit, but that's going to slow things down.
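The retry-after idea could look roughly like this (a hypothetical helper, not Kodiak's actual throttling code): on a 403/429 response, sleep for whatever the retry-after header says, defaulting to the 60 seconds we've been seeing when the header is absent.

```python
import time


def secondary_limit_backoff(status: int, headers: dict) -> int:
    """Seconds to wait before retrying a request that hit a rate limit.
    Returns 0 for responses that aren't rate-limit errors."""
    if status in (403, 429):
        return int(headers.get("retry-after", 60))
    return 0


def request_with_backoff(do_request):
    """Retry a request after sleeping out the secondary-limit backoff.
    `do_request` returns (status, headers, body). Note: real code should
    also inspect the error body, since 403 can mean things other than a
    secondary rate limit."""
    while True:
        status, headers, body = do_request()
        if status not in (403, 429):
            return body
        time.sleep(secondary_limit_backoff(status, headers))
```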
{
"resources": {
"core": {
"limit": 15000,
"used": 3,
"remaining": 14997,
"reset": 1688135146
},
"search": {
"limit": 30,
"used": 0,
"remaining": 30,
"reset": 1688131627
},
"graphql": {
"limit": 12500,
"used": 120,
"remaining": 12380,
"reset": 1688135139
},
"integration_manifest": {
"limit": 5000,
"used": 0,
"remaining": 5000,
"reset": 1688135167
},
"source_import": {
"limit": 100,
"used": 0,
"remaining": 100,
"reset": 1688131627
},
"code_scanning_upload": {
"limit": 1000,
"used": 0,
"remaining": 1000,
"reset": 1688135167
},
"actions_runner_registration": {
"limit": 10000,
"used": 0,
"remaining": 10000,
"reset": 1688135167
},
"scim": {
"limit": 15000,
"used": 0,
"remaining": 15000,
"reset": 1688135167
},
"dependency_snapshots": {
"limit": 100,
"used": 0,
"remaining": 100,
"reset": 1688131627
},
"code_search": {
"limit": 10,
"used": 0,
"remaining": 10,
"reset": 1688131627
}
},
"rate": {
"limit": 15000,
"used": 3,
"remaining": 14997,
"reset": 1688135146
}
}
Why 60 seconds? The linked docs say 1 second.
> If you're making a large number of POST, PATCH, PUT, or DELETE requests for a single user or client ID, wait at least one second between each request.
Oh, I was just saying that because that’s what the retry-after header said
I suppose if the rate limit is hit, then yes, it will have to wait for the duration of Retry-After.
But maybe setting a delay of 1 second between requests would ensure the rate limit is never hit?
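That one-second spacing could be a simple pacer (illustrative sketch; the class name and interval parameter are my own):

```python
import time


class RequestPacer:
    """Ensure at least `min_interval` seconds between successive requests,
    per GitHub's guidance on mutating requests for a single client ID."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = float("-inf")  # first call never waits

    def wait(self) -> None:
        """Block until at least `min_interval` has passed since the last call."""
        now = time.monotonic()
        delay = self._last + self.min_interval - now
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()
```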
This one has been ready to merge for about 20 minutes: vercel/vercel#10164
Thanks for the ping here. I think I need to synchronize the graphql calls so we only make one per installation at a time. It seems like the REST API isn't returning the secondary rate limit errors.
I triggered a refresh of that pull request and it merged (you can do this by editing the pull request title, description, labels, or commits).
Kodiak wasn't actively looking at this PR. I think Kodiak hit the secondary rate limit and dropped the event, because we don't retry on those errors.
My TODOs are:
- add some sort of lock so we only make one concurrent graphql request per installation
- when we hit a secondary rate limit, enqueue the pull request again for evaluation instead of dropping it
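Those two TODOs could look roughly like this (a sketch under my own naming, not Kodiak's actual implementation): an asyncio.Lock per installation serializes GraphQL calls, and a secondary-limit response puts the PR back on the queue rather than dropping it.

```python
import asyncio
from collections import defaultdict

# One lock per installation, so GraphQL calls for the same installation
# are never made concurrently (names here are hypothetical).
graphql_locks: defaultdict = defaultdict(asyncio.Lock)


async def evaluate_pr(installation, pr, queue, call_graphql):
    """Evaluate one PR; requeue it instead of dropping it on a
    secondary rate limit. Returns True if evaluation completed."""
    async with graphql_locks[installation]:
        status = await call_graphql(installation, pr)
    if status == 403:
        # Secondary rate limit: enqueue the PR again for a later attempt.
        await queue.put((installation, pr))
        return False
    return True
```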
I think that's because of the secondary rate limit. I'm going to try to temporarily reduce the GraphQL requests by limiting the ingest workers to 1.
Hey @styfle,
Sorry for the delayed reply. Last week, after my message above, I made a change to requeue pull requests on API rate limits, which I think ensures we no longer drop pull requests due to these limits.
I'll take a look at the PR you linked.
This error is unrelated to the original issue: https://github.com/vercel/next.js/runs/14992559502
Here’s the issue for this error: #853
I talked to GitHub a while back and the product team there said this was expected behavior