carvel-dev / kapp

kapp is a simple deployment tool focused on the concept of "Kubernetes application" — a set of resources with the same label

Home Page: https://carvel.dev/kapp

Performance enhancements

praveenrewar opened this issue

Describe the problem/challenge you have
We rely on list API calls to get information from the cluster, which can put significant load on the API server when the number of objects returned is high. As the number of apps deployed with kapp grows (for example, with kapp-controller packages), this becomes a problem: beyond a certain point deployment times increase even though the CPU and memory of the cluster nodes are not under pressure. Symptoms we have observed:

  • Client-side throttling warnings when multiple kapp apps are being used at the same time.
  • "socket: too many open files" errors when ulimit is set to a low number (e.g. 256).

Describe the solution you'd like
We need to minimise list calls as much as possible; replacing them with get or watch calls is also an option.
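
As a rough illustration of the trade-off, here is a minimal sketch using client-go's dynamic client (the GVR, label selector, namespace, and name below are placeholders, not kapp's actual code):

```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// fetchResources contrasts the two access patterns: the labeled list touches
// every matching object in the cluster, while the get touches exactly one.
func fetchResources(ctx context.Context, dyn dynamic.Interface, appLabel, ns, name string) error {
	gvr := schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "deployments"}

	// Cluster-wide labeled list: cost grows with the number of matching
	// objects, and the call counts against API Priority and Fairness.
	if _, err := dyn.Resource(gvr).List(ctx, metav1.ListOptions{LabelSelector: appLabel}); err != nil {
		return err
	}

	// Targeted get: roughly constant cost once the resource identity is
	// already known (e.g. from the app's previously recorded state).
	_, err := dyn.Resource(gvr).Namespace(ns).Get(ctx, name, metav1.GetOptions{})
	return err
}
```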

Tasks

  • Instead of fetching all server resources, fetch only the ones related to the available GKs (we discard the others later anyway, so they are never used). - Cancelled for the time being

    • Spike: This requires more code changes and makes the code less readable, so we will revisit it later if required.
  • When we deploy an app, we first list the labelled resources (GVs) and then get, one by one, the non-labelled resources that were not found in the first step. When an app is deployed for the first time, the first step always returns nothing, so maybe we could skip it?

    • Spike: How would this fit into the kapp codebase? -> PR
  • Use watch instead of get and list while waiting for resources to reconcile (see the watch-based sketch after this list). Watch will be helpful for resources that take longer to reconcile (for example, Deployments), but for resources that reconcile almost immediately (for example, ConfigMaps) it might add some overhead.

    • Spike: Test whether adjusting the wait-check interval helps with this.
      We increased wait-check-interval to 3s, as it reduces API calls without noticeably affecting deployment time.
      The PR for this change is here, and the data collected during the spike can be found here.
    • Spike: Test whether using watch improves performance. -> Since the increased wait-check-interval is already giving better results, we can look at watch later. Prioritising the remaining items for now.
  • When a CRD and its CR are present in the same manifest, we fetch the server resources again to find the CRD (since it wasn't present in the cached server resources). We should avoid doing this, as we won't find the CRD this time either. (No need to work on this if we already work on the first item.)

  • Now that we add the resource namespaces to fallbackAllowedNamespaces, should we always use fallbackAllowedNamespaces instead of checking resources cluster-wide? (See the namespace-scoped listing sketch after this list.)

    • Spike: Figure out if scoping to fallbackAllowedNamespaces could have any side effects (testing).
  • Currently we store the unique GKs in the meta ConfigMap and do a list per GK. Since list calls are more expensive, we could check whether issuing get calls for all the resources is cheaper than list calls for the unique GKs.

  • Improve performance specifically during the diff stage. Go profiling showed that there are too many calls to deepCopy and AsYAMLBytes. PR
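
For the watch-based waiting item above, here is a minimal sketch (not kapp's implementation) of waiting for a single Deployment with one watch instead of a get per wait-check interval; the namespace and name are placeholders:

```go
package main

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForDeployment blocks until the named Deployment reports all replicas
// available, using a single watch instead of polling get on an interval.
func waitForDeployment(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	w, err := client.AppsV1().Deployments(ns).Watch(ctx, metav1.ListOptions{
		// Watching a single object by name keeps the watch cheap; the
		// current state arrives as an initial synthetic ADDED event.
		FieldSelector: "metadata.name=" + name,
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		if ev.Type == watch.Error {
			return fmt.Errorf("watch error: %v", ev.Object)
		}
		d, ok := ev.Object.(*appsv1.Deployment)
		if !ok {
			continue
		}
		if d.Spec.Replicas != nil && d.Status.AvailableReplicas >= *d.Spec.Replicas {
			return nil
		}
	}
	return fmt.Errorf("watch closed before %s/%s became available", ns, name)
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := waitForDeployment(context.Background(), client, "default", "my-app"); err != nil {
		panic(err)
	}
	fmt.Println("deployment available")
}
```

As noted in the task, a watch like this pays off for slow-reconciling resources; for resources that are ready almost immediately, establishing the watch may cost more than a single get.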
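
For the fallbackAllowedNamespaces item, here is a sketch of the difference between a cluster-wide list and lists scoped to the namespaces an app is known to touch (the label selector and namespaces are illustrative, not kapp's actual values):

```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listAppConfigMaps shows both scopes for the same labeled lookup.
func listAppConfigMaps(ctx context.Context, client kubernetes.Interface, namespaces []string, appLabel string) error {
	// Cluster-wide: one call, but the apiserver has to consider every
	// namespace and the response can span many objects.
	if _, err := client.CoreV1().ConfigMaps(metav1.NamespaceAll).List(ctx,
		metav1.ListOptions{LabelSelector: appLabel}); err != nil {
		return err
	}

	// Namespace-scoped: one call per namespace the app is known to touch;
	// each call is bounded by that namespace's object count.
	for _, ns := range namespaces {
		if _, err := client.CoreV1().ConfigMaps(ns).List(ctx,
			metav1.ListOptions{LabelSelector: appLabel}); err != nil {
			return err
		}
	}
	return nil
}
```

Whether the per-namespace calls end up cheaper depends on how many namespaces an app spans, which is part of what the spike above would need to confirm.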

Anything else you would like to add:
It might be worth understanding API Priority and Fairness.


Vote on this request

This is an invitation to the community to vote on issues to help us prioritize our backlog. Use the "smiley face" reaction at the top right of this comment to vote.

👍 "I would like to see this addressed as soon as possible"
👎 "There are other more important things to focus on right now"

We are also happy to receive and review Pull Requests if you want to help work on this issue.

Do we have reason to believe that it is the list calls adding to the burden rather than the get calls in the wait stage?
I believe the latter would be higher in number.

Not sure if it will be helpful, but this KEP elaborates on the thought process and goals of API fairness and priority in detail.

In particular, both list and get calls will be counted against the API fairness and priority budget in a way that watch calls are not (there's a separate budget for those, but the assumption is that they are long-running and the cost of the initial population is amortized over the duration of the watch, possibly in conjunction with the golang informer cache).
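
For reference, here is a minimal sketch of the shared informer pattern mentioned above: one list plus a long-lived watch populate a local cache, and subsequent reads are served from memory instead of hitting the apiserver (the namespace and name are placeholders):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// One list + watch per resource type, shared by every reader.
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	deployLister := factory.Apps().V1().Deployments().Lister()

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)

	// Served from the in-memory cache: no additional API call.
	d, err := deployLister.Deployments("default").Get("my-app")
	if err != nil {
		panic(err)
	}
	fmt.Println("observed generation:", d.Status.ObservedGeneration)
}
```

This amortizes well for a long-running controller; for a one-shot CLI invocation the initial list + watch may not pay for itself, which is the same trade-off called out for watch in the task list above.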

Is the one change listed here for the initial list the only performance change needed?

Do you need help setting up a test environment?

Hi @evankanderson, I didn't mean to close it, but it got closed along with the PR. We are still working on some of the items from the list (although we are not able to spend many cycles on it). Thank you so much for the help :)

This issue is being marked as stale due to a long period of inactivity and will be closed in 5 days if there is no response.