kyma-project / hydroform

Infrastructure SDK for provisioning and managing Kubernetes clusters based on Terraform

Parallel install: improve resource consumption in parallel-install

akgalwas opened this issue

Description

While testing the Provisioner with the parallel-install library (PR-678), I noticed significant resource consumption. I understand that by design the new installer consumes more resources; however, we should analyse the parallel-install code base to check what could be improved.

Reasons

Stress tests were performed with the Provisioner. The following scenarios were executed (a sketch of such a test driver follows the list):

  1. Running 15 parallel installations (8 components each) with the current resource limit settings. Some operations will wait in the Provisioner's queue.
  2. Running 10 parallel installations (8 components each) with various resource limit settings. Some operations will wait in the Provisioner's queue.
  3. Running 5 parallel installations (19 components each) with various resource limit settings. All operations will run at once.
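
For reference, a minimal sketch of such a test driver, assuming a hypothetical `installCluster` function in place of the real parallel-install entry point. The buffered channel models the Provisioner's provisioning queue of size 5, so runs beyond that capacity wait, as in scenarios 1 and 2:

```go
package main

import (
	"fmt"
	"sync"
)

// installCluster stands in for one installation run driven through the
// parallel-install library; the actual library call is elided here.
func installCluster(id, components int) error {
	// ... install `components` components into cluster `id` ...
	return nil
}

func main() {
	const (
		requestedRuns = 15 // scenario 1; scenarios 2 and 3 use 10 and 5
		componentsPer = 8  // scenarios 1 and 2; scenario 3 uses 19
		queueSize     = 5  // size of the Provisioner's provisioning queue
	)

	sem := make(chan struct{}, queueSize) // runs beyond this capacity wait in the queue
	var wg sync.WaitGroup
	for i := 0; i < requestedRuns; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a queue slot
			defer func() { <-sem }() // release it when done
			if err := installCluster(id, componentsPer); err != nil {
				fmt.Printf("installation %d failed: %v\n", id, err)
			}
		}(i)
	}
	wg.Wait()
}
```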

Conclusions:

  • It is impossible to execute any of the above scenarios with the current resource limits (cpu: 400m, memory: 1Gi); as a result, the Provisioner's container was restarted.
  • Increasing the resource limits to 1 CPU and 2Gi improved the situation; some crashes were still observed, but they were caused by the Provisioner and #316.
  • Comparing resource usage in scenario 3 between parallel-install and the Kyma Operator method, I noticed the following:
    • The resource usage characteristics differ in the two cases. For parallel-install, the peak CPU and memory consumption is higher and lasts longer.
    • For the Kyma Operator-based installation, the highest observed resource usage was 494m CPU and 505 MiB memory. For parallel-install, it was 1033m CPU and 1006 MiB memory.
  • We can treat the scenario with 5 parallel installations as very close to production, as the Provisioner's provisioning queue has size 5. Importantly, there are also other queues (for deprovisioning and upgrade) with 5 workers each, so there could be up to 15 concurrent provisioning, deprovisioning, and upgrade operations. The overall limit should therefore probably be something like 3 CPU and 3Gi (see the resource-limit sketch after this list).
  • Measurements should be repeated when crashes observed during testing are fixed.
  • In order to fully understand the performance characteristics of parallel-install, some benchmarks need to be performed. This is crucial for scenarios like the one with the Provisioner (see the benchmark sketch after this list).
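
As an illustration of the suggested limits, here is a minimal sketch using the Kubernetes Go API. The function name is hypothetical; the real values live in the Provisioner's deployment manifest, and only the proposed 3 CPU / 3Gi figures come from this issue:

```go
package provisioner

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// suggestedProvisionerLimits mirrors the limits proposed in this issue.
// The name is illustrative, not part of the actual code base.
func suggestedProvisionerLimits() corev1.ResourceList {
	return corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("3"),   // up from the current 400m
		corev1.ResourceMemory: resource.MustParse("3Gi"), // up from the current 1Gi
	}
}
```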
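
For the benchmarks, a minimal sketch using Go's standard benchmarking support; `deployKyma` is a placeholder for the actual parallel-install call, which is not reproduced here:

```go
package install

import "testing"

// deployKyma is a placeholder for the parallel-install entry point under
// test; the real library call is elided.
func deployKyma(components int) error {
	// ... deploy `components` components ...
	return nil
}

// BenchmarkParallelInstall runs one full deployment per iteration.
// Run with `go test -bench=. -benchmem` to also capture allocation
// figures, which can be compared against the container measurements above.
func BenchmarkParallelInstall(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		if err := deployKyma(19); err != nil { // 19 components, as in scenario 3
			b.Fatal(err)
		}
	}
}
```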

This issue or PR has been automatically marked as stale due to the lack of recent activity.
Thank you for your contributions.

This bot triages issues and PRs according to the following rules:

  • After 60d of inactivity, lifecycle/stale is applied
  • After 7d of inactivity since lifecycle/stale was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Close this issue or PR with /close

If you think that I work incorrectly, kindly raise an issue with the problem.

/lifecycle stale

This issue is no longer valid; the parallel-install module has been removed.