Support scalability tests
xyhuang opened this issue · comments
Kubebench should be extended to support scalability tests in two ways:
- run a job with many workers
- run many jobs in parallel
It should also be able to collect metrics during the runs and make it easy to analyze the results; see #124 for this.
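The two modes above differ only in where the fan-out happens: one manifest with a large worker replica count, or many near-identical manifests submitted in parallel. A minimal sketch of generating manifests for either case (the `TFJob` kind/apiVersion and field names here mirror the Kubeflow CRD, but the exact schema kubebench consumes is an assumption, and `make_jobs` is a hypothetical helper):

```python
import copy

# Template manifest; the schema below is illustrative, not kubebench's contract.
BASE_JOB = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "scale-test"},
    "spec": {"tfReplicaSpecs": {"Worker": {"replicas": 1}}},
}

def make_jobs(count, workers_per_job=1):
    """Return `count` manifests, each uniquely named, with `workers_per_job` workers.

    count=1 with a large workers_per_job covers the "many workers" case;
    a large count with workers_per_job=1 covers "many jobs in parallel".
    """
    jobs = []
    for i in range(count):
        job = copy.deepcopy(BASE_JOB)  # deep copy so jobs don't share nested dicts
        job["metadata"]["name"] = f"scale-test-{i}"
        job["spec"]["tfReplicaSpecs"]["Worker"]["replicas"] = workers_per_job
        jobs.append(job)
    return jobs

jobs = make_jobs(100, workers_per_job=4)
print(len(jobs), jobs[0]["metadata"]["name"])  # 100 scale-test-0
```

In a real test harness each manifest would then be submitted through the Kubernetes API; generating them separately keeps the fan-out logic testable without a cluster.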
Also related:
kubeflow/training-operator#830
/priority p1
/priority p2
No eng resource assigned, priority 2
@xyhuang are you planning to take this on? If so, please assign yourself and tag us so that PMs can add this to the kanban board.
/remove-priority p1
@chrisheecho sorry for the late response. I will try to implement this if I have time, but we will likely move it to 0.5. I agree with keeping it as p2 for now.
Let's try to have this for 0.5.
A few things to consider:
1. How should we automate this?
   I think it makes sense to create a periodic Prow workflow that runs this daily. We should use a separate GCP project so that the scale tests don't interfere with regular presubmits.
2. What tests should we run?
   As a starting point, we can consider these items for tf-operator:
   - Lots of workers: kubeflow/training-operator#830
   - Lots of concurrent jobs: kubeflow/training-operator#829
   What would it take for kubebench to support these?
3. Collecting metrics
   - Error rates
   - Latency (how fast does each pending job get processed? And how long does it take for a worker to start?)
   - Throughput (how many jobs/workers can we run concurrently?)
   - CPU/memory usage of the operator
4. Dashboard
   We currently have a kubebench-dashboard. Can we use it to track load test results?
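The latency, throughput, and error-rate numbers above can all be derived from per-job event timestamps. A sketch of that reduction (the record field names `submitted`/`started`/`finished`/`succeeded` are assumptions, not kubebench's actual result schema):

```python
from statistics import mean

def summarize(runs):
    """Reduce per-job event records to scale-test summary metrics.

    Each record: submitted/started/finished timestamps (seconds) and a
    succeeded flag.
    """
    # Latency: how long each job waited in pending before it started.
    pending_latency = [r["started"] - r["submitted"] for r in runs]
    # Error rate: fraction of jobs that did not succeed.
    error_rate = sum(1 for r in runs if not r["succeeded"]) / len(runs)
    # Throughput: jobs completed per second over the whole test window.
    window = max(r["finished"] for r in runs) - min(r["submitted"] for r in runs)
    return {
        "mean_pending_latency_s": mean(pending_latency),
        "error_rate": error_rate,
        "jobs_per_second": len(runs) / window,
    }

stats = summarize([
    {"submitted": 0, "started": 2, "finished": 10, "succeeded": True},
    {"submitted": 1, "started": 4, "finished": 12, "succeeded": False},
])
print(stats)
```

For the two sample records this yields a mean pending latency of 2.5 s, an error rate of 0.5, and 2/12 jobs per second; the operator's CPU/memory usage would come from the monitoring stack rather than from job records.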
@richardsliu here are some quick answers; we can discuss in more detail:
1 & 2: Agreed. I will create a few issues to track the required changes; they should be doable within 0.5.
3: Today, benchmark metrics/results are collected in two ways: (1) sending run-time metrics through monitoring infrastructure (e.g. Prometheus), and (2) collecting job performance numbers through a user-defined "postjob", which interprets the outputs of the TF jobs. If the required info can be collected in one of these ways it should be easy; otherwise we will figure it out.
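To illustrate the second path, a postjob essentially scrapes performance numbers out of the TF job's output. A minimal sketch (the log format is TensorFlow's `global_step/sec` estimator logging; the parsing contract and `extract_steps_per_sec` helper are assumptions, since real kubebench postjobs define their own input/output interface):

```python
import re

# Sample TF job output, as a postjob might read it from a log file.
LOG = """
INFO:tensorflow:global_step/sec: 12.3
INFO:tensorflow:global_step/sec: 15.1
"""

def extract_steps_per_sec(log_text):
    """Return every 'global_step/sec' throughput sample found in the log."""
    return [float(m) for m in re.findall(r"global_step/sec: ([\d.]+)", log_text)]

samples = extract_steps_per_sec(LOG)
print(samples)  # [12.3, 15.1]
```

The postjob would then write such numbers to wherever the result backend expects them, which is exactly the piece item 4 below proposes moving onto gcloud resources.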
4: Possible with some changes. Currently we use an on-prem backend for result storage/visualization; for tracking results in Prow tests it's probably easier to leverage gcloud resources (Bigtable? Stackdriver?), which should be supported with a few small changes.
That said, here is what I think is needed:
- create a Prow workflow (for 1)
- add support for benchmarking concurrent jobs in a single kubebench job (for 2)
- minor workload improvements (YAML configs, plus minor code changes if needed) for the actual tests (for 2)
- support collecting the required metrics (for 3)
- leverage a gcloud backend for result storage and visualization (for 4)
Let's split up the work. I can take care of item 1 (set up project, cluster, and Prow workflow).