Support scalability tests
xyhuang opened this issue · comments
Kubebench should be extended to support scalability tests in two ways:
- run a job with many workers
- run many jobs in parallel
It should also be able to collect metrics during the runs and make it easy to analyze the results; see #124 for this.
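The two modes above differ only in where the fan-out happens: one manifest with a large worker replica count, or many near-identical manifests submitted in parallel. A minimal sketch of generating manifests for either case (the `TFJob` kind/apiVersion and field names here mirror the Kubeflow CRD, but the exact schema kubebench consumes is an assumption, and `make_jobs` is a hypothetical helper):

```python
import copy

# Template manifest; the schema below is illustrative, not kubebench's contract.
BASE_JOB = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "scale-test"},
    "spec": {"tfReplicaSpecs": {"Worker": {"replicas": 1}}},
}

def make_jobs(count, workers_per_job=1):
    """Return `count` manifests, each uniquely named, with `workers_per_job` workers.

    count=1 with a large workers_per_job covers the "many workers" case;
    a large count with workers_per_job=1 covers "many jobs in parallel".
    """
    jobs = []
    for i in range(count):
        job = copy.deepcopy(BASE_JOB)  # deep copy so jobs don't share nested dicts
        job["metadata"]["name"] = f"scale-test-{i}"
        job["spec"]["tfReplicaSpecs"]["Worker"]["replicas"] = workers_per_job
        jobs.append(job)
    return jobs

jobs = make_jobs(100, workers_per_job=4)
print(len(jobs), jobs[0]["metadata"]["name"])  # 100 scale-test-0
```

In a real test harness each manifest would then be submitted through the Kubernetes API; generating them separately keeps the fan-out logic testable without a cluster.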
Also related:
kubeflow/training-operator#830
/priority p1
/priority p2
No eng resource assigned, priority 2
@xyhuang are you planning to take this on? If so, please assign yourself and tag us so that PMs can add this to the kanban board.
/remove-priority p1
@chrisheecho sorry for the late response. I will try to implement this if I have time, but we will likely move it to 0.5. I agree with keeping it as p2 for now.
Let's try to have this for 0.5.
A few things to consider:
1. How should we automate this?
   I think it makes sense to create a periodic Prow workflow that runs this daily. We should use a separate GCP project so that the scale tests don't interfere with regular presubmits.
2. What tests should we run?
   As a starting point, we can consider these items for tf-operator:
   - Lots of workers: kubeflow/training-operator#830
   - Lots of concurrent jobs: kubeflow/training-operator#829
   What would it take for kubebench to support these?
3. Collecting metrics
   - Error rates
   - Latency (how fast does each pending job get processed? And how long does it take for a worker to start?)
   - Throughput (how many jobs/workers can we run concurrently?)
   - CPU/memory usage of the operator
4. Dashboard
   We currently have a kubebench-dashboard. Can we use it to track load test results?
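The latency, throughput, and error-rate numbers above can all be derived from per-job event timestamps. A sketch of that reduction (the record field names `submitted`/`started`/`finished`/`succeeded` are assumptions, not kubebench's actual result schema):

```python
from statistics import mean

def summarize(runs):
    """Reduce per-job event records to scale-test summary metrics.

    Each record: submitted/started/finished timestamps (seconds) and a
    succeeded flag.
    """
    # Latency: how long each job waited in pending before it started.
    pending_latency = [r["started"] - r["submitted"] for r in runs]
    # Error rate: fraction of jobs that did not succeed.
    error_rate = sum(1 for r in runs if not r["succeeded"]) / len(runs)
    # Throughput: jobs completed per second over the whole test window.
    window = max(r["finished"] for r in runs) - min(r["submitted"] for r in runs)
    return {
        "mean_pending_latency_s": mean(pending_latency),
        "error_rate": error_rate,
        "jobs_per_second": len(runs) / window,
    }

stats = summarize([
    {"submitted": 0, "started": 2, "finished": 10, "succeeded": True},
    {"submitted": 1, "started": 4, "finished": 12, "succeeded": False},
])
print(stats)
```

For the two sample records this yields a mean pending latency of 2.5 s, an error rate of 0.5, and 2/12 jobs per second; the operator's CPU/memory usage would come from the monitoring stack rather than from job records.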
@richardsliu here are some quick answers; we can discuss in more detail:
1 & 2: Agreed. I will create a few issues to track the required changes; they should be doable within 0.5.
3: Today, benchmark metrics/results are collected in two ways: (1) sending run-time metrics through monitoring infrastructure (e.g. Prometheus), and (2) collecting job performance numbers through a user-defined "postjob", which interprets the outputs of the TF jobs. If the required info can be collected in one of these ways it should be easy; otherwise we will figure it out.
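To illustrate the second path, a postjob essentially scrapes performance numbers out of the TF job's output. A minimal sketch (the log format is TensorFlow's `global_step/sec` estimator logging; the parsing contract and `extract_steps_per_sec` helper are assumptions, since real kubebench postjobs define their own input/output interface):

```python
import re

# Sample TF job output, as a postjob might read it from a log file.
LOG = """
INFO:tensorflow:global_step/sec: 12.3
INFO:tensorflow:global_step/sec: 15.1
"""

def extract_steps_per_sec(log_text):
    """Return every 'global_step/sec' throughput sample found in the log."""
    return [float(m) for m in re.findall(r"global_step/sec: ([\d.]+)", log_text)]

samples = extract_steps_per_sec(LOG)
print(samples)  # [12.3, 15.1]
```

The postjob would then write such numbers to wherever the result backend expects them, which is exactly the piece item 4 below proposes moving onto gcloud resources.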
4: Possible with some changes. Currently we use an on-prem backend for result storage/visualization; for tracking results in Prow tests it's probably easier to leverage gcloud resources (Bigtable? Stackdriver?), which should be supported with a few small changes.
That said, here is what I think is needed:
- create a Prow workflow (for 1)
- add support for benchmarking concurrent jobs in a single kubebench job (for 2)
- minor workload improvements (YAML configs, plus minor code changes if needed) for the actual tests (for 2)
- support collecting the required metrics (for 3)
- leverage a gcloud backend for result storage and visualization (for 4)
Let's split up the work. I can take care of item 1 (set up project, cluster, and Prow workflow).