kubeflow / kubebench

Repository for benchmarking


Support scalability tests

xyhuang opened this issue

Kubebench should be extended to support scalability tests in two ways:

  • run a job with many workers
  • run many jobs in parallel

It should also be able to collect metrics during the runs and make it easy to analyze the results; see #124 for this.
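To make the "many workers" case concrete, here is a minimal sketch of the kind of workload involved: a single TFJob that simply scales out its Worker replica count. The image, namespace, and replica count are placeholders, and the exact apiVersion depends on the tf-operator version in use:

```yaml
# Sketch of a scale-test workload: one TFJob with many workers.
# Older tf-operator releases may use kubeflow.org/v1beta1 or v1beta2.
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: kubebench-scale-many-workers
  namespace: kubeflow            # assumed namespace
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 100              # scale this up to stress the operator
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow   # TFJob requires this container name
              image: gcr.io/<your-project>/tf-benchmark:latest  # placeholder
              resources:
                requests:
                  cpu: "1"
                  memory: 1Gi
```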

Also related to kubeflow/training-operator#830

/priority p1

/priority p2

No eng resource assigned, priority 2

@xyhuang are you planning to take this on? If so, please assign yourself and tag us so that PMs can add this to the kanban board.

/remove-priority p1

@chrisheecho sorry for the late response. I will try to implement this if I get time, but we will likely move it to 0.5. I agree with keeping it as p2 for now.

@xyhuang @swiftdiaries

Let's try to have this for 0.5.

A few things to consider:

  1. How should we automate this?
    I think it makes sense to create a periodic Prow workflow that runs this daily. We should use a separate GCP project so that the scale tests don't interfere with regular presubmits (a sketch of such a periodic job is at the end of this comment).

  2. What tests should we run?
    As a starting point we can consider these items for tf-operator:

What would it take for kubebench to support these?

  3. Collecting metrics
  • Error rates
  • Latency (how fast does each pending job get processed? And how long does it take for a worker to start?)
  • Throughput (how many jobs/workers can we run concurrently?)
  • CPU/memory usage of the operator
  4. Dashboard
    We currently have a kubebench-dashboard. Can we use it to track load test results?
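For the automation question (item 1), a periodic Prow job along these lines could work. This is a rough sketch assuming a dedicated GCP project; the job name, image, and entrypoint script are placeholders rather than anything that exists today:

```yaml
# Prow config fragment: run the scale test once a day in its own project.
periodics:
  - name: kubebench-scale-test-daily
    interval: 24h
    decorate: true
    spec:
      containers:
        - name: runner
          image: gcr.io/<scale-test-project>/kubebench-scale-runner:latest  # placeholder
          command:
            - /workspace/run-scale-test.sh        # hypothetical entrypoint
          args:
            - --project=<scale-test-project>      # separate project for scale tests
            - --cluster=kubebench-scale
```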

@richardsliu here are some quick answers; we can discuss in more detail:

1 & 2 agree, i will create a few issues to track required changes, they should be doable within 0.5.
3 today benchmark metrics/results are collected in 2 ways: (1) sending run-time metrics through monitoring infra (e.g. prometheus) (2) collecting job performance numbers through a user-defined "postjob", which interprets the outputs from tf jobs. if required info can be collected in one of these ways it should be easy, else we will figure it out.
4 possible with some changes. currently we use a on-prem backend for result storage/viz, for tracking results in prow tests it's probably easier to leverage gcloud resources (bigtable? stackdriver?), that should be supported with a few small changes.
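To illustrate collection path (1), here is a minimal Prometheus scrape-config sketch that picks up pods opting in via the conventional prometheus.io annotations; the namespace is an assumption:

```yaml
# prometheus.yml fragment: scrape annotated pods in the kubeflow namespace.
scrape_configs:
  - job_name: kubebench-scale-metrics
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [kubeflow]    # assumed namespace
    relabel_configs:
      # Keep only pods that opt in with prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Scrape the port declared in prometheus.io/port, if present.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
```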

That being said, here is what I think is needed:

  • create a Prow workflow (for 1)
  • add support for benchmarking concurrent jobs in a single kubebench job (for 2; see the hypothetical config sketch after this list)
  • minor workload improvements (YAML configs plus minor code changes if needed) for the actual tests (for 2)
  • support collecting the required metrics (for 3)
  • leverage a GCP backend for result storage and viz (for 4)
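For the concurrent-jobs bullet, one possible shape for the config, as a purely hypothetical sketch: kubebench has no such support today, and the `parallelism` key, kind, and manifest path below are all invented for illustration:

```yaml
# Hypothetical kubebench config: fan out N copies of the same job spec.
# Neither `parallelism` nor this exact layout exists in kubebench today.
apiVersion: kubebench.operators.kubeflow.org/v1alpha1   # assumed group/version
kind: KubebenchJob
metadata:
  name: scale-test-parallel-jobs
spec:
  parallelism: 50                  # invented field: number of concurrent TFJobs
  jobTemplate:
    manifest: tf-cnn/tf-job.yaml   # placeholder path to a workload manifest
  postJob:
    image: gcr.io/<project>/kubebench-post:latest   # placeholder results collector
```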

Let's split up the work. I can take care of item 1 (set up project, cluster, and Prow workflow).