kubeflow / testing

Test infrastructure and tooling for Kubeflow.

Auto-push Kubeflow deployments corresponding to different Kubeflow versions.

jlewi opened this issue

It would be nice if we could auto-push dev.kubeflow.org or perhaps create another environment, nightly-dev.kubeflow.org.

Certain changes, such as UI changes, are difficult to test manually. It would be useful to have an up-to-date environment.

I think we could easily adapt our E2E tests to do this and then just run it as a cron job.

It might be interesting to look at using Weave Flux:
https://www.weave.works/blog/gitops-operations-by-pull-request

Weave Flux supports keeping K8s infrastructure in sync with config defined in a git repository.

Not sure if ksonnet is supported yet.

Here's an idea for how to make progress:

  1. Use a cron job to regularly run recreate_app.sh
  2. Create a PR using the hub CLI for GitHub
  3. Manually approve the PR

Either use Argo CD, or a cron job that calls redeploy_app.sh, to keep the deployed instance in sync with what's checked in.
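A rough sketch of what one run of that cron job might look like, written as a Python wrapper around the commands. The script path, branch name, and commit message are assumptions for illustration; only `recreate_app.sh` and the hub CLI come from the steps above. By default it only prints the commands (a dry run) rather than executing them.

```python
import subprocess


def build_commands(branch, message, recreate_script="./recreate_app.sh"):
    """Return the argv lists for one refresh cycle, in order.

    recreate_script is a hypothetical path; the real location and
    arguments of recreate_app.sh are not specified in this issue.
    """
    return [
        ["git", "checkout", "-b", branch],       # work on a fresh branch
        [recreate_script],                       # regenerate the app
        ["git", "add", "--all"],                 # stage whatever changed
        ["git", "commit", "-m", message],
        ["git", "push", "origin", branch],
        ["hub", "pull-request", "-m", message],  # open the PR for manual approval
    ]


def run_cycle(branch, message, dry_run=True):
    """Run (or, with dry_run=True, just print) the commands for one cycle."""
    commands = build_commands(branch, message)
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True stops the cycle on the first failing command.
            subprocess.run(argv, check=True)
    return commands


if __name__ == "__main__":
    run_cycle("auto-update-app", "Regenerate the ksonnet app")
```

The PR is left for a human to approve, matching step 3; nothing here merges automatically.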

kubeflow/kubeflow#1100

Added a script for upgrading the ksonnet application.

We'd like to maintain a pool of Kubeflow deployments corresponding to different Kubeflow versions.

e.g.

kf-vX.Y-n00
...
kf-vX.Y-n05

The reason for having a pool of deployments for a given release (X.Y) is that we want to run tests against these clusters, and we don't want to interrupt those tests when deploying an updated version. By maintaining a pool and cycling through it, we can create a new deployment for new tests while letting already-running tests run to completion.

We will want to recycle the names because we want to have a fixed set of endpoints, e.g.

https://kf-v0-4-n00.endpoints.kubeflow-ci.cloud.goog/

The reason for having a fixed set of endpoints is that we have to manually set the redirect URIs on the OAuth credentials used for IAP. So by recycling the endpoints we don't have to manually update the OAuth credential with new redirect URIs.
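To make the fixed set concrete, here is a small sketch that enumerates the pool names and endpoints for one release. The function names are hypothetical, the pool size of 6 is inferred from the n00 to n05 range above, and the dotted version is written with a dash to match the example endpoint.

```python
def deployment_names(version, pool_size=6):
    """Return the fixed pool of names for a release, e.g. "0.4".

    Dots are replaced with dashes so the name is usable as a hostname
    label, matching the kf-v0-4-n00 example endpoint.
    """
    tag = version.replace(".", "-")
    return [f"kf-v{tag}-n{i:02d}" for i in range(pool_size)]


def endpoint_url(name, project="kubeflow-ci"):
    """Return the Cloud Endpoints URL for a deployment name."""
    return f"https://{name}.endpoints.{project}.cloud.goog/"


if __name__ == "__main__":
    for name in deployment_names("0.4"):
        print(endpoint_url(name))
```

Because the list depends only on the release and the pool size, the redirect URIs on the OAuth credential would only need to be set once per release.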

#269 created a Python script to deploy Kubeflow.

The next steps would be:

  • Update the script in #269 to add logic to recycle the deployments
    * Check whether there is an unused name among kf-vX.Y-n00, ..., kf-vX.Y-n05 and, if there is, use it
    * Otherwise, delete the oldest deployment and create a new one using its name.

  • Create a cron job to run the above script and redeploy using the latest commit for each release

  • Create a simple Flask app to provide information about all the different deployments
    * E.g. so we can see when each endpoint was last updated

  • Recycle SSL certificates for each endpoint
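The recycling rule in the first item could look roughly like the sketch below. It is not the script from #269; `pick_name` and its inputs are hypothetical, and it assumes the caller has already listed the existing deployments with their creation times.

```python
from datetime import datetime


def pick_name(pool, existing):
    """Choose which deployment name to (re)use.

    pool:     ordered list of allowed names, e.g. kf-v0-4-n00 ... n05
    existing: dict mapping deployment name -> creation datetime
    Returns (name, must_delete); must_delete is True when the name
    belongs to an existing deployment that has to be removed first.
    """
    if not pool:
        raise ValueError("pool must contain at least one name")

    # Prefer a name that is not currently in use.
    for name in pool:
        if name not in existing:
            return name, False

    # Every name is taken: recycle the oldest deployment in the pool.
    in_pool = {name: existing[name] for name in pool}
    oldest = min(in_pool, key=in_pool.get)
    return oldest, True


if __name__ == "__main__":
    pool = ["kf-v0-4-n00", "kf-v0-4-n01"]
    existing = {
        "kf-v0-4-n00": datetime(2019, 1, 2),
        "kf-v0-4-n01": datetime(2019, 1, 1),
    }
    print(pick_name(pool, existing))
```

Deleting the oldest deployment gives tests that started on newer deployments the most time to finish before their cluster is recycled.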

Boskos might be useful for managing clusters.

Remaining tasks for the cronjob itself:

  • Add a version snapshot of the GitHub repos (kubeflow/testing).
  • Clean up Dockerfile dependencies.
  • Label deployed clusters (to identify when each deployment was made).
  • Add garbage collection of service account role bindings to the workflow.
  • Add a workflow to create a git PR.
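For the labeling item, one possible approach is to encode the deployment time as a label value. The sketch below assumes the label is a GCP resource label, whose values are limited to lowercase letters, digits, underscores, and dashes; the timestamp format and function names are hypothetical.

```python
from datetime import datetime, timezone

# Digits and a dash only, so the value fits the label character rules.
LABEL_FORMAT = "%Y%m%d-%H%M%S"


def deployment_time_label(when):
    """Format a timezone-aware datetime as a UTC label value."""
    if when.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return when.astimezone(timezone.utc).strftime(LABEL_FORMAT)


def parse_deployment_time_label(value):
    """Inverse of deployment_time_label; returns an aware UTC datetime."""
    return datetime.strptime(value, LABEL_FORMAT).replace(tzinfo=timezone.utc)


if __name__ == "__main__":
    print(deployment_time_label(datetime.now(timezone.utc)))
```

A value in this form sorts chronologically as a plain string, which would make it easy to find the oldest deployment when recycling.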

Some follow-on issues:
#314 - Use self-signed certificates to avoid Let's Encrypt quota issues
#273 - E2E tests need a way to get the cluster to use
#316 - Web app to auto redirect to latest cluster

I think we can close this issue once the auto-deployed endpoints are working. I think #314 is the only blocking issue for that.

This is working for master.

For 0.4 it looks like there was a typo in the cron job that should be fixed by #327.

#315 is tracking fixing this for 0.4. So I'm going to mark this as fixed.