tensorflow / build

Build-related tools for TensorFlow

Container Road Map

angerson opened this issue

Road Map for Docker Containers

This is the same roadmap document that I'm using internally, with the internal bits taken out.

I am forcing these containers to get continuous support by using them for TF's internal CI: if they don't work, then our tests don't work. While I'm getting that ready during Q4 and Q1, I'm explicitly avoiding features that the TF team itself is not going to use; those would be dead on arrival unless we set up more testing for them, and I don't have the cycles to consider that yet.

TF Nightly Milestone - Q4 Q1

Goal: Replicable container build of our tf-nightly Ubuntu packages

  • Containers can build tf-nightly package
  • SIG Build repository explains how to build tf-nightly package in Containers
  • Documentation exists on how to make changes to the containers
  • Suite of Kokoro jobs exists that publishes 80%-identical-to-now tf-nightly via containers
  • TF-nightly is officially built with the new containers
  • Documentation exists on how to use and debug containers
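
To make this concrete, here is a minimal sketch of what a tf-nightly build in one of these containers could look like. The image name/tag, mount paths, and packaging step are illustrative placeholders, not the final design:

    # Sketch only: image name/tag, paths, and flags are illustrative.
    docker pull tensorflow/build:latest-python3.9
    docker run --rm -v "$PWD:/tf/tensorflow" -w /tf/tensorflow \
        tensorflow/build:latest-python3.9 bash -c '
      # Build the pip-package tooling, then assemble the nightly wheel.
      bazel build //tensorflow/tools/pip_package:build_pip_package
      bazel-bin/tensorflow/tools/pip_package/build_pip_package /tf/pkg --nightly_flag
    '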

Release Test Milestone - Q4 Q1

Goal: Replicable container builds of our release tests, supporting each release

  • Containers can run same-as-now Nightly release tests
  • SIG Build repository explains how to run release tests as we do
  • Suite of CI jobs exists that matches current rel/nightly jobs
  • Existing release jobs replaced (but reversible if needed) by Container-based equivalent
  • Containers may be maintained and updated separately for TF release branches
  • Containers used for nightly/release libtensorflow and ci_sanity (now "code_check") jobs
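
A rough sketch of the intent here, assuming hypothetical per-release-branch image tags and an illustrative test target:

    # Sketch only: the branch tag and test target are illustrative.
    docker pull tensorflow/build:r2.x-python3.9   # hypothetical release-branch tag
    docker run --rm -v "$PWD:/tf/tensorflow" -w /tf/tensorflow \
        tensorflow/build:r2.x-python3.9 \
        bazel test --config=opt //tensorflow/python/...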

CI & RBE Milestone - Q4 Q1/Q2

Goal: The main tests and our RBE tests use the same Docker container, updated in one place

  • Containers support internal presubmit/continuous build behavior
  • Containers are used in internal buildcop-monitored, DevInfra-owned presubmit/continuous jobs
  • Containers can be used in RBE
  • Containers are used as RBE environment for internal buildcop-monitored, DevInfra-owned jobs
  • DevInfra's GitHub-side presubmit tests use the containers
  • Containers are published on gcr.io
  • There is an easy way to verify that a change to the containers will not break the whole internal test suite
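
A rough sketch of how one image could back both uses, assuming it is published on gcr.io and that a .bazelrc "rbe" config points the remote execution platform at it. The gcr.io path, project, and flag values are all illustrative:

    # Sketch only: gcr.io path, project, and flag values are illustrative.
    docker pull gcr.io/tensorflow-build/ci:latest   # same image the CI jobs use
    bazel test \
        --config=rbe \
        --remote_executor=grpcs://remotebuildexecution.googleapis.com \
        --remote_instance_name=projects/some-project/instances/default_instance \
        //tensorflow/python/...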

Forward Planning Milestone - Q2

Goal: Establish clear plan for any future work related to these containers. This is internal team planning stuff so I've removed it.

Downstream & OSS Milestone - Q2/Q3

Goal: Downstream users and custom-op developers use the same containers as our CI

  • SIG Addons / SIG IO use these Containers (or derivatives) instead of the old custom-op ones
  • Custom-op documentation migrated to SIG Build repository
  • Resolve: what to do about inconvenient default packages for e.g. SIG Addons (keras-nightly, etc.)
  • Resolve: what to do about inconveniently large image sizes, e.g. GPU content that some users don't need
  • Docker-related documentation on tensorflow.org replaced with these containers
  • "devel" containers deprecated in favor of SIG Build containers

commented

Thanks for sharing the roadmap.
The steps that mention "internal/our" requirements can be a little hard to follow, but I think that's expected.

Looking at the new GitHub Actions we have here in the repository, it is very clear what we are doing and when on the OSS side, at least within the limits of what we have orchestrated with GitHub Actions.

When we mix OSS recipes/code with internal, non-visible steps (e.g. orchestration, args like commits for nightly, etc.), the machinery can be hard to follow if the non-visible part is not compensated for by documentation (e.g. what event/cron starts the scripts, what the script chains are, what the args are, etc.).

That compensating documentation, though, will be at constant risk of going stale: internal teams have direct visibility into internal changes, so their operations will not be directly impacted by outdated public documentation.

But since GitHub Actions relies on a well-known and popular YAML dialect, and GitHub users/contributors/developers are generally skilled with it, do you think it would be possible to set up TF-owned self-hosted GitHub Actions runners on Google Cloud? That would give us a complete overview of the TF OSS build and orchestration, and probably also a bit of autonomy for the SIG, without adding too much overhead to the system.

A Google Cloud team maintains the tooling to (auto)deploy self-hosted GitHub Actions runners on GKE:
https://github.com/terraform-google-modules/terraform-google-github-actions-runners
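
As a sketch of what the SIG would gain, a workflow could target such runners just by changing the runner labels; everything here (labels, image, schedule, target) is illustrative:

    # Sketch only: runner labels, image, and build target are illustrative.
    cat > .github/workflows/nightly-build.yml <<'EOF'
    name: nightly-build
    on:
      schedule:
        - cron: '0 4 * * *'
    jobs:
      build:
        runs-on: [self-hosted, linux, gke]   # labels of the GKE-hosted runners
        container: tensorflow/build:latest-python3.9
        steps:
          - uses: actions/checkout@v2
          - run: bazel build //tensorflow/tools/pip_package:build_pip_package
    EOF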