beacon-biosignals / Ray.jl

Julia API for Ray

Roll back GlobalStateAccessor/GCSClient changes, re-test, and tag release

kleinschmidt opened this issue

In an attempt to bring Ray.jl into alignment with upstream ray trunk, we replaced the "log file parsing" method of discovering connection parameters with the blessed method of using the GlobalStateAccessor. In the process we also switched the backend of the JuliaGCSClient from the PythonGCSClient to the C++ GCSClient. Somewhere along the way, mysterious segfaults started to appear when running jobs on our Kubernetes cluster (#226).

Given Beacon's current priorities, we can't dedicate the engineering effort to understanding the root cause, but we still want to leave a release in a good state so that at least our internal users can use Ray with some level of confidence. To do that, I propose we:

  • roll back these changes from
  • merge in-flight PRs
  • re-test everything to make sure it all still works!
    • CI
    • local machine benchmark workload (beacon-internal)
    • k8s cluster benchmark workload (beacon-internal)
  • cut a release
  • file issues to follow up on the root cause of GSA/GCS-related segfaults on k8s