SwissDataScienceCenter / amalthea

A kubernetes operator for spawning and exposing jupyter servers

Automatically remove sessions in Pending state

pameladelgado opened this issue · comments

Sessions that have been stuck in a Pending/Errored state for a threshold amount of time should be removed automatically.

That's a good point. I think we can also define a maximum number of restarts (e.g. 10?) after which the session is deleted.

Some more details and requirements:

  • there should be a flag in the values file to enable or disable this
  • the limits for how long a session may stay in a "stuck" state should also be configurable; see the implementation for culling, since this is just an alternative way of culling
  • the "stuck" states are failed and starting, out of the choices in the state enum

In a bit more detail (and almost identical to the culling), this would work like this:

  • add a parameter to the "culling" section of the CRD that sets the limit on how long a session can be in the starting state before it is culled
  • add another parameter to the "culling" section of the CRD that sets the limit on how long a session can be in the failed state before it is culled
  • in the status section of the jupyterserver manifest, add two fields, startingSince and failedSince, each containing an ISO 8601 timestamp (in UTC) of when the session entered that state
  • update the kopf session state handler to write the two timestamps in the session status
  • add another kopf timer, similar to the culling timer, that checks status.failedSince on the manifest, compares it to culling.maxFailedAge, and removes the server if the conditions are met
  • repeat the above for the "starting" state; you can re-use the same kopf timer or add a separate one
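
As a rough illustration of the timestamp bookkeeping described above, here is a minimal sketch in plain Python. It is not taken from the amalthea code base: the function name `status_timestamps_patch` is hypothetical, the state names "starting" and "failed" are written as plain strings rather than the real state enum, and clearing a field by setting it to `None` is an assumption about how the status patch would be applied. The idea is that the state handler would call something like this and merge the result into the session status.

```python
from datetime import datetime, timezone
from typing import Optional


def utc_now_iso() -> str:
    """Current time as an ISO 8601 string in UTC."""
    return datetime.now(timezone.utc).isoformat()


def status_timestamps_patch(
    new_state: str, status: dict, now: Optional[str] = None
) -> dict:
    """Return the status fields to change when a session is seen in new_state.

    Hypothetical helper: the field names startingSince and failedSince come
    from the issue; everything else is an assumption made for illustration.
    """
    now = now or utc_now_iso()
    patch = {}
    for state, field in (("starting", "startingSince"), ("failed", "failedSince")):
        if new_state == state:
            # Record the timestamp only on entry, so that repeated handler
            # calls do not reset the clock while the session stays stuck.
            if not status.get(field):
                patch[field] = now
        elif status.get(field):
            # The session left this state, so clear the stale timestamp.
            patch[field] = None
    return patch
```

Keeping the existing timestamp while the state is unchanged matters here, because the handler may fire many times for the same state and the age has to be measured from the first observation.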

Tips:

  • all state (if any is needed) is in the manifest
  • a value of 0 for a culling threshold means that the session will not be culled for that specific case, the same as for culling based on idleness
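
The check that a timer would perform, including the "0 disables culling" rule, could look roughly like the sketch below. This is an illustration under assumptions, not the actual implementation: `should_cull` is a hypothetical name, and the threshold is assumed to be expressed in seconds, which the issue does not specify. The same function would serve for both status.failedSince against culling.maxFailedAge and the equivalent pair for the starting state.

```python
from datetime import datetime, timezone
from typing import Optional


def should_cull(
    since: Optional[str], max_age_seconds: int, now: Optional[datetime] = None
) -> bool:
    """Decide whether a session has been stuck in a state for too long.

    since is the ISO 8601 timestamp from the status (for example failedSince),
    max_age_seconds is the configured threshold (assumed to be in seconds).
    """
    # A threshold of 0 disables culling for this case, and a missing
    # timestamp means the session is not currently in the state.
    if max_age_seconds <= 0 or not since:
        return False
    now = now or datetime.now(timezone.utc)
    entered = datetime.fromisoformat(since)
    if entered.tzinfo is None:
        # Treat naive timestamps as UTC, as the issue asks for UTC times.
        entered = entered.replace(tzinfo=timezone.utc)
    return (now - entered).total_seconds() >= max_age_seconds


# Example: a session that failed ten minutes ago, with a five minute limit.
now = datetime(2024, 1, 1, 0, 10, tzinfo=timezone.utc)
print(should_cull("2024-01-01T00:00:00+00:00", 300, now))  # True
print(should_cull("2024-01-01T00:00:00+00:00", 0, now))    # False (disabled)
```

Because the decision depends only on values stored in the manifest, the timer itself needs no state of its own, which matches the first tip above.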