Opt-in diagnostic log upon turning unhealthy
jamesross3 opened this issue
Right now, triggering diagnostic logs is a manual process (usually involving getting on the host where the process is running and then running pkill -3 <process-name>). The diagnostic log is also time-sensitive (i.e. I want it as soon as I go unhealthy instead of after whatever period of time it takes me to run that pkill command). It'd be nice if we could expose an option for outputting a diagnostic log as soon as a process's health check returns an error state.
@bmoylan @nmiyake for thoughts
I am happy to tackle implementation if we want to go forward here
We would probably want some rate limiting to protect against flaky results, but otherwise this sounds reasonable.
At what layer would you add the code? Would this be triggered by w-g-s when the /health endpoint is hit or would it be in one of the source utilities like the reporter?
Also, would it be opt-in or opt-out? Would it be configurable per-check?
I'm imagining this being in the same place as we do the "Health status code changed" logic (so here). I am leaning towards starting with opt-in. I am unsure how configurable this should be. On the one hand, I don't really like the idea of enabling consumers to only enable this for some checks, especially given that some processes' health check keys vary across their lifetimes (maybe that's indicative of an abuse of health checks). On the other hand, I like the idea of being able to trigger diagnostic logs only for certain cases.
Something I do not want a diagnostic log for:
checks:
  key1:
    state: ERROR
    message: uhoh
    params:
      error: "request failed: connection refused"
Something I do want a diagnostic log for:
checks:
  key1:
    state: ERROR
    message: health of some job we expect to complete in a certain amount of time
    params:
      error: "job failed to complete: context canceled"
Perhaps this is too much configurability (you suddenly allow developers to check for the presence of specific keys and values in the untyped params map of a health check result), but it does enable us to avoid emitting useless diagnostic logs.
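To make the trade-off concrete, here is a rough sketch of what matching on params could look like. The `checkResult` type is a simplified stand-in for a health check result, and `dumpPredicate`, `errorParamContains`, and `shouldDump` are hypothetical names; none of this is existing witchcraft-go-server API.

```go
package main

import "strings"

// checkResult is a simplified, hypothetical stand-in for a health check result.
type checkResult struct {
	State   string
	Message string
	Params  map[string]interface{}
}

// dumpPredicate decides whether a single check result warrants a diagnostic log.
type dumpPredicate func(key string, result checkResult) bool

// errorParamContains matches results in the ERROR state whose "error" param
// is a string containing substr. Params is untyped, so the value is
// type-asserted and non-string values never match.
func errorParamContains(substr string) dumpPredicate {
	return func(_ string, r checkResult) bool {
		if r.State != "ERROR" {
			return false
		}
		msg, ok := r.Params["error"].(string)
		return ok && strings.Contains(msg, substr)
	}
}

// shouldDump reports whether any check satisfies the predicate.
func shouldDump(checks map[string]checkResult, pred dumpPredicate) bool {
	for key, r := range checks {
		if pred(key, r) {
			return true
		}
	}
	return false
}
```

With `errorParamContains("context canceled")`, the first example above (connection refused) would be skipped and the second (job failed to complete) would trigger a diagnostic log. The cost is exactly the concern raised here: the predicate is coupled to string contents of an untyped map.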
Yup, makes sense. I think you could implement this with some kind of handler pattern, like
type HealthStatusChangeHandler func(ctx context.Context, prevStatus, currStatus health.HealthStatus)
Then something like:
- Track a changeHandlers []HealthStatusChangeHandler on the healthHandlerImpl
  - Run these any time checksDiffer
  - Convert logIfHealthChanged() into a handler added to list by default
- Add a WithHealthStatusChangeHandler to the server builder that appends to the list
- Add a ThreadDumpOnError handler somewhere that can be opted in via witchcraft.NewServer().WithHealthStatusChangeHandler(somewhere.ThreadDumpOnError())
  - this is how you opt-in to the basic version of what you describe above
Down the road the framework exists if you want to implement arbitrarily-complex handlers in downstream code where the maintenance is not my problem 😄
changes merged and released in 1.26.1