Opt-in diagnostic log upon turning unhealthy

Question

jamesross3 opened this issue 4 years ago

Right now, triggering diagnostic logs is a manual process: it usually involves getting on the host where the process is running and then running pkill -3 <process-name>. The diagnostic log is also time-sensitive: I want it as soon as the process goes unhealthy, not after however long it takes me to run that pkill command. It'd be nice if we could expose an option for outputting a diagnostic log as soon as a process's health check returns an error state.

@bmoylan @nmiyake for thoughts
I am happy to tackle the implementation if we want to go forward here.

Brad Moylan · Answer 1 · Fri Jul 10 2020 04:29:03 GMT+0800 (China Standard Time)

We would probably want some rate limiting to protect against flaky results, but otherwise this sounds reasonable.

At what layer would you add the code? Would this be triggered by w-g-s (witchcraft-go-server) when the /health endpoint is hit, or would it be in one of the source utilities like the reporter?

Brad Moylan · Answer 2 · Fri Jul 10 2020 04:31:53 GMT+0800 (China Standard Time)

Also, would it be opt-in or opt-out? Would it be configurable per-check?

James Ross · Answer 3 · Fri Jul 10 2020 04:38:47 GMT+0800 (China Standard Time)

I'm imagining this being in the same place as we do the "Health status code changed" logic (so here). I am leaning towards starting with opt-in. I am unsure how configurable this should be. On the one hand, I don't really like the idea of enabling consumers to only enable this for some checks, especially given that some processes' health check keys vary across their lifetimes (maybe that's indicative of an abuse of health checks). On the other hand, I like the idea of being able to trigger diagnostic logs only for certain cases.
Something I do not want a diagnostic log for:

checks:
  key1:
    state: ERROR
    message: uhoh
    params:
      error: "request failed: connection refused"

Something I do want a diagnostic log for:

checks:
  key1:
    state: ERROR
    message: health of some job we expect to complete in a certain amount of time
    params:
      error: "job failed to complete: context canceled"

Perhaps this is too much configurability (you suddenly allow developers to check for the presence of specific keys and values in the untyped params map of a health check result), but it does enable us to avoid emitting useless diagnostic logs.
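The kind of per-result filtering described above could be sketched as a predicate over check results. This is only an illustration: checkResult is a simplified, hypothetical stand-in for a real health check result, and errorParamContains and anyMatches are invented names. It treats the untyped params map the way the two examples do, by looking for a substring in the "error" entry.

```go
package main

import "strings"

// checkResult is a simplified, hypothetical stand-in for a health check result.
type checkResult struct {
	State   string
	Message string
	Params  map[string]interface{}
}

// dumpPredicate decides whether a single check result warrants a diagnostic log.
type dumpPredicate func(checkResult) bool

// errorParamContains matches checks in the ERROR state whose "error" param
// is a string containing substr.
func errorParamContains(substr string) dumpPredicate {
	return func(r checkResult) bool {
		if r.State != "ERROR" {
			return false
		}
		// The params map is untyped, so a non-string or missing value never matches.
		s, ok := r.Params["error"].(string)
		return ok && strings.Contains(s, substr)
	}
}

// anyMatches reports whether at least one check satisfies pred.
func anyMatches(checks map[string]checkResult, pred dumpPredicate) bool {
	for _, r := range checks {
		if pred(r) {
			return true
		}
	}
	return false
}
```

With a predicate such as errorParamContains("context canceled"), the first example (connection refused) would not trigger a dump and the second (context canceled) would.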

Brad Moylan · Answer 4 · Fri Jul 10 2020 05:00:06 GMT+0800 (China Standard Time)

Yup, makes sense. I think you could implement this with some kind of handler pattern, like

type HealthStatusChangeHandler func(ctx context.Context, prevStatus, currStatus health.HealthStatus)

Then something like:

- Track a changeHandlers []HealthStatusChangeHandler on the healthHandlerImpl
  - Run these any time checksDiffer
  - Convert logIfHealthChanged() into a handler added to list by default
- Add a WithHealthStatusChangeHandler to the server builder that appends to the list
- Add a ThreadDumpOnError handler somewhere that can be opted in via witchcraft.NewServer().WithHealthStatusChangeHandler(somewhere.ThreadDumpOnError())
  - this is how you opt-in to the basic version of what you describe above

Down the road the framework exists if you want to implement arbitrarily-complex handlers in downstream code where the maintenance is not my problem 😄

James Ross · Answer 5 · Thu Jul 16 2020 02:16:50 GMT+0800 (China Standard Time)

Changes merged and released in 1.26.1.