how to monitor worker processes

Question

daichi5 opened this issue 2 years ago

We plan to use 'delayed' as an asynchronous processing worker on Kubernetes pods.
So we need a way to health-check the worker process. We currently use the following livenessProbe:

livenessProbe:
    exec:
      command:
        - bash
        - -lc
        - '[[ $(ps aux | grep delayed | grep -v grep | wc -l) -gt 0 ]]'

However, if there is a better method, we would like to adopt it.
Do you use any other methods?
I ask because I saw in other issues that you use Kubernetes.

Nathan Griffith · Answer 1 · Tue Nov 29 2022 02:15:24 GMT+0800 (China Standard Time)

Hi @daichi5! We currently don't run our worker pods with a readinessProbe or livenessProbe config, largely because the processes don't accept any outside HTTP traffic, so we don't need the health checks for load balancing purposes. Instead we rely on the default behavior, which is that if the main process (at PID 1) exits, the container restarts.

We also use our cron/scheduler process to enqueue a background job once per minute, and that job emits a metric that we can monitor to alert ourselves if there are no workers running. But this exists outside of our k8s infrastructure, and we haven't shipped a generic version of this behavior, since it depends on the specifics of our internal monitoring/alerting infrastructure.
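
As a rough illustration of the heartbeat idea described above (not the actual internal implementation, which is not shown in this thread), the staleness check might look like the following Python sketch. The names `HeartbeatMonitor`, `record_beat`, and `workers_alive`, and the 180-second threshold, are all hypothetical and chosen only for illustration. A real setup would emit a metric to a monitoring system instead of keeping the timestamp in memory.

```python
import time
from typing import Optional

# Hypothetical threshold: the heartbeat job is enqueued once per minute,
# so tolerate a few missed beats before treating the workers as down.
STALE_AFTER_SECONDS = 180.0


class HeartbeatMonitor:
    """Tracks when the heartbeat job last ran and reports staleness."""

    def __init__(self, stale_after: float = STALE_AFTER_SECONDS) -> None:
        self.stale_after = stale_after
        self.last_beat: Optional[float] = None

    def record_beat(self, now: Optional[float] = None) -> None:
        """Called by the heartbeat job each time a worker executes it."""
        self.last_beat = time.time() if now is None else now

    def workers_alive(self, now: Optional[float] = None) -> bool:
        """Return True if a worker ran the heartbeat job recently enough."""
        if self.last_beat is None:
            # No beat has ever been recorded, so no worker has run the job.
            return False
        current = time.time() if now is None else now
        return (current - self.last_beat) <= self.stale_after
```

The point of the design is that the signal comes from a job actually being worked off, so it also catches workers that are running but stuck, which a process-existence check would miss.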

sawadaichi · Answer 2 · Tue Nov 29 2022 10:44:56 GMT+0800 (China Standard Time)

Hi @smudge! Thanks for the response.
I now understand how you operate your worker pods.

> Instead we rely on the default behavior, which is that if the main process (at PID 1) exits, the container restarts.

As you said, a health check may not be necessary because the container will be restarted if the main process exits.
However, we have enabled shareProcessNamespace, so our situation may be a little different.

> We also use our cron/scheduler process to enqueue a background job once per minute, and that job emits a metric that we can monitor to alert ourselves if there are no workers running

The idea of this monitoring job is very helpful.
I think we'll try a similar approach to manage our worker processes.
Thank you!