oracle / weblogic-kubernetes-operator

WebLogic Kubernetes Operator

Home Page: https://oracle.github.io/weblogic-kubernetes-operator/

Feature Request: startup probes

belfo opened this issue · comments

It would be nice to have startup probes (alongside readinessProbe and livenessProbe).
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes
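
For context, a startup probe in a plain Kubernetes pod spec looks roughly like the sketch below (illustrative only; the check, port, and thresholds are placeholders). While the startup probe has not yet succeeded, the kubelet holds off the liveness and readiness probes:

startupProbe:
  httpGet:
    path: /weblogic/ready   # placeholder check; any existing health endpoint could be reused
    port: 7000              # placeholder port
  periodSeconds: 10
  failureThreshold: 30      # allows up to 30 x 10s = 300s of startup time before the container is restarted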

Hi @belfo,

Thanks for reaching out. We had considered adding a startup probe but hadn't thought that it was necessary because our current liveness probe is designed to succeed while a WebLogic Server instance is starting.

While the readiness probe is an HTTP probe that attempts to connect to the standard WebLogic "ReadyApp" endpoint, the liveness probe is instead a small script that is executed in the container.

This script validates that the node manager process is running and that the node manager is reporting a WebLogic Server state other than FAILED_NOT_RESTARTABLE. This means that the liveness probe will pass while the state is still STARTING.
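
For illustration, the readiness probe in the generated pod spec is roughly equivalent to the following (a sketch only, not the exact spec the operator emits; 7000 is just an example listen port):

readinessProbe:
  httpGet:
    path: /weblogic/ready   # the standard WebLogic "ReadyApp" endpoint
    port: 7000              # example; the server's listen port is used
  periodSeconds: 10
  failureThreshold: 3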

Or, do you have a different intention such as wanting the liveness probe to fail if the server instance isn't in the Ready condition after some timeout?

In fact I had to put a very large initialDelaySeconds on the liveness probe, as it was killing the container before WebLogic had started (it was still in the operator part of startup), so the container was restarting continuously (this is clearly caused by another issue in our cluster, which is too slow).
A startup probe would at least avoid this, since it holds off the liveness probe, and the restarts it triggers, until the container is marked as started.

My solution works (for me), but it could mean the pod takes longer than necessary to become ready once the slowness is fixed.

Do you happen to have any logs from the servers that were killed prior to your setting the large initialDelaySeconds?

I haven't kept the logs, but what I do have is the exit code:

Containers:
  weblogic-server:
    Container ID:   containerd://f2baeb6839bb88e66448d84dcc7baddf862801caa665f598542573ad61538d6a
    Image:          nexus.priv:9003/middleware/weblogic/wls:12.2.1.4.0.bcprov.vap.patch2
    Image ID:       nexus.itsmtaxud.priv:9003/middleware/weblogic/wls@sha256:f2af756c4df2358f27cb8f51c8469d83bea4594a6a5ec95be830d5d837f086ed
    Port:           7000/TCP
    Host Port:      0/TCP
    Command:
      /weblogic-operator/scripts/startServer.sh
    State:          Running
      Started:      Fri, 16 Sep 2022 10:21:06 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Fri, 16 Sep 2022 10:19:16 +0200
      Finished:     Fri, 16 Sep 2022 10:21:04 +0200
    Ready:          False
    Restart Count:  5

And the events:
17h Normal Started pod/host-admin-server Started container weblogic-server
84m Warning Unhealthy pod/host-admin-server Readiness probe failed: Get "http://192.168.2.32:7000/weblogic/ready": dial tcp 192.168.2.32:7000: connect: connection refused
17h Warning Unhealthy pod/host-admin-server Liveness probe failed: @[2022-09-15T16:21:18.419112699Z][livenessProbe.sh:77][SEVERE] WebLogic NodeManager process not found.
10h Normal Killing pod/host-admin-server Container weblogic-server failed liveness probe, will be restarted
17h Warning FailedPreStopHook pod/host-admin-server Exec lifecycle hook ([/weblogic-operator/scripts/stopServer.sh]) for Container "weblogic-server" in Pod "host-admin-server_dev(040c8163-7417-4bc4-ae2e-283d470e2707)" failed - error: command '/weblogic-operator/scripts/stopServer.sh' exited with 1: /weblogic-operator/scripts/stopServer.sh: line 22: /u01/domains/base_domain/servers/admin-server/logs/admin-server.stop.out: No such file or directory

And the pod log showed output only up to:
[FINE] Exiting encrypt_decrypt_domain_secret

By setting the following high values, I was able to start (I haven't tested any fine-tuning, just put in some big values to be sure, but it took around ~5 min for WebLogic to start):
serverPod:
  livenessProbe:
    initialDelaySeconds: 900
    periodSeconds: 120
    timeoutSeconds: 60
    failureThreshold: 5
  readinessProbe:
    initialDelaySeconds: 300
    periodSeconds: 90
    timeoutSeconds: 60
    failureThreshold: 5
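
For comparison, if the operator exposed a startupProbe stanza alongside livenessProbe and readinessProbe (hypothetical, shown for illustration only), the same ~15-minute startup budget could be granted without also delaying liveness checks on a server that starts quickly, e.g.:

serverPod:
  startupProbe:             # hypothetical field, not part of the serverPod schema at the time of this issue
    periodSeconds: 30
    timeoutSeconds: 60
    failureThreshold: 30    # up to 30 x 30s = 900s for the server to start
  livenessProbe:            # only takes effect once the startup probe has succeeded
    periodSeconds: 45
    timeoutSeconds: 60
    failureThreshold: 3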

It is unusual to see '[livenessProbe.sh:77][SEVERE] WebLogic NodeManager process not found'. The NM is started very soon after the pod starts and before WebLogic Server itself is started, and we (or at least I) have yet to see an example of it crashing. It could be that the pod is somehow taking a very long time to get to the point where it starts an NM, or that the exit information ^^^ is misleadingly reflecting what's happening as the pod is forced to shut down due to the timeout (which I assume would in turn bring down the NM).

I agree with @rjeberhard that a pod log would be helpful here. I think even a log from a successful run would help, as that should help reveal the timings of the pod's startup activity.

Hello @tbarnes-us
Indeed it's unusual; it was related to the resource availability of the underlying K8s nodes.
Once the nodes got more resources (there was a logical limitation on the vCenter), all was good.

But the idea of having startup probes still makes sense from my point of view.
Worst case they are useless; best case they can at least prevent the pod from being restarted when it isn't needed.

A reproducer pod log would help evaluate the idea - e.g. whether the startup probe would help in the first place, and, if so, how the startup probe would need to be coded...

Overall, it is strongly recommended to tune the pod's CPU & memory requests so that at least no attempts are made to start without a sufficient amount of those two resources (see the FAQ). Do you know what the 'logical limitation on the vcenter' was?
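
For reference, the CPU and memory requests can be set on the domain resource; a minimal sketch, assuming serverPod.resources follows the standard Kubernetes ResourceRequirements schema (the values are placeholders, see the FAQ for sizing guidance):

serverPod:
  resources:
    requests:
      cpu: "1"              # placeholder values; size according to the FAQ and actual load
      memory: 2Gi
    limits:
      cpu: "2"
      memory: 4Gi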

I no longer have the issue, so it's hard to reproduce.
But WebLogic was not yet started (probably the same check as the liveness probe could be enough?), with the advantage that a startup probe will not kill the pod until it is marked as started.
The limitation was around 20 GB of RAM & GHz for all the nodes... and we have 8 nodes, so clearly not enough.
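
For what it's worth, reusing the existing liveness check as a startup probe might look roughly like the sketch below in plain Kubernetes terms (the script path is assumed from the other operator scripts shown above; whether and how the operator would expose this under serverPod is exactly what this request is about):

startupProbe:
  exec:
    command:
      - /weblogic-operator/scripts/livenessProbe.sh   # assumed path; same check the liveness probe runs
  periodSeconds: 30
  failureThreshold: 30      # generous budget: liveness and readiness only start once this succeeds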

@rjeberhard @belfo

I recommend closing this Issue because the root problem (lack of allocated resources) has been fixed, and, in my opinion, a wide variety of failures and retries are expected when too few resources are allocated. It can be revisited if/when the problem is reproduced with sufficient data for a full diagnosis (pod logs, etc).

Thoughts?