Proposal: Agent healthcheck endpoint to advance production-readiness of deployed agents

Question

jakedoublev opened this issue 7 months ago

Is your feature request related to a problem? Please describe.
As it stands, the spec has no universal standard for liveness/readiness of an Agent running in the context of a system. Until I receive a bad response to a task-related request from a vendor Agent, model, or custom Agent, I don't know there are problems with my Agent (within the context of the spec).

At a high level, I believe an open protocol that applies across all Agent implementations should treat an Agent like any other service within a stack and consider things like observability, events/metrics, shutdown, etc. This proposal, however, is limited in scope to a binary healthy/unhealthy discovery mechanism.

Describe the solution you'd like
Kubernetes actually has a great solution in the form of two health check probes, liveness and readiness. In that context, the liveness healthcheck returns either a 200 or an unhealthy HTTP status like 400/500 to indicate whether the service is alive, and the readiness healthcheck does the same but also ensures all dependencies are alive (such as a connection to a database).

An Agent that depends on a connection to a single model arguably has no separate external dependencies, since it could be considered not alive/unhealthy whenever it has no connection to its single LLM. But I think we could easily see a future where a single orchestrator Agent facilitates interactions between multiple models, and a truly universally applied spec needs to consider such circumstances.
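To make the liveness/readiness distinction concrete, here is a minimal sketch of how an Agent might compute the two statuses. Everything in it is an assumption for illustration, not part of the Agent Protocol: the function names, the idea of passing dependency checks as callables, and the choice of 503 as the unhealthy status (the proposal only says "400/500").

```python
from typing import Callable, Dict, Tuple

# Hypothetical type: a dependency check returns True when the dependency responds.
DependencyCheck = Callable[[], bool]


def liveness_status() -> int:
    """If the process can run this function at all, it is alive: report 200."""
    return 200


def readiness_status(
    dependencies: Dict[str, DependencyCheck],
) -> Tuple[int, Dict[str, bool]]:
    """Report 200 only if every dependency check passes, otherwise 503.

    503 is an assumed choice of unhealthy status for this sketch.
    """
    results: Dict[str, bool] = {}
    for name, check in dependencies.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises is treated the same as one that fails.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


# Example: an orchestrator Agent with two models, one of them unreachable.
status, detail = readiness_status(
    {"model_a": lambda: True, "model_b": lambda: False}
)
```

With a single-model Agent the dictionary would hold one entry, and readiness collapses to roughly the same answer as liveness, which matches the point above about the single-LLM case.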

An endpoint /ap/v1/agent/health_check would be a good place to capture health-related inquiries and I'd love to hear more discussion from there about:

Whether liveness and readiness are both requirements
Whether the spec needs to go beyond providing a dedicated endpoint to query health, or whether everything past that can be left as an Agent-implementation concern

I think an implementation that cannot tell whether a non-200 response is due to my Agent failing to start or to a bug in my implementation is a poor developer experience. Agent implementations will solve for this on their own, and without a protocol-driven spec their solutions will undoubtedly differ.

Describe alternatives you've considered
The alternative is mostly just not providing a way to query an Agent's healthy state, which is the current status of the Agent Protocol. Agent health is the responsibility of the Agent-specific implementation and not of the protocol, which leads to a lack of consistency and will promote vendor lock-in should Agents evolve to 3rd party SaaS tooling.

Additional context
It could be a worthwhile discussion to extend this conversation to things like Agent deployment versioning (/ap/v1/agent/version) and other deployment state/service context discovery as well, but again, this proposal is limited in scope to solely a health check.

J. Zane Cook · Answer 1 · Thu Oct 12 2023 03:40:31 GMT+0800 (China Standard Time)

I'm curious what your thoughts are on the info endpoint, I posted a potential schema for it here in Issue #39.

It didn't go over a health check for readiness and liveness, but I think that it could technically be added there. Or some other type of standardized status message.

The schema we were considering does include a version for both the Agent Protocol and the Agent itself, which would be useful for clients.

Jake Van Vorhis · Answer 2 · Thu Oct 12 2023 07:50:18 GMT+0800 (China Standard Time)

That's an interesting proposal @jzanecook. I have some thoughts about that info proposal that are distinctly different from this one, and I believe they are and should be separate concerns. I think it's worthwhile deciding earlier rather than later how prescriptive the agent protocol needs to be. For what it's worth, I prefer open/extensible protocols rather than restrictive ones, and I also believe a mechanism in the protocol for Agent metadata is valuable but is a separate concern from liveness or readiness.

I would find a lot of value in being able to GET liveness with a 200 response as good enough to detect liveness of an Agent. I think the readiness check is distinctly different because it indicates that dependent resources are also alive and responsive. Agent metadata is also valuable but a GET for info/metadata should be different than a GET for a factor of health. My preference in RESTful architectures (since this protocol appears to assume HTTP so far) is always to keep separate endpoints for separate resources.
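As a rough sketch of what separate endpoints could look like over HTTP, the example below serves liveness and readiness on two distinct paths and includes a small client that treats a 200 from the liveness path as "good enough". Only the /ap/v1/agent/health_check prefix comes from the proposal; the /liveness and /readiness suffixes, the JSON bodies, the 503 status, and the dependencies_ready placeholder are all hypothetical and would need to be settled by the spec discussion.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Prefix is from the proposal; the two suffixes are hypothetical.
LIVENESS_PATH = "/ap/v1/agent/health_check/liveness"
READINESS_PATH = "/ap/v1/agent/health_check/readiness"


def dependencies_ready() -> bool:
    """Placeholder: a real Agent would ping its models, databases, etc."""
    return True


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == LIVENESS_PATH:
            # Answering at all is the liveness signal.
            self._reply(200, {"status": "alive"})
        elif self.path == READINESS_PATH:
            ready = dependencies_ready()
            # 503 for "not ready" is an assumption of this sketch.
            self._reply(200 if ready else 503,
                        {"status": "ready" if ready else "not ready"})
        else:
            self._reply(404, {"error": "not found"})

    def _reply(self, code, body):
        payload = json.dumps(body).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_server(port: int = 0) -> HTTPServer:
    """Start the health server on localhost in a background thread."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def is_alive(base_url: str, timeout: float = 2.0) -> bool:
    """Client side: a 200 from the liveness path counts as alive."""
    try:
        with urllib.request.urlopen(base_url + LIVENESS_PATH,
                                    timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, and timeouts
        return False
```

A client would call start_server() on the Agent side and is_alive("http://host:port") on the caller side. Agent metadata (the info endpoint discussed in Issue #39) is deliberately absent here, in keeping with the preference for separate endpoints for separate resources.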