inadarei / rfc-healthcheck

Health Check Response RFC Draft for HTTP APIs

Home Page: https://inadarei.github.io/rfc-healthcheck/

Do statuses have a 1:1 relationship to HTTP response codes?

JamesUoM opened this issue · comments

I'm currently writing a health check for an application, but I'm a little unclear on the precise relationship between statuses and response codes. The relationship is described as "tightly coupled" but with no further explanation or examples.

Do statuses have a 1:1 relationship to HTTP response codes? I ask because I have cases where finer-grained responses may be useful.

pass = 200
warn = 302, a sub-service returns a warning state
warn = 307, a sub-service returns an error state
fail = 404, running but unavailable
fail = 503, dead

In different scenarios, we may want a load balancer to bracket different response code ranges as healthy, e.g. [200:302] or [200:307]. Assume the load balancer can only monitor the response code. I know we could always write a customizable filter in front of the health check that decides what to tell the load balancer based on the JSON.
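To make the bracketing idea concrete, here is a minimal sketch of a check that sees only the response code. The function name is_healthy and the inclusive (low, high) tuple are assumptions made for illustration; they are not defined by the draft or by any particular load balancer.

```python
from typing import Tuple


def is_healthy(http_code: int, healthy_range: Tuple[int, int]) -> bool:
    """Return True if the code falls inside the inclusive healthy bracket."""
    low, high = healthy_range
    return low <= http_code <= high


# With the bracket [200:302], a 307 response is treated as unhealthy;
# widening the bracket to [200:307] treats the same response as healthy.
strict = is_healthy(307, (200, 302))
lenient = is_healthy(307, (200, 307))
```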

Thanks for the draft; it has been most helpful and timely.

No, the relationship is not 1:1. The spec assumes that certain ranges correspond to certain status values, but no 1:1 mapping exists. Per the specification:

The value of the status field is case-insensitive and is tightly related with
the HTTP response code returned by the health endpoint. For “pass” status,
HTTP response code in the 2xx-3xx range MUST be used. For “fail” status,
HTTP response code in the 4xx-5xx range MUST be used. In case of the “warn”
status, endpoints MUST return HTTP status in the 2xx-3xx range, and
additional information SHOULD be provided, utilizing optional fields of the
response.
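The ranges in the quoted paragraph can be written as a small check. This is only a sketch of that one paragraph: the function name code_allowed_for_status is hypothetical, only the three status values quoted above are handled, and raising on any other value is an assumption rather than something the draft prescribes.

```python
def code_allowed_for_status(status: str, http_code: int) -> bool:
    """Check a (status, HTTP code) pair against the ranges quoted above."""
    normalized = status.lower()  # the status value is case-insensitive
    if normalized in ("pass", "warn"):
        # "pass" and "warn" share the same 2xx-3xx range.
        return 200 <= http_code <= 399
    if normalized == "fail":
        return 400 <= http_code <= 599
    # Assumption: anything else is rejected in this sketch.
    raise ValueError(f"unknown status: {status!r}")
```

Because "pass" and "warn" map to the same range, the response code alone cannot tell them apart; only the body can.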

Your example "warn = 307, a sub-service returns an error state" is possible because a sub-service erroring out doesn't automatically mean the main service is also erroring. Ideally, with proper circuit-breaking, a sub-service erroring out should at worst put the main service in a "warn" state in which it is still reasonably functional.

Some of the mappings shown have serious issues. The main reason is a misunderstanding about what level of information HTTP status codes represent. HTTP status codes mostly represent the "wire". But when transmitting health status information over HTTP, one has to treat HTTP as OSI Layer 5 Session, not OSI Layer 7 Application. Trying to convey information about the content of the health resource instead of the wire of the health resource conflates things in HTTP and may break clients.

Here are the details:

  • 200 does not necessarily mean pass. 200 means that the resource was found and works. The wire (to /health and back again) works. It says nothing about the downstream.
  • 302 is "Found" (previously "Moved Temporarily"), which servers should not really use anymore; they should explicitly use 303 or 307. In any case, a client library like java.net.URL, depending on its configuration, and web browsers will expect a 302, 303, or 307 response to carry a Location: response header field with a URL to which they will automatically repeat the request. Health endpoints should be compatible with standard HTTP client libraries and web browsers.
  • The same situation as for 302 also applies to 307.
  • 404 as a response to GET /health means that /health has no mapped resource on the server. Whether that can be mapped to "running but unavailable" is doubtful. The path to health endpoints is explicitly not standardized. There may be a convention to expect it at /health, but aggregation servers without an explicit health implementation may just forward the health endpoints of their downstream servers and, because there are many of them, must use different endpoints.
  • 503 means dead, but be careful: it may also mean that the load balancer is dead while the service is working. It definitely means that something is fishy and one of the elements in the wire is broken at the HTTP level, but we can't conclude anything else from it. If a server that provides health information replies with 503, it should mean that the server currently cannot provide health information at all. It would be surprising to see this generated by a health endpoint implementation; it would normally be generated by the server library itself or the router.

I hope that this explains why I think that mapping health status to response codes is not a good idea.

@christianhujer you are disagreeing with the wrong part of the specification. It's not just about some of the mapping values; you are questioning the entire approach of using the response code of a single endpoint as an indicator of health for the entire API (several endpoints forming a logical "API" that the health endpoint represents). This is explicitly documented in the RFC:

A health endpoint is only meaningful in the context of the component it indicates the health of. It has no other meaning or purpose. As such, its health is a conduit to the health of the component. Clients SHOULD assume that the HTTP response code returned by the health endpoint is applicable to the entire component (e.g. a larger API or a microservice). This is compatible with the behavior that current infrastructural tooling expects: load-balancers, service discoveries and others, utilizing health-checks.

Yes, it is true that status codes in HTTP per se represent information about the specific URI and don't act as stand-ins for other resources at other URIs. However (!), and this is critical: doing so in the case of health endpoints has been a common practice for decades. For the sake of acknowledging reality and backwards compatibility with the existing tooling, the responsible thing for this RFC is to follow the established pattern.

As a matter of fact, acknowledging this existing pattern is the only reason this RFC even discusses HTTP response codes. In its pure form this RFC is about message format and has nothing to do with HTTP or response codes (it's totally fine to use it with TCP, for instance). But when you have clearly existing industry behavior, we felt it would have been a mistake of omission to not acknowledge it in this RFC. Practicality and backwards compatibility > theoretical purity.

I think the spec is fine. Inferring from GET /health ⇒ 200 that the entire component is fine is reasonable, and it's also fine that GET /health ⇒ 500 means that there are issues and that the body has to be inspected to find out at which level the issues are. After all, /health is a meta-resource that provides information about an entire component, so some inference from its status codes to the entire component is valid.

It is only the reuse of 302 or 307 (redirections) as warnings that will break HTTP clients and cause all sorts of trouble.

OK

Not sure what is special about [302,307]. In this spec "warning" is defined as "pass, but things are getting worse so pay attention", so http response code-wise "pass" and "warning" are the same thing, only in the message do we differentiate.

Given that, if 302,307 are ok for "pass", why are you concerned for "warning"? Warning is not defined as a "light error".

I am not sure I follow where 302,307 is a concern.

Thank you.

From what I understand of this conversation, it seems that the concern being raised is whether or not this spec implies that 302 and 307 have additional semantics which an implementation needs to take into account. If an HTTP client used in an implementation is configured to automatically follow redirects, the health-check implementation might not get an opportunity to respond to a 302 or 307 and interpret it according to this specification.

I guess the thing which may be missing is how the usual semantics of redirects from a /health endpoint would impact the response code semantics defined in this spec.

If GET /health responds with 302 or 307, client libraries will throw errors or exceptions. The expectation according to HTTP for a response that has status code 302 or 307 is that the response includes a Location: header with a new URL, and that the client is to repeat the request to that URL. A lot of client libraries do this by default, without the programmer having to take care of it, because redirects are so common.
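As one illustration, here is a minimal sketch using only Python's standard library. It starts a throwaway local server whose /health handler (a hypothetical WarnAs302Handler, invented for this demo) answers 302 with a JSON body and no Location header, then fetches it with urllib's default redirect handling. Other client libraries may behave differently; this shows one common client, not all of them.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class WarnAs302Handler(BaseHTTPRequestHandler):
    """Hypothetical /health handler that (mis)uses 302 to signal 'warn'."""

    def do_GET(self):
        body = json.dumps({"status": "warn"}).encode("utf-8")
        self.send_response(302)  # deliberately no Location header
        self.send_header("Content-Type", "application/health+json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_health(url: str):
    """Return (http_code, status) using urllib's default redirect handling."""
    # An empty ProxyHandler ignores any system proxy so the demo always
    # talks to the local server; redirect handling stays at its default.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=5) as response:
            return response.status, json.load(response)["status"]
    except urllib.error.HTTPError as error:
        # urllib has no Location to follow, so it raises instead of
        # handing the body to the caller as a normal response.
        return error.code, None


def run_demo():
    """Serve one 302 'warn' response locally and fetch it."""
    server = HTTPServer(("127.0.0.1", 0), WarnAs302Handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        return fetch_health(f"http://127.0.0.1:{port}/health")
    finally:
        server.shutdown()
        server.server_close()
```

In this sketch the caller ends up in the exception branch and never sees the "warn" status through the normal response path, which is the kind of breakage described above.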

In general, any "re-interpretation" of existing response codes by a new spec that is incompatible with the actual specification in HTTP/1.1 or HTTP/2 risks breaking existing user agents. It also makes life unnecessarily difficult for programmers who expect to use normal HTTP clients against /health endpoints when writing health aggregators, health monitors and so on.

What exactly makes "302 Found" or "307 Temporary Redirect" good candidates to indicate status: warn?

As a resolution, this spec could just explicitly require that all responses should otherwise comply with the HTTP specifications (this spec may already do this--I have not confirmed). This would imply the Location header would need to be included in case the response code is 302 or 307.

Thanks for all the feedback, but I feel the conversation has been distracted by the example status codes I used (granted, that was my fault). I'm still unclear how the codes map to the textual statuses. Maybe further examples will clarify what I mean:

  1. "Healthy true or false"
    pass : 20x
    warn : 20x see json for details
    fail : 50x

  2. distinct codes per status
    pass : 20x
    warn : 30x see json for details
    fail : 50x

  3. codes per degree of warning, fail
    pass : 20x
    warn : 30x something might be wrong
    warn : 30x+1 something a bit worse
    fail : 40x up but can't respond
    fail : 50x not working at all

IMHO clearly not 2 or 3. That's not how HTTP works, and the spec has to be compatible with HTTP.

307 Temporary Redirect seems just fine provided a redirect location is actually used that returns something meaningful. I don't see any conflict with the letter or spirit of the HTTP spec.

We do redirects precisely because the server is asked for resources it can't faithfully provide per how it defines those resources. The server can decide which criteria justify the existence of the /health resource. When those criteria aren't met, the server can say "well you wanted the good news, but I'm redirecting you to the not so good news". A redirect says the resource requested isn't available, but a viable substitute is.

For example, GET /health might redirect to /health_warnings. A client who wishes to know what those health_warnings are could then follow the redirect and get an application/health+json document. Since detailing the degraded health might be expensive, a polite client that doesn't need that information might decide not to follow the redirect.

202 Accepted is also interesting if you want to differentiate from 200 OK for good health. Differentiating by response code makes it really convenient to use monitoring tools to compute "good" state availability and not just the "up" state availability.

We've had this discussion in our team. In a nutshell, we're seeing /health as more of a business layer endpoint.

We do not agree with fail being in the 4xx-5xx range because:

  • 4xx errors indicate client-side errors => out of scope.
  • Out of the 5xx codes, we found that none is really relevant to a fail, and any of them could be mistaken for a load balancer or server-down issue.

For us, 200 seemed like a better candidate for all statuses:

  • The result would just be part of the JSON (parsing JSON nowadays is a piece of cake).
  • 200 indicates that our health check endpoint is up and has served our request correctly.
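A client under this "200 for everything" scheme might look like the sketch below. The function name read_status and the "unreachable" label are assumptions made for illustration; they come neither from the draft nor from the team's actual implementation.

```python
import json


def read_status(http_code: int, body: str) -> str:
    """Return the health status carried in the JSON body of a 200 response."""
    if http_code != 200:
        # Under this scheme a non-200 means the health endpoint itself
        # (or something in front of it) did not answer correctly.
        return "unreachable"  # hypothetical label, not part of the draft
    return str(json.loads(body)["status"]).lower()
```

The trade-off is that tooling which looks only at the response code sees every answered request as healthy, so "fail" is visible only to clients that parse the body.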