mtrudel / bandit

Bandit is a pure Elixir HTTP server for Plug & WebSock applications

Missing measurements in telemetry events

danschultzer opened this issue · comments

Hey @mtrudel,

We're looking to update opentelemetry_bandit to support the Bandit 1.4 breaking changes in open-telemetry/opentelemetry-erlang-contrib#311, and measurements are missing from some [:bandit, :request, :stop] telemetry events.

This happened for 404s with a Phoenix app.

Looking up the Bandit.Telemetry.stop_span calls, there are two places where measurements are not set:

Bandit.Telemetry.stop_span(span, %{}, %{error: error.message})

Bandit.Telemetry.stop_span(span, %{}, %{error: error.message})

Isn't this incorrect, as Bandit.Telemetry explicitly states that the following measurements will be present in [:bandit, :request, :stop] events?

* `[:bandit, :request, :stop]`
Represents the end of the span
This event contains the following measurements:
* `monotonic_time`: The time of this event, in `:native` units
* `duration`: The span duration, in `:native` units
* `req_header_end_time`: The time that header reading completed, in `:native` units
* `req_body_start_time`: The time that request body reading started, in `:native` units.
* `req_body_end_time`: The time that request body reading completed, in `:native` units
* `req_body_bytes`: The length of the request body, in octets
* `resp_start_time`: The time that the response started, in `:native` units
* `resp_end_time`: The time that the response completed, in `:native` units
* `resp_body_bytes`: The length of the response body, in octets. If the response is
compressed, this is the size of the compressed payload as sent on the wire
* `resp_uncompressed_body_bytes`: The length of the original, uncompressed body. Only
included for responses which are compressed
* `resp_compression_method`: The method of compression, as sent in the `Content-Encoding`
header of the response. Only included for responses which are compressed
This event contains the following metadata:
* `telemetry_span_context`: A unique identifier for this span
* `connection_telemetry_span_context`: The span context of the Thousand Island `:connection`
span which contains this request
* `conn`: The `Plug.Conn` representing this connection. Not present in cases where `error`
is also set and the nature of error is such that Bandit was unable to successfully build
the conn
* `error`: The error that caused the span to end, if it ended in error
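
For reference, this is roughly the shape of a consumer that relies on those documented measurements (a minimal sketch; the module name and handler id are made up here and are not part of Bandit or opentelemetry_bandit):

```elixir
# Minimal sketch of a [:bandit, :request, :stop] consumer. Only the event name,
# measurement keys, and metadata keys come from the docs quoted above; the
# module and handler id are illustrative.
defmodule MyApp.BanditStopHandler do
  require Logger

  def attach do
    :telemetry.attach(
      "myapp-bandit-request-stop",
      [:bandit, :request, :stop],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event([:bandit, :request, :stop], measurements, metadata, _config) do
    # For the error-case spans discussed in this issue, `measurements` arrives
    # as `%{}`, so these lookups fail instead of reporting the request.
    duration_ms = System.convert_time_unit(measurements.duration, :native, :millisecond)

    Logger.info(
      "request stop: status=#{metadata.conn.status} " <>
        "duration=#{duration_ms}ms resp_body_bytes=#{measurements.resp_body_bytes}"
    )
  end
end
```

Calling MyApp.BanditStopHandler.attach/0 at application start is enough to see the events.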

I haven't verified whether the 404 is getting into the latter exception handling, but I suspect it does.

Actually, we don't see the exceptions happening in our use case. It's weird: it seems that the conn/conn.adapter doesn't persist the measurements, even though send_resp gets called, so they should have been set. It happens for the 404, but I haven't had time to grok the whole code flow. Bottom line: when stop_span gets called in Bandit.Pipeline, the measurements are lost.

The error case1 telemetry events won't have many or most of the measurements; this is more or less by design, since it simplifies the mainline code so much. That being said, there should be some there; this may well be a bug.

Footnotes

  1. Error case meaning anything that Bandit returns internally (stream errors, semantic errors, etc). Any response that your plug sets on its own is considered 'normal' by Bandit, regardless of its response code.
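
For consumers that need to cope with this today, one option is to read the measurement and metadata maps defensively instead of assuming the full documented set (a sketch under that assumption; the module name is illustrative):

```elixir
# Defensive variant of a stop handler: tolerate the sparse measurement maps
# that error-case spans currently carry. The module name is illustrative.
defmodule MyApp.SafeBanditStopHandler do
  require Logger

  def handle_event([:bandit, :request, :stop], measurements, metadata, _config) do
    duration_ms =
      case Map.fetch(measurements, :duration) do
        {:ok, native} -> System.convert_time_unit(native, :native, :millisecond)
        :error -> nil
      end

    # `conn` may be absent when Bandit could not build it, and `error` is only
    # present for spans that ended in error (per the docs quoted above).
    status =
      case metadata do
        %{conn: %Plug.Conn{status: status}} -> status
        _ -> nil
      end

    Logger.info(
      "request stop: status=#{inspect(status)} duration_ms=#{inspect(duration_ms)} " <>
        "error=#{inspect(Map.get(metadata, :error))}"
    )
  end
end
```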

Yeah, it makes sense that there aren't really any measurements for errors, since it means the request halted somewhere halfway, so what would those measurements even be good for? But shouldn't that be a [:bandit, :request, :exception] event instead of a [:bandit, :request, :stop] event then?

In any case, that wouldn't address the issue we see in the otel library, though; that's a 404 that is handled by Phoenix (no exception raised to Bandit), but somehow the measurements get lost. I'll see if I can put together a reproducible example for you when I have some free time.

I'd love a repro for this if possible @danschultzer !
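
A standalone repro along these lines might be a starting point (a sketch only: NotFoundPlug stands in for a Phoenix router with no matching route, and the port and handler id are arbitrary):

```elixir
# Rough shape of a standalone repro: serve a plug that always responds 404
# (standing in for a Phoenix router with no matching route) and print which
# measurement keys arrive on the stop event. Names and port are arbitrary.
Mix.install([{:bandit, "~> 1.4"}])

defmodule NotFoundPlug do
  @behaviour Plug
  import Plug.Conn

  def init(opts), do: opts

  def call(conn, _opts) do
    # send_resp/3 is called here, so the resp_* measurements would be expected.
    send_resp(conn, 404, "not found")
  end
end

:telemetry.attach(
  "print-bandit-stop",
  [:bandit, :request, :stop],
  fn _event, measurements, _metadata, _config ->
    IO.inspect(Map.keys(measurements), label: "stop measurements")
  end,
  nil
)

{:ok, _} = Bandit.start_link(plug: NotFoundPlug, port: 4000)

# curl http://localhost:4000/anything and compare the printed keys with the
# list documented above.
Process.sleep(:infinity)
```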

Looking into the pipeline code in more detail now that the post-1.4 world has settled down a bit, I think the only open question may be whether protocol errors should be exposed as a stop event or an exception event here:

Bandit.Telemetry.stop_span(span, %{}, %{error: error.message})

I'm open to either option, and am happy to take guidance from the telemetry folks on this.

Closing due to inactivity. As I mentioned above, I'm happy to hew the telemetry for these cases to whatever the otel folks want. Once/if someone has a concrete proposal, just open a ticket and I'll do my best to make it happen.