Telemetry metadata does not match docs

Question

Telemetry metadata does not match docs

gvaughn opened this issue 3 years ago · comments

The docs for start_link has a section on the Telemetry payload
https://hexdocs.pm/db_connection/2.4.1/DBConnection.html#start_link/2-telemetry
which says the metadata should contain a pool. I was interested in using that to identify which Repo is giving connection errors so I could put a circuit breaker on it (we have a primary and followers).

I attached to [:db_connection, :connection_error] with a function that just inspects the binding and got this when I shut down my local postgres:

  connection error *********************************: [
  measurements: %{count: 1},
  metadata: %{
    error: %DBConnection.ConnectionError{
      message: "connection not available and request was dropped from queue after 2701ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:\n\n  1. Ensuring your database is available and that you can connect to it\n  2. Tracking down slow queries and making sure they are running fast enough\n  3. Increasing the pool_size (albeit it increases resource consumption)\n  4. Allowing requests to wait longer by increasing :queue_target and :queue_interval\n\nSee DBConnection.start_link/2 for more information\n",
      reason: :queue_timeout,
      severity: :error
    },
    opts: []
  },
  attach_opts: []
  ]

I'm not sure if the intent of the project is to provide the pool (also connection_listeners is documented and missing, but I'm not focused on that for my current feature) in the telemetry metadata, or whether the docs should be changed so no one expects it to be there.

In a more general sense, I'd like to ask advice on the best approach I could use to set up a circuit breaker so that if our follower database Repo is having connection issues, we could detect this and perform our circuit breaking logic. I had hoped a telemetry listener could use that pool to correlate to a Repo. Perhaps there's something at the Postgrex or Ecto layer I can use that I have overlooked.

José Valim · Answer 1 · Wed Nov 24 2021 03:03:47 GMT+0800 (China Standard Time)

@gvaughn can you do me a quick favor. Change this line on Ecto SQL:

https://github.com/elixir-ecto/ecto_sql/blob/4013ecbea6f6f68bafdba4268e4c3fc99765d05b/lib/ecto/adapters/sql.ex#L608

To do this:

meta = %{telemetry: telemetry, sql: connection, opts: [repo: config[:repo]] ++ opts}

And let me know if the repo will appear under opts.

Greg Vaughn · Answer 2 · Wed Nov 24 2021 03:25:42 GMT+0800 (China Standard Time)

Thanks for the speedy reply @josevalim !

That still leaves the opts in the metadata as an empty list. I'm going to continue to trace through the code, but wanted to report that for now.

José Valim · Answer 3 · Wed Nov 24 2021 03:34:02 GMT+0800 (China Standard Time)

@gvaughn that's very weird. If you can indeed find out where we are losing the options, it would be very helpful.

felipe stival · Answer 4 · Wed Nov 24 2021 17:35:39 GMT+0800 (China Standard Time)

I tried to reproduce it and got the metadata in a different shape from the documented one, but got the pool info. I'm using postgrex.

[
  {[:db_connection, :connection_error], %{count: 1},
   %{
     error: %DBConnection.ConnectionError{
       message: "connection not available and request was dropped from queue after 2319ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:\n\n  1. Ensuring your database is available and that you can connect to it\n  2. Tracking down slow queries and making sure they are running fast enough\n  3. Increasing the pool_size (albeit it increases resource consumption)\n  4. Allowing requests to wait longer by increasing :queue_target and :queue_interval\n\nSee DBConnection.start_link/2 for more information\n",
       reason: :queue_timeout,
       severity: :error
     },
     opts: [
       log: #Function<14.83435571/1 in Ecto.Adapters.SQL.with_log/3>,
       source: "my_schema",
       timeout: 15000,
       pool_size: 10,
       pool: DBConnection.ConnectionPool
     ]
   }, nil}
]

Edit: I'm using ecto, ecto_sql and postgrex on master, and db_connection 2.4.0 (same result with 2.4.1)

Greg Vaughn · Answer 5 · Thu Nov 25 2021 00:30:18 GMT+0800 (China Standard Time)

Thank you @v0idpwn . That's interesting. As I did some tracing yesterday, I noticed that the codebase I'm working on has a custom init/2 callback in the Repo. I wonder if that is a factor. In any case, if I could get your result, it won't help my larger concern of correlating the telemetry event with a specific Repo.

The rest of this week is a US holiday, so I'm not expecting to have a lot more time to devote to this. If this issue should be closed so it doesn't sit idle for administrative purposes, please feel free.