elixir-ecto / db_connection

Database connection behaviour

Home Page:http://hexdocs.pm/db_connection/DBConnection.html

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Telemetry metadata does not match docs

gvaughn opened this issue · comments

The docs for start_link has a section on the Telemetry payload
https://hexdocs.pm/db_connection/2.4.1/DBConnection.html#start_link/2-telemetry
which says the metadata should contain a pool. I was interested in using that to identify which Repo is giving connection errors so I could put a circuit breaker on it (we have a primary and followers).

I attached to [:db_connection, :connection_error] with a function that just inspects the binding and got this when I shut down my local postgres:

  connection error *********************************: [
  measurements: %{count: 1},
  metadata: %{
    error: %DBConnection.ConnectionError{
      message: "connection not available and request was dropped from queue after 2701ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:\n\n  1. Ensuring your database is available and that you can connect to it\n  2. Tracking down slow queries and making sure they are running fast enough\n  3. Increasing the pool_size (albeit it increases resource consumption)\n  4. Allowing requests to wait longer by increasing :queue_target and :queue_interval\n\nSee DBConnection.start_link/2 for more information\n",
      reason: :queue_timeout,
      severity: :error
    },
    opts: []
  },
  attach_opts: []
  ]

I'm not sure if the intent of the project is to provide the pool (also connection_listeners is documented and missing, but I'm not focused on that for my current feature) in the telemetry metadata, or whether the docs should be changed so no one expects it to be there.

In a more general sense, I'd like to ask advice on the best approach I could use to set up a circuit breaker so that if our follower database Repo is having connection issues, we could detect this and perform our circuit breaking logic. I had hoped a telemetry listener could use that pool to correlate to a Repo. Perhaps there's something at the Postgrex or Ecto layer I can use that I have overlooked.

@gvaughn can you do me a quick favor. Change this line on Ecto SQL:

https://github.com/elixir-ecto/ecto_sql/blob/4013ecbea6f6f68bafdba4268e4c3fc99765d05b/lib/ecto/adapters/sql.ex#L608

To do this:

meta = %{telemetry: telemetry, sql: connection, opts: [repo: config[:repo]] ++ opts}

And let me know if the repo will appear under opts.

Thanks for the speedy reply @josevalim !

That still leaves the opts in the metadata as an empty list. I'm going to continue to trace through the code, but wanted to report that for now.

@gvaughn that's very weird. If you can indeed find out where we are losing the options, it would be very helpful.

I tried to reproduce it and got the metadata in a different shape from the documented one, but got the pool info. I'm using postgrex.

[
  {[:db_connection, :connection_error], %{count: 1},
   %{
     error: %DBConnection.ConnectionError{
       message: "connection not available and request was dropped from queue after 2319ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:\n\n  1. Ensuring your database is available and that you can connect to it\n  2. Tracking down slow queries and making sure they are running fast enough\n  3. Increasing the pool_size (albeit it increases resource consumption)\n  4. Allowing requests to wait longer by increasing :queue_target and :queue_interval\n\nSee DBConnection.start_link/2 for more information\n",
       reason: :queue_timeout,
       severity: :error
     },
     opts: [
       log: #Function<14.83435571/1 in Ecto.Adapters.SQL.with_log/3>,
       source: "my_schema",
       timeout: 15000,
       pool_size: 10,
       pool: DBConnection.ConnectionPool
     ]
   }, nil}
]

Edit: I'm using ecto, ecto_sql and postgrex on master, and db_connection 2.4.0 (same result with 2.4.1)

Thank you @v0idpwn . That's interesting. As I did some tracing yesterday, I noticed that the codebase I'm working on has a custom init/2 callback in the Repo. I wonder if that is a factor. In any case, if I could get your result, it won't help my larger concern of correlating the telemetry event with a specific Repo.

The rest of this week is a US holiday, so I'm not expecting to have a lot more time to devote to this. If this issue should be closed so it doesn't sit idle for administrative purposes, please feel free.