elixir-ecto / db_connection

Database connection behaviour

Home Page: http://hexdocs.pm/db_connection/DBConnection.html


Pool inconsistency

sabudaye opened this issue

Hello, a month ago @tverlaan reported issue elixir-ecto/ecto#2946,
and @fishcakez made a fix in #191
with this line: https://github.com/elixir-ecto/db_connection/blob/master/lib/db_connection/holder.ex#L272
I'm working on the same team as @tverlaan; two weeks ago we got a crash caused by DBConnection:
application.log

The crash dump doesn't say much; here is the recon report for it:
recon_result.txt

A crash is much better than hanging, but it's still not a good thing :)
We hoped to collect more data from a few more crashes, but we haven't had any since 11 April. So I think the best we can do for now is to create this issue; maybe the Ecto team can use this information.

Thanks, really appreciated!

Btw, what is your OTP version?

Elixir version: 1.7
Erlang version: 21
Database and version: Postgres 9.4 (with BDR 1.0)
Ecto version: 3.0.6
Ecto SQL version: 3.0.4
Database adapter and version: Postgrex 0.14.1
Operating system: Ubuntu Xenial

We got 4 more crashes in two different applications on 4 different servers connected to the same database. I tried to collect as much data as possible, but I don't think it will help :(
One application (3 servers):
s1.zip
s2.zip
s3.zip

Other application:
s4.zip

Hi @josevalim / @fishcakez,

We're still experiencing this issue once every week or two. For unrelated reasons we had to switch to MySQL (using the Mariaex adapter), and we still experience the same issue there. We would like some help and guidance (or a fix 😉) for our problem, as we're stuck on why we would get a holder which no longer exists. It doesn't seem possible with the fix for checkin, yet it still happens.

I should add that it most frequently happens from our Phoenix.Channels. It could be a coincidence, or it could be related to the things we specifically do there, like complex queries or simply more queries. At the same time we see that we have a lot of duplicate_joins, which means that some of our processes often get killed, probably while interacting with the connection pool. This is just a hunch and I can't confirm it.

So far we haven't been able to reproduce it, and we would really appreciate it if you could look into this or suggest what we can do next.

@tverlaan this is very tricky because I can't see how this error can happen. The only functions that delete the holder (the thing that is missing) are functions in the connection pool itself, and every time we delete it, we guarantee not to check the holder in. The other way the holder can be deleted is if the connection pool process dies, but that is not the case here, as the pool is the process doing the checkout.

I can only think of these possible reasons:

  1. Are you using a custom pool or the ownership pool in production? What is the result of MyApp.Repo.config for you? Please remove any credentials.

  2. The same connection is being checked in twice. I can't see this happening in the code either.

@sabudaye @tverlaan I have added a new commit that includes a bit more information when such errors happen. Can you please give master a try and post the updated error once it occurs?

Also, @tverlaan, what is your OTP version?

Thanks for helping us with this!

Here is the info you asked for:

Erlang/OTP 21 [erts-10.0.4] [source] [64-bit] [smp:24:24] [ds:24:24:10] [async-threads:1] [hipe]

Interactive Elixir (1.8.1) - press Ctrl+C to exit (type h() ENTER for help)
iex> MyApp.Repo.config
[
  otp_app: :my_app,
  timeout: 15000,
  pool_size: 10,
  size: 10,
  migration_source: "migrations",
  telemetry_prefix: [:my_app, :repo],
  migration_lock: nil,
  charset: "utf8mb4",
  hostname: "127.0.0.1",
  port: "3306",
  username: "username",
  password: "password",
  database: "myapp_db"
]

Is it possible that the connection pool restarts and somehow the holder gets into an incorrect state? I haven't checked myself yet; it's just a random thought. Anyway, we'll try out master and report back.

If the pool restarts, we lose the queue and all holders on it, so that shouldn't be a concern. And to confirm the options you pasted: there is no pool configuration in prod, correct?
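
As a side note, here is a minimal standalone sketch (not project code, just the general ETS rule behind that statement): a table is destroyed when its owning process exits, unless an alive heir takes it over.

parent = self()

owner =
  spawn(fn ->
    # A plain table with no heir, to show what happens when the owner exits.
    tab = :ets.new(:queue_demo, [:public])
    send(parent, {:tab, tab})

    receive do
      :stop -> :ok
    end
  end)

tab =
  receive do
    {:tab, tab} -> tab
  end

ref = Process.monitor(owner)
send(owner, :stop)

receive do
  {:DOWN, ^ref, :process, ^owner, _} -> :ok
end

# The table goes away together with its owner.
IO.inspect(:ets.info(tab))  # => :undefined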

Ok, thanks for clearing that up. We don't have any pool configuration.

Hi @josevalim and @fishcakez,

After more than two weeks, we had this issue again today, twice. I exported the error log of one node, from the first error until the node crashed, into CSV:
db_connection_csv.txt

Example of output with ETS INFO:

Process #PID<0.5297.112> terminating
** (exit) exited in: DBConnection.Holder.checkout(#PID<0.23559.758>, [log: #Function<11.126414906/1 in Ecto.Adapters.SQL.with_log/3>, source: "some_table", timeout: 15000, pool_size: 10, migration_lock: nil, pool: DBConnection.ConnectionPool])
    ** (EXIT) an exception was raised:
        ** (ArgumentError) DBConnection.Holder #Reference<0.2018872586.3276814.242729> could not be given away, pool inconsistent.

ETS INFO: [id: #Reference<0.2018872586.3276814.242729>, read_concurrency: false, write_concurrency: false, compressed: false, memory: 319, owner: #PID<0.12644.760>, heir: #PID<0.23559.758>, name: DBConnection.Holder, size: 1, node: :"appname@server", named_table: false, type: :ordered_set, keypos: 1, protection: :public]

Please report at https://github.com/elixir-ecto/db_connection/issues

           (db_connection) lib/db_connection/holder.ex:244: DBConnection.Holder.handle_checkout/3
            (db_connection) lib/db_connection/connection_pool.ex:50: DBConnection.ConnectionPool.handle_info/2
            (stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
            (stdlib) gen_server.erl:711: :gen_server.handle_msg/6
            (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
    (db_connection) lib/db_connection/holder.ex:137: DBConnection.Holder.checkout/2
    (db_connection) lib/db_connection.ex:1030: DBConnection.checkout/3
    (db_connection) lib/db_connection.ex:1340: DBConnection.run/6
    (db_connection) lib/db_connection.ex:596: DBConnection.execute/4
    (ecto_sql) lib/ecto/adapters/mysql/connection.ex:30: Ecto.Adapters.MySQL.Connection.execute/4
    (ecto_sql) lib/ecto/adapters/sql.ex:569: Ecto.Adapters.SQL.execute!/4
    (ecto_sql) lib/ecto/adapters/sql.ex:551: Ecto.Adapters.SQL.execute/5
    (ecto) lib/ecto/repo/queryable.ex:153: Ecto.Repo.Queryable.execute/4
Initial Call: Phoenix.Channel.Server.init/1
Ancestors: [#PID<0.2894.0>, #PID<0.2881.0>, App.I.Endpoint, App.I.Supervisor, #PID<0.2846.0>]

Hi, I think it might be useful if I post some statistics on how we use the database. We do 71k select queries and 3k queries of other types per minute, about 1,200 queries per second.
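
In case it's useful, here is a rough sketch (hypothetical QueryCounter module and table name, assuming the telemetry_prefix shown earlier and a recent ecto/telemetry where events go through the Erlang :telemetry app) of how such counts can be collected from Ecto's telemetry events instead of scraping logs:

defmodule QueryCounter do
  # Attach from a long-lived process, since it owns the counter table.
  def attach do
    :ets.new(:query_counts, [:named_table, :public])
    # Ecto emits telemetry_prefix ++ [:query], i.e. [:my_app, :repo, :query] here.
    :telemetry.attach("query-counter", [:my_app, :repo, :query], &__MODULE__.handle_event/4, nil)
  end

  def handle_event(_event, _measurements, _metadata, _config) do
    # Bump a single counter; this runs in the process executing the query.
    :ets.update_counter(:query_counts, :queries, 1, {:queries, 0})
  end

  def count do
    case :ets.lookup(:query_counts, :queries) do
      [{:queries, n}] -> n
      [] -> 0
    end
  end
end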

Hey @sabudaye, I'm not ignoring you, I just don't have an answer. I'll share my analysis of the log from last week:

When the pool (#PID<0.639.759>) crashes:

GenServer #PID<0.639.759> terminating
** (ArgumentError) DBConnection.Holder #Reference<0.2018872586.34734089.222731> could not be given away, pool inconsistent.

ETS INFO: [id: #Reference<0.2018872586.34734089.222731>, read_concurrency: false, write_concurrency: false, compressed: false, memory: 319, owner: #PID<0.26856.736>, heir: #PID<0.639.759>, name: DBConnection.Holder, size: 1, node: :"appname@server", named_table: false, type: :ordered_set, keypos: 1, protection: :public]

Please report at https://github.com/elixir-ecto/db_connection/issues

    (db_connection) lib/db_connection/holder.ex:244: DBConnection.Holder.handle_checkout/3
    (db_connection) lib/db_connection/connection_pool.ex:50: DBConnection.ConnectionPool.handle_info/2
    (stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:711: :gen_server.handle_msg/6
    (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Last message: {:db_connection, {#PID<0.17184.358>, #Reference<0.2018872586.2139619331.81807>}, {:checkout, [#PID<0.17184.358>], -575906073946, true}}

We see it crashing due to a connection holder/ETS table give-away attempt where the owner of the table is not the pool: owner: #PID<0.26856.736>, heir: #PID<0.639.759>. Here the heir is the pool (#PID<0.639.759>) and the owner is another process (#PID<0.26856.736>).
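
For anyone following along, here is a minimal standalone sketch (not project code) of the owner/heir mechanics in play: the holder table names the pool as heir, so if its current owner dies, the heir inherits the table via an :"ETS-TRANSFER" message.

parent = self()

owner =
  spawn(fn ->
    # The spawned process owns the table; `parent` plays the role of the pool
    # and is configured as the heir.
    tab = :ets.new(:holder_demo, [:public, {:heir, parent, :pool_heir}])
    send(parent, {:tab, tab})

    receive do
      :stop -> :ok
    end
  end)

tab =
  receive do
    {:tab, tab} -> tab
  end

send(owner, :stop)

# When the owner exits, the heir receives the table plus an :"ETS-TRANSFER"
# message naming the old owner and the heir data.
receive do
  {:"ETS-TRANSFER", ^tab, ^owner, :pool_heir} ->
    IO.inspect(:ets.info(tab, :owner), label: "owner after inheritance")
end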

Once the connection pool crashes we see two processes crash. If we check the last message from the connection pool error:

Last message: {:db_connection, {#PID<0.17184.358>, #Reference<0.2018872586.2139619331.81807>}, {:checkout, [#PID<0.17184.358>], -575906073946, true}}

So the checkout request from #PID<0.17184.358> triggered the crash, and this process crashes too:

GenServer #PID<0.17184.358> terminating
** (stop) exited in: DBConnection.Holder.checkout(#PID<0.639.759>, [log: #Function<11.126414906/1 in Ecto.Adapters.SQL.with_log/3>, binary_as: :field_type_medium_blob, timeout: 15000, pool_size: 10, migration_lock: nil, pool: DBConnection.ConnectionPool])
    ** (EXIT) an exception was raised:
        ** (ArgumentError) DBConnection.Holder #Reference<0.2018872586.34734089.222731> could not be given away, pool inconsistent.

ETS INFO: [id: #Reference<0.2018872586.34734089.222731>, read_concurrency: false, write_concurrency: false, compressed: false, memory: 319, owner: #PID<0.26856.736>, heir: #PID<0.639.759>, name: DBConnection.Holder, ...

However we also see another process crash at the same time, #PID<0.26856.736>:

GenServer #PID<0.26856.736> terminating
** (stop) exited in: DBConnection.Holder.checkout(#PID<0.639.759>, [log: #Function<11.126414906/1 in Ecto.Adapters.SQL.with_log/3>, binary_as: :field_type_medium_blob, timeout: 15000, pool_size: 10, migration_lock: nil, pool: DBConnection.ConnectionPool])
    ** (EXIT) an exception was raised:
        ** (ArgumentError) DBConnection.Holder #Reference<0.2018872586.34734089.222731> could not be given away, pool inconsistent.

ETS INFO: [id: #Reference<0.2018872586.34734089.222731>, read_concurrency: false, write_concurrency: false, compressed: false, memory: 319, owner: #PID<0.26856.736>, heir: #PID<0.639.759>, name: DBConnection.Holder, ...

This is quite interesting because this process, #PID<0.26856.736>, is the owner of the ETS table that caused the initial crash, and it is requesting a checkout itself. If no other crashes have been filtered out, hopefully we are already close to a minimal example with 2 client processes. We see the same pattern with 3 processes in later crashes too.

To shrink this it would be useful to know the order of the requests in the pool from the two client processes. I'll refer to the processes by name: the pool (#PID<0.639.759>), the caller (#PID<0.17184.358>) and the owner (#PID<0.26856.736>).

If the caller's message is handled before the owner's message in the pool, it means that the owner is somehow the owner of the connection holder/ETS table at the head of the connection queue, and coincidentally also trying a new checkout. This means that the owner must have recently checked in its ETS table, but not given ownership to the pool. Is it possible to check in but not pass ownership?

The only way to check in an ETS table is via the :"ETS-TRANSFER" message, which is only triggered by :ets.give_away/3, and we also check that the pool is the owner before enqueueing:

def handle_info({:"ETS-TRANSFER", holder, _, {msg, queue, extra}}, {_, queue, _} = data) do
  case msg do
    :checkin ->
      owner = self()

      # Only enqueue if this pool process actually owns the holder table.
      case :ets.info(holder, :owner) do
        ^owner ->
          handle_checkin(holder, extra, data)

        :undefined ->
          {:noreply, data}
      end

    # (other msg clauses elided in this quote)
  end
end
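
As a refresher, here is a minimal standalone sketch (not project code) of the :ets.give_away/3 contract this relies on: a successful give-away both transfers ownership and delivers an :"ETS-TRANSFER" message to the new owner.

parent = self()

receiver =
  spawn(fn ->
    receive do
      {:"ETS-TRANSFER", tab, from, gift} ->
        # `from` is the giving process; by the time this message arrives, the
        # receiving process is already the owner of the table.
        send(parent, {:transferred, from, gift, :ets.info(tab, :owner)})
    end
  end)

tab = :ets.new(:holder_demo, [:public])
true = :ets.give_away(tab, receiver, :checkin_demo)

receive do
  {:transferred, from, :checkin_demo, new_owner} ->
    IO.inspect({from == parent, new_owner == receiver}, label: "giver was us / receiver now owns")
end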

Therefore, let's consider that the pool starts in a good state and the owner's message is handled before the caller's in the pool process. This would imply, for the owner's message:

  1. the pool passed ownership of the holder/ETS table to the owner
  2. the pool didn't dequeue the holder from its queue, or enqueued the holder into its queue
  3. the owner didn't get sent an :"ETS-TRANSFER" message (or at least didn't receive it before the DOWN monitor from the pool)

This is concerning because 1. and 3. should contradict each other at the VM level, given the signal ordering guarantees between 2 processes. However, if we ignore 3. temporarily, it would mean that :ets.give_away succeeded but we didn't dequeue in our code. Let's check that:

def handle_info({:db_connection, from, {:checkout, _caller, _now, _queue?}} = checkout, ready) do
  {:ready, queue, _codel} = ready

  case :ets.first(queue) do
    {_time, holder} = key ->
      Holder.handle_checkout(holder, from, queue) and :ets.delete(queue, key)
      {:noreply, ready}

    # (other clauses elided in this quote)
  end
end

case Holder.handle_checkout(holder, from, queue) do
  true when delay < min ->
    {:noreply, {:busy, queue, %{codel | delay: delay}}}

  true ->
    {:noreply, {:busy, queue, codel}}

  false ->
    dequeue(time, holder, queue, codel)
end

So our pool logic always checks the handle_checkout/3 return value: if true, it dequeues (or does not enqueue); if false, it keeps the holder around. handle_checkout/3 is:

def handle_checkout(holder, {pid, mref}, ref) do
  :ets.give_away(holder, pid, {mref, ref})
rescue
  ArgumentError ->
    if Process.alive?(pid) do
      raise ArgumentError, no_holder(holder)
    else
      false
    end
end

Therefore, if :ets.give_away succeeds we return true; otherwise the client is dying and we get false. So 1. and 2. should not occur, and therefore we don't need to be concerned about the contradiction between 1. and 3.
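
To make the two ArgumentError paths concrete, here is a minimal standalone sketch (not project code) of when :ets.give_away/3 raises; the first case is the one handle_checkout/3 converts into false, the second is the one it re-raises when the target is alive.

# A table we own initially.
tab = :ets.new(:give_away_demo, [:public])

# A pid that is guaranteed to be dead before we use it.
dead = spawn(fn -> :ok end)
ref = Process.monitor(dead)

receive do
  {:DOWN, ^ref, :process, ^dead, _} -> :ok
end

# Case 1: giving to a dead pid raises ArgumentError
# (handle_checkout/3 rescues this and returns false).
try do
  :ets.give_away(tab, dead, :gift)
rescue
  ArgumentError -> IO.puts("give_away to a dead pid raises")
end

# Case 2: giving away a table we do not own also raises ArgumentError
# (with an alive target, handle_checkout/3 re-raises instead).
other = spawn(fn -> Process.sleep(:infinity) end)
true = :ets.give_away(tab, other, :gift)

try do
  :ets.give_away(tab, self(), :gift)
rescue
  ArgumentError -> IO.puts("give_away by a non-owner raises")
end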

However, now I'm stuck, because we've ruled out enqueuing a holder/ETS table without the pool owning it, and we've ruled out not dequeuing a holder/ETS table after handing over ownership. Our next step would be to work out whether the analysis missed any situations.

Thanks @sabudaye and @tverlaan, I have pushed a new commit that enforces more invariants and prints more information when those scenarios happen here: 29d9fce

Can you please update your checkout and send us the logs once again when it happens? Thank you!

@fishcakez and @josevalim thank you for your help! We are going to deploy it to production ASAP.
In parallel, our team has made huge progress in reducing the number of queries we make to the database, so I'm "afraid" this issue might not happen again :)
I'll keep you updated if we get more info.

It suddenly happened again this morning, more than a month after the last time. Before the error we had just 15 queries per second to the database.

This is an error message we got:

Child DBConnection.ConnectionPool of Supervisor App.Db.Repo terminated
** (exit) an exception was raised:
    ** (ArgumentError) DBConnection.Holder #Reference<0.2182039963.4205445125.212128> does not belong to the giving process, pool inconsistent.
Error happened when attempting to transfer to #PID<0.23874.1375> (alive: true).

SELF: #PID<0.4430.1200>
ETS INFO: [id: #Reference<0.2182039963.4205445125.212128>, read_concurrency: false, write_concurrency: false, compressed: false, memory: 319, owner: #PID<0.17644.1475>, heir: #PID<0.4430.1200>, name: DBConnection.Holder, size: 1, node: :"app_tel@srv-priv", named_table: false, type: :ordered_set, keypos: 1, protection: :public]

Please report at https://github.com/elixir-ecto/db_connection/issues

        (db_connection) lib/db_connection/holder.ex:244: DBConnection.Holder.handle_checkout/3
        (db_connection) lib/db_connection/connection_pool.ex:50: DBConnection.ConnectionPool.handle_info/2
        (stdlib) gen_server.erl:637: :gen_server.try_dispatch/4
        (stdlib) gen_server.erl:711: :gen_server.handle_msg/6
        (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Pid: #PID<0.4430.1200>
Start Call: Ecto.Repo.Supervisor.start_child({DBConnection.ConnectionPool, :start_link, [{Mariaex.Protocol, [datetime: :structs, repo: App.Db.Repo, otp_app: :app_db, timeout: 15000, pool_size: 10, size: 10, migration_source: "app_db_migrations", telemetry_prefix: [:app_db, :repo], migration_lock: nil, charset: "utf8mb4", hostname: "0.0.0.0", port: "3306", username: "username", password: "password", database: "db", pool: DBConnection.ConnectionPool]}]}, Ecto.Adapters.MySQL, #Reference<0.2182039963.889585673.94389>, %{opts: [timeout: 15000, pool_size: 10, migration_lock: nil, pool: DBConnection.ConnectionPool], sql: Ecto.Adapters.MySQL.Connection, telemetry: {App.Db.Repo, :debug, [:app_db, :repo, :query]}})

Hi @tverlaan and @sabudaye! Does this issue still happen from time to time for you? Especially on more recent Erlang/OTP versions?

Hi @josevalim!

I haven't seen it since September last year.
What we did since then:

  1. changed the Repo config (see the config sketch below) with:
     pool_size: 24,
     prepare: :unnamed,
     queue_target: 5000,
     queue_interval: 500,
  2. fixed duplicate joins in Phoenix Channels, so channels are not crashing anymore
  3. reduced the amount of DB queries from a single node by 10-15x
  4. updated OTP to 22.3.4.12 and Elixir to 1.8.2
  5. switched from Mariaex to MyXQL

And a lot more, but I think these 5 are the most related to the fact that we don't see connection pool issues anymore.
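
For reference, a minimal sketch (hypothetical app and repo names, values taken from the list above) of how those pool settings look in the repo configuration:

# config/prod.exs (hypothetical MyApp.Repo)
use Mix.Config

config :my_app, MyApp.Repo,
  pool_size: 24,
  prepare: :unnamed,
  queue_target: 5000,
  queue_interval: 500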

Thanks for all the help!