DBConnection.Ownership: an allowed process exiting causes the owner's connection to be lost

jonleighton opened this issue 3 years ago

Jon Leighton commented 3 years ago

If I have two processes:

* Process A checks out a connection via DBConnection.Ownership.ownership_checkout/2
* Process B is allowed access to A's connection via DBConnection.Ownership.ownership_allow/4

B runs a long-running query, or otherwise makes use of the connection.

For some reason or other, B is then killed. We see a log message like this:

21:56:25.779 [error] Postgrex.Protocol (#PID<0.382.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.428.0> exited

DBConnection disconnected the database connection, even though the process that exited didn't own the connection.

Now Process A is no longer able to use its connection, even though it owns it.

This seems like surprising behaviour to me, but maybe I'm missing something? Maybe it's not possible to prevent this scenario?

I have written a test that demonstrates this problem. It is in the context of a Phoenix/Ecto app, and uses Ecto.Adapters.SQL.Sandbox, but the DBConnection.Ownership calls above are what is happening under the hood.

When I run that test I see the following:

➜ mix test test/connection_test.exs
22:11:54.508 [error] Postgrex.Protocol (#PID<0.380.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.428.0> exited


  1) test connection problem (Test.ConnectionTest)
     test/connection_test.exs:8
     ** (DBConnection.OwnershipError) cannot find ownership process for #PID<0.426.0>.

     When using ownership, you must manage connections in one
     of the four ways:

     * By explicitly checking out a connection
     * By explicitly allowing a spawned process
     * By running the pool in shared mode
     * By using :caller option with allowed process

     The first two options require every new process to explicitly
     check a connection out or be allowed by calling checkout or
     allow respectively.

     The third option requires a {:shared, pid} mode to be set.
     If using shared mode in tests, make sure your tests are not
     async.

     The fourth option requires [caller: pid] to be used when
     checking out a connection from the pool. The caller process
     should already be allowed on a connection.

     If you are reading this error, it means you have not done one
     of the steps above or that the owner process has crashed.

     See Ecto.Adapters.SQL.Sandbox docs for more information.
     code: IO.inspect(Posts.list_posts())
     stacktrace:
       (ecto_sql 3.6.2) lib/ecto/adapters/sql.ex:757: Ecto.Adapters.SQL.raise_sql_call_error/1
       (ecto_sql 3.6.2) lib/ecto/adapters/sql.ex:693: Ecto.Adapters.SQL.execute/5
       (ecto 3.6.2) lib/ecto/repo/queryable.ex:224: Ecto.Repo.Queryable.execute/4
       (ecto 3.6.2) lib/ecto/repo/queryable.ex:19: Ecto.Repo.Queryable.all/3
       test/connection_test.exs:20: (test)



Finished in 0.07 seconds (0.07s async, 0.00s sync)
1 test, 1 failure

Randomized with seed 415869

Some additional context

I experienced this problem in the real world via a flakey test that does full-stack browser testing via Wallaby. The test runs with async: true, and the database connection owned by the test process is shared with the web endpoint via Phoenix.Ecto.SQL.Sandbox.

Here's roughly what I think was happening when the test failed:

1. My test caused the browser to load a page
2. JS code on the page began an asynchronous request to the server
3. My test caused the browser to navigate to a different page
4. The browser aborted the in-flight asynchronous request
5. The server detected that the (HTTP) connection had been closed, and so shut down the process tree associated with that request
6. This caused the problem above: the database connection was disconnected because a process that didn't own it exited

José Valim · Answer 1 · Tue Aug 17 2021 21:13:46 GMT+0800 (China Standard Time)

Hi @jonleighton, how is it going? :D

Yes, this can happen in cases like the one above because, if a process crashes while using the connection, we don't know what state the connection is in. Did you receive part of the select? Are you in a transaction? And so on. Therefore all we can do is abort.

The scenario you described for Wallaby can definitely happen and is likely what is happening. Cowboy will send an exit signal to the request process if the connection terminates. You can however stop this by calling Process.flag(:trap_exit, true). I would recommend adding a plug to your endpoint, running only in test, that:

* Checks if you are inside a Wallaby test
* If so, calls Process.flag(:trap_exit, true)
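
A minimal sketch of such a plug is below. It is not the code from any library: the module name is made up, and the check for a browser test is an assumption, namely that such requests carry the "BeamMetadata" marker that the Phoenix.Ecto.SQL.Sandbox metadata places in the user-agent header. Adjust the check to whatever identifies test traffic in your app.

```elixir
defmodule MyAppWeb.Plugs.TrapExitInBrowserTests do
  # Hypothetical module plug (the name is made up for this sketch).
  # It follows the Plug contract (init/1 and call/2) but only reads
  # conn.req_headers, so it has no compile-time dependency on Plug.

  # Assumption: browser-test requests are recognisable by this marker
  # in the user-agent header.
  @marker "BeamMetadata"

  def init(opts), do: opts

  def call(conn, _opts) do
    if browser_test_request?(conn.req_headers) do
      # Exit signals sent to this request process now arrive as
      # {:EXIT, from, reason} messages instead of terminating it.
      Process.flag(:trap_exit, true)
    end

    conn
  end

  # req_headers is a list of {name, value} tuples with lowercase names.
  def browser_test_request?(req_headers) do
    Enum.any?(req_headers, fn {name, value} ->
      name == "user-agent" and String.contains?(value, @marker)
    end)
  end
end
```

In an endpoint this would sit behind a test-only guard, in the same place where the SQL sandbox plug is mounted, so that production request processes keep their normal exit behaviour.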

That should fix intermittent reproductions. I assume you are using the Plug SQL Sandbox? Perhaps we should make it easy to add this there.
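
Why trapping exits helps can be seen in isolation with plain processes. The sketch below is only an illustration: the module name is invented, and the :shutdown reason is chosen for the demo rather than taken from Cowboy. A worker that does not trap exits is terminated by the signal, while a worker that traps exits receives it as an ordinary message and gets to decide what to do.

```elixir
defmodule ExitSignalDemo do
  # Starts a worker, sends it an exit signal, and reports what happened.
  # The :shutdown reason is chosen for illustration only.
  def run(trap_exit?) do
    parent = self()

    pid =
      spawn(fn ->
        Process.flag(:trap_exit, trap_exit?)
        send(parent, {:ready, self()})

        # Stands in for a request process that is busy with a query.
        receive do
          {:EXIT, _from, reason} -> send(parent, {:trapped, self(), reason})
        end
      end)

    ref = Process.monitor(pid)

    # Wait until the worker has set its flag before signalling it.
    receive do
      {:ready, ^pid} -> :ok
    end

    Process.exit(pid, :shutdown)

    result =
      receive do
        # The worker handled the signal as a message.
        {:trapped, ^pid, reason} -> {:survived, reason}
        # The worker was terminated by the signal.
        {:DOWN, ^ref, :process, ^pid, reason} -> {:terminated, reason}
      end

    Process.demonitor(ref, [:flush])
    result
  end
end

IO.inspect(ExitSignalDemo.run(false)) # {:terminated, :shutdown}
IO.inspect(ExitSignalDemo.run(true))  # {:survived, :shutdown}
```

Note that an exit signal with reason :kill cannot be trapped, so this only protects against ordinary exit signals.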

José Valim · Answer 2 · Tue Aug 17 2021 21:31:27 GMT+0800 (China Standard Time)

Can you try this commit and let me know how it goes? phoenixframework/phoenix_ecto@1d8d28a

Jon Leighton · Answer 3 · Wed Aug 18 2021 08:48:36 GMT+0800 (China Standard Time)

Hi @josevalim! I am enjoying being an Elixir programmer these days so thank you 😁

And thanks for pushing that fix to phoenix_ecto, I'm convinced you must be some kind of hyper advanced AI bot with that sort of response time 🤣

In the real-world app where I encountered this problem, we have tweaked our flakey test to avoid it, and it was quite rare to get a repro anyway. But I have updated the test app I made to:

It works a charm 🙂 I did have to increase :queue_interval due to the slightly contrived conditions of the test.
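
For readers wondering where that setting lives: :queue_interval is a DBConnection pool option (default 1000 ms, alongside :queue_target at 50 ms) and can be passed through the repo configuration. The fragment below is only a sketch; the app and repo names are placeholders, and the value is illustrative, not the one used in the test app.

```elixir
# config/test.exs
import Config

# :my_app and MyApp.Repo are placeholder names for this sketch.
config :my_app, MyApp.Repo,
  pool: Ecto.Adapters.SQL.Sandbox,
  # Illustrative value only; the default is 1000 (milliseconds).
  queue_interval: 5_000
```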

So I think we can close this, high fives all around 👏