elixir-ecto / db_connection

Database connection behaviour

Home Page: http://hexdocs.pm/db_connection/DBConnection.html

Ownership manager switching to manual after client (not owner) exiting

michallepicki opened this issue

I have a Phoenix + Absinthe application with a React + Apollo frontend that we run integration tests on using ExUnit and Wallaby. The tests are unfortunately "flaky" because of a random DBConnection.OwnershipError, and I think this could be an issue in db_connection.

Because we haven't yet managed to set up all processes to check out connections automatically in the sandbox (e.g. some Absinthe PubSub processes), all integration tests are async: false, so we're using shared mode with start_owner! in setup, as documented on Phoenix master:

  setup tags do
    pid = Ecto.Adapters.SQL.Sandbox.start_owner!(DB.Repo, shared: not tags[:async])
    metadata = Phoenix.Ecto.SQL.Sandbox.metadata_for(DB.Repo, pid)
    {:ok, session} = Wallaby.start_session(metadata: metadata)

    on_exit(fn ->
      Wallaby.end_session(session)
      Ecto.Adapters.SQL.Sandbox.stop_owner(pid)
    end)

    {:ok, session: session}
  end

I logged the test's self() PID, which is #PID<0.2936.0>, and the owner (pid in the above snippet) is #PID<0.2937.0>.

There's a lot happening when the test is clicking quickly through the app, and sometimes this log shows up, which is (I think?) harmless:

[error] Postgrex.Protocol (#PID<0.2317.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.3246.0> exited

or at least I don't see any other error that would suggest we have a bug in our app that causes it. The Ecto sandbox docs explain that the owner exiting could cause problems, but here it's the client that exits.

This seems fine, but after that error I found that DBConnection.Ownership.Manager receives this message:

{:DOWN, #Reference<0.233185003.3586654209.205229>, :process, #PID<0.2938.0>, {:shutdown, %DBConnection.ConnectionError{message: "client #PID<0.3246.0> exited", reason: :error, severity: :error}}}

so it runs this code, which calls into this code, and switches from mode {:shared, #PID<0.2937.0>} to :manual, despite the fact that it's not the owner process that was just downed.
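Roughly, the clause in question behaves something like this (a simplified paraphrase of the behaviour described above, not the actual DBConnection.Ownership.Manager source; the state fields are illustrative):

  # Simplified paraphrase, NOT the real DBConnection.Ownership.Manager code.
  # The manager keeps a monitor reference (mode_ref below) from monitoring the
  # owner of the shared mode; a :DOWN carrying that reference resets the mode.
  def handle_info({:DOWN, ref, :process, _pid, _reason}, %{mode_ref: ref} = state) do
    # The ref matched the monitor on the shared owner, so unshare:
    # {:shared, owner_pid} becomes :manual again.
    {:noreply, %{state | mode: :manual, mode_ref: nil}}
  end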

Afterwards, this log shows up for a subsequent request:

11:44:07.613 [error] #PID<0.3263.0> running UI.Endpoint (connection #PID<0.2964.0>, stream id 7) terminated
Server: localhost:5000 (http)
Request: POST /some/graphql
** (exit) an exception was raised:
    ** (DBConnection.OwnershipError) cannot find ownership process for #PID<0.3263.0>.

When using ownership, you must manage connections in one
of the four ways:

* By explicitly checking out a connection
* By explicitly allowing a spawned process
* By running the pool in shared mode
* By using :caller option with allowed process

The first two options require every new process to explicitly
[...]

and the Wallaby test fails while trying to parse the Phoenix error page as JSON.

Question

When mode is shared, shouldn't the unshare pattern match be based on the owner process PID and not mode_ref?

The only process that we monitor in the manager is the owner process, so if you are getting into that branch, it is because the owner process is terminating too.

For now, I think that makes sense, and here is why: a process exited while it was using the connection. Because of this, we have no idea what the connection's state is. Maybe it wrote some bytes to it? Maybe there is something left over in the write buffer? We have no option other than to close the connection and kill its owner, because we can't do anything else with it reliably.
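As a rough illustration of that failure mode (a minimal sketch, assuming the DB.Repo from the setup snippet above and a Postgres backend; pg_sleep is only there to keep the query in flight):

  # Minimal sketch: a client process dies while it is actively using a
  # connection. Assumes DB.Repo from the setup above and a Postgres backend.
  client =
    spawn(fn ->
      # Long-running query so we can kill the client mid-flight.
      DB.Repo.query!("SELECT pg_sleep(5)")
    end)

  Process.sleep(100)
  Process.exit(client, :kill)
  # The pool cannot know what state that connection is in now, so it logs
  # "client #PID<...> exited", closes the connection, and (in the shared
  # sandbox) the owner goes down with it.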

Right, so it's correct that the ref is used; that makes sense, thank you! I should look into why the request-handling process exited. Unfortunately I don't see any reason in the logs. I hope it's not something like Cowboy killing the process because the browser aborted the request...

Again, thanks!

I hope it's not something like Cowboy killing the process because the browser aborted the request...

It might be the case... but also note that it has to happen while the client is actively using the connection, i.e. doing a query or inside a transaction. You can try doing a Process.flag(:trap_exit, true) in the request process and see how it changes things.
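For the request process that could be, for example, a tiny plug added near the top of the endpoint in the test environment (a minimal sketch; the module name is made up):

  # Minimal sketch of the Process.flag(:trap_exit, true) suggestion as a plug.
  # The module name is made up; plug it near the top of the endpoint, ideally
  # only in the test environment.
  defmodule UI.TrapExitPlug do
    @behaviour Plug

    @impl true
    def init(opts), do: opts

    @impl true
    def call(conn, _opts) do
      # Turn exit signals into messages so the request process isn't killed
      # abruptly while it may still be holding a DB connection.
      Process.flag(:trap_exit, true)
      conn
    end
  end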

It does look like this is the case. I added some slow SQL queries to the GraphQL query during which the request-handling process was usually exiting, and I can now make the test fail reliably. So to "fix" it I probably need to wait for the page to become stable and finish receiving responses for the non-critical GraphQL requests as well.
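One way to do that with Wallaby is to block on a DOM marker that the frontend only renders once all pending GraphQL requests have settled (a sketch; the data-loaded attribute is hypothetical and the app would have to set it):

  # Sketch: wait for the page to report that it has settled before letting
  # the test (and its on_exit) proceed. The [data-loaded='true'] attribute is
  # hypothetical; the frontend would need to set it once all pending GraphQL
  # requests have completed.
  import Wallaby.Browser
  alias Wallaby.Query

  assert_has(session, Query.css("[data-loaded='true']"))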

Integration testing is hard. Maybe there's room for improvement here, such as configuring Cowboy so that it doesn't kill the process so abruptly for an aborted request, but that seems outside the Ecto ecosystem.

Hi @michallepicki, I found this issue after I filed a very similar report: #247

@josevalim has now committed a fix to Phoenix.Ecto.SQL.Sandbox which does the :trap_exit dance automatically: phoenixframework/phoenix_ecto@1d8d28a