elixir-grpc / grpc

An Elixir implementation of gRPC

Home Page: https://hex.pm/packages/grpc

Mint Adapter crashes

rob-brown opened this issue

Describe the bug
I've been testing out GRPC.Client.Adapters.Mint. On almost every service we try it on, we get crashes like this:

GenServer #PID<0.5931.0> terminating
** (stop) exited in: GenServer.call(#PID<0.8619.0>, {:consume_response, {:headers, [{"content-type", "application/grpc+proto"}, {"date", "Tue, 06 Feb 2024 21:57:45 GMT"}, {"server", "envoy"}, {"x-envoy-upstream-service-time", "97"}]}}, 5000)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir 1.15.6) lib/gen_server.ex:1074: GenServer.call/3
    (grpc 0.7.0) lib/grpc/client/adapters/mint/stream_response_process.ex:66: GRPC.Client.Adapters.Mint.StreamResponseProcess.consume/3
    (grpc 0.7.0) lib/grpc/client/adapters/mint/connection_process/connection_process.ex:232: GRPC.Client.Adapters.Mint.ConnectionProcess.process_response/2
    (elixir 1.15.6) lib/enum.ex:2510: Enum."-reduce/3-lists^foldl/2-0-"/3
    (grpc 0.7.0) lib/grpc/client/adapters/mint/connection_process/connection_process.ex:190: GRPC.Client.Adapters.Mint.ConnectionProcess.handle_info/2
    (stdlib 4.3.1.2) gen_server.erl:1123: :gen_server.try_dispatch/4
    (stdlib 4.3.1.2) gen_server.erl:1200: :gen_server.handle_msg/6
    (stdlib 4.3.1.2) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
Last message: {:tcp, #Port<0.160>, <<0, 0, 32, 1, 4, 0, 0, 0, 21, 136, 204, 97, 150, 223, 105, 126, 148, 3, 138, 97, 44, 106, 8, 2, 105, 65, 6, 227, 110, 220, 105, 181, 49, 104, 223, 203, 127, 1, 2, 57, 55, 0, 0, 128, 0, 0, 0, ...>>}

To Reproduce
Steps to reproduce the behavior:

Change adapter from GRPC.Client.Adapters.Gun to GRPC.Client.Adapters.Mint.
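
For reference, a minimal sketch of the switch, assuming the adapter is passed via the :adapter option to GRPC.Stub.connect/2 (the endpoint is a placeholder):

    # Hypothetical endpoint; only the :adapter option changes.
    # Before (default Gun adapter):
    {:ok, channel} =
      GRPC.Stub.connect("example.internal:50051", adapter: GRPC.Client.Adapters.Gun)

    # After:
    {:ok, channel} =
      GRPC.Stub.connect("example.internal:50051", adapter: GRPC.Client.Adapters.Mint)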

Expected behavior
No crashes.

Logs

Same stack trace as shown in the description above.

Protos
None

Versions:

  • OS: Ubuntu 22.04.3 LTS

  • Erlang: 25.1.2

  • Elixir: 1.14.1

  • mix.lock (grpc, gun, cowboy, cowlib):

    • cowboy 2.10.0 (Hex package) (rebar3)
      locked at 2.10.0 (cowboy) 3afdccb7
    • cowlib 2.12.1 (Hex package) (rebar3)
      locked at 2.12.1 (cowlib) 163b73f6
    • grpc 0.7.0 (Hex package) (mix)
      locked at 0.7.0 (grpc) 632a9507
    • gun 2.0.1 (Hex package) (rebar3)
      locked at 2.0.1 (gun) a10bc8d6

Additional context

I don't have a minimal example that demonstrates the issue. We have one small service that runs the Mint adapter just fine; every other service we've tried it on crashes frequently with the log above.

At first, I thought it might be related to #345. However, we already have our own connection pooling in place, and by my calculations our pool configuration should have 4-10x the necessary capacity.

After auditing the code, I have some questions. First and foremost, why is the Mint adapter trapping exits on a process it doesn't own? My best guess is that it spawns and links a couple of processes (GRPC.Client.Adapters.Mint.ConnectionProcess and GRPC.Client.Adapters.Mint.StreamResponseProcess) and doesn't want crashes in those processes to bring down the calling process.

I don't believe exit trapping is an appropriate choice. From the docs:

Setting :trap_exit to true - trapping exits should be used only in special circumstances as it would make your process immune to not only exits from the task but from any other processes.

Additionally, since trapping exits turns exit signals into messages delivered to the trapping process, this can cause other problems. If that process is a GenServer that implements handle_info without a catch-all clause, the message will crash the GenServer. If it's some other kind of process, its mailbox will slowly grow over time with these exit messages.
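
For illustration, a hedged sketch of the kind of catch-all clause a caller would need in order to survive those messages (the module name and logging are hypothetical, not something the adapter provides):

    defmodule MyApp.ChannelOwner do
      use GenServer
      require Logger

      def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

      @impl true
      def init(opts), do: {:ok, opts}

      # Without a clause like this, an {:EXIT, pid, reason} message produced by a
      # trapped exit raises FunctionClauseError and takes the GenServer down.
      @impl true
      def handle_info({:EXIT, _pid, reason}, state) do
        Logger.warning("linked gRPC process exited: #{inspect(reason)}")
        {:noreply, state}
      end

      def handle_info(_msg, state), do: {:noreply, state}
    end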

Once one or more of these processes crashes, what would restart them? The calling process only learns about the crash from the exit-signal message and doesn't have the knowledge needed to restart the processes. Wouldn't it be better to start GRPC.Client.Adapters.Mint.ConnectionProcess and GRPC.Client.Adapters.Mint.StreamResponseProcess under a DynamicSupervisor or equivalent? That would let connections restart properly, and it could confine a crash to the affected request instead of all requests on the connection.
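
A rough sketch of that supervision approach, assuming the existing connection process can be started as an ordinary child (the supervisor module below is hypothetical, not part of the library):

    defmodule GRPC.Client.Adapters.Mint.SupervisorSketch do
      # Hypothetical: shows the shape of the idea, not the adapter's actual API.
      use DynamicSupervisor

      def start_link(init_arg) do
        DynamicSupervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
      end

      @impl true
      def init(_init_arg) do
        DynamicSupervisor.init(strategy: :one_for_one)
      end

      # Each connection (and, similarly, each stream response process) would be
      # started here instead of being spawn-linked from the caller, so a crash is
      # handled by the supervisor rather than propagated as an exit signal.
      def start_connection(opts) do
        spec = {GRPC.Client.Adapters.Mint.ConnectionProcess, opts}
        DynamicSupervisor.start_child(__MODULE__, spec)
      end
    end

Callers could then monitor the returned pid instead of linking to it, so a dead stream would only affect the request that owned it.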

I've already spent a week investigating this. I've now exceeded my time box to look into it. However, I can try to answer questions as needed.

Hi @rob-brown!

I created a small repo to try to reproduce your issue: https://github.com/Nezteb/absinthe_graphql_playground

I'll try to find some time over the next couple weeks to reproduce it. If you have any ideas of how I might reproduce this sooner, feel free to let me know! 😄

My main focus was concurrent connections, since I thought that was the likely cause. So far it doesn't seem to be. There are some other, non-happy-path cases I haven't tested, such as connection timeouts, high latency, and clients disconnecting abruptly.

I've since seen a few log series like the following in one service:

GenServer #PID<0.8616.0> terminating
** (Protocol.UndefinedError) protocol Enumerable not implemented for nil of type Atom. This protocol is implemented for the following type(s): DBConnection.PrepareStream, DBConnection.Stream, Date.Range, Ecto.Adapters.SQL.Stream, File.Stream, Flow, Function, GenEvent.Stream, HashDict, HashSet, IO.Stream, Jason.OrderedObject, List, Map, MapSet, Phoenix.LiveView.LiveStream, Postgrex.Stream, Range, Stream
    (elixir 1.15.6) lib/enum.ex:1: Enumerable.impl_for!/1
    (elixir 1.15.6) lib/enum.ex:166: Enumerable.reduce/3
    (elixir 1.15.6) lib/enum.ex:1227: Enum.find_value/3
    (epg 0.1.0) iex:402: anonymous fn/3 in [REDACTED]
    (epg 0.1.0) iex:399: [REDACTED]
    (phoenix_live_view 0.20.1) lib/phoenix_live_view/utils.ex:462: anonymous fn/5 in Phoenix.LiveView.Utils.call_handle_params!/5
    (telemetry 1.2.1) /opt/app/deps/telemetry/src/telemetry.erl:321: :telemetry.span/3
    (phoenix_live_view 0.20.1) lib/phoenix_live_view/channel.ex:547: Phoenix.LiveView.Channel.maybe_call_mount_handle_params/4
Last message: {Phoenix.Channel, %{"flash" => nil, "params" => %{"_csrf_token" => "HV4sO399WQhMITs3XDkBfT0dChAKGAYsS4bSK2kfuRlq6_RHNmKiyRGF", "_mounts" => 1, "_track_static" => ["[REDACTED]"]}, "session" => "[REDACTED]", "static" => "[REDACTED]", "url" => "[REDACTED]"}, {#PID<0.8614.0>, #Reference<0.1015246407.1051197442.43806>}, %Phoenix.Socket{assigns: %{}, channel: Phoenix.LiveView.Channel, channel_pid: nil, endpoint: EpgWeb.Endpoint, handler: Phoenix.LiveView.Socket, id: nil, joined: false, join_ref: "63", private: %{connect_info: %{session: %{"_csrf_token" => "NjNh4O2n9sWFjfS5spAysJAj"}}}, pubsub_server: nil, ref: nil, serializer: Phoenix.Socket.V2.JSONSerializer, topic: "lv:phx-F7FjdQqscG17sqIB", transport: :websocket, transport_pid: #PID<0.8614.0>}}

GenServer #PID<0.5931.0> terminating
** (stop) exited in: GenServer.call(#PID<0.8619.0>, {:consume_response, {:headers, [{"content-type", "application/grpc+proto"}, {"date", "Tue, 06 Feb 2024 21:57:45 GMT"}, {"server", "envoy"}, {"x-envoy-upstream-service-time", "97"}]}}, 5000)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir 1.15.6) lib/gen_server.ex:1074: GenServer.call/3
    (grpc 0.7.0) lib/grpc/client/adapters/mint/stream_response_process.ex:66: GRPC.Client.Adapters.Mint.StreamResponseProcess.consume/3
    (grpc 0.7.0) lib/grpc/client/adapters/mint/connection_process/connection_process.ex:232: GRPC.Client.Adapters.Mint.ConnectionProcess.process_response/2
    (elixir 1.15.6) lib/enum.ex:2510: Enum."-reduce/3-lists^foldl/2-0-"/3
    (grpc 0.7.0) lib/grpc/client/adapters/mint/connection_process/connection_process.ex:190: GRPC.Client.Adapters.Mint.ConnectionProcess.handle_info/2
    (stdlib 4.3.1.2) gen_server.erl:1123: :gen_server.try_dispatch/4
    (stdlib 4.3.1.2) gen_server.erl:1200: :gen_server.handle_msg/6
    (stdlib 4.3.1.2) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
Last message: {:tcp, #Port<0.160>, <<0, 0, 32, 1, 4, 0, 0, 0, 21, 136, 204, 97, 150, 223, 105, 126, 148, 3, 138, 97, 44, 106, 8, 2, 105, 65, 6, 227, 110, 220, 105, 181, 49, 104, 223, 203, 127, 1, 2, 57, 55, 0, 0, 128, 0, 0, 0, ...>>}

GenServer ExRPC.Channel terminating
** (FunctionClauseError) no function clause matching in ExRPC.Channel.handle_info/2
    (ex_rpc 10.0.4) lib/ex_rpc/channel.ex:74: ExRPC.Channel.handle_info({:EXIT, #PID<0.5931.0>, {:noproc, {GenServer, :call, [#PID<0.8619.0>, {:consume_response, {:headers, [{"content-type", "application/grpc+proto"}, {"date", "Tue, 06 Feb 2024 21:57:45 GMT"}, {"server", "envoy"}, {"x-envoy-upstream-service-time", "97"}]}}, 5000]}}}, [REDACTED])
    (stdlib 4.3.1.2) gen_server.erl:1123: :gen_server.try_dispatch/4
    (stdlib 4.3.1.2) gen_server.erl:1200: :gen_server.handle_msg/6
    (stdlib 4.3.1.2) proc_lib.erl:240: :proc_lib.init_p_do_apply/3
Last message: {:EXIT, #PID<0.5931.0>, {:noproc, {GenServer, :call, [#PID<0.8619.0>, {:consume_response, {:headers, [{"content-type", "application/grpc+proto"}, {"date", "Tue, 06 Feb 2024 21:57:45 GMT"}, {"server", "envoy"}, {"x-envoy-upstream-service-time", "97"}]}}, 5000]}}}

It appears that throwing an exception in a handler can kill the stream response process, which produces the error I'm seeing. It also caused an exit message to be sent to a GenServer whose handle_info wasn't expecting it, so that GenServer crashed too.

This is not the typical case. Almost every time I've seen the crash, it was not due to the handler crashing. There were no other related logs with the crash to provide more context.

Hey @rob-brown

Sorry, I haven't been able to check on this repo's issues. I'll try to understand what is happening with your code sometime in the next few weeks (we could even schedule a pairing session to try to understand this together).

A few highlights:

After auditing the code, I have some questions. First and foremost, why is the Mint adapter trapping exits?

While designing the adapter, I tried to keep the behavior consistent with what the Gun adapter does, especially when setting up the connection. So you might see some shenanigans in the code that exist to keep their behavior similar (or close to it).

If the maintainers agree, I'm open to redesigning the adapter to be more idiomatic Elixir (with proper OTP handling of the possible failures), but that would cause the adapters to differ in their behavior.


Now, about your exception: what I can tell you is that something is killing the stream_response process outside the scope of the adapter. Could it be that the process that starts the request (not the connection) is dying before the response arrives?

It's been a while since I last looked at this, but I never saw any crash reports or logs pointing to the owning processes crashing. That's what I would have expected and looked for first. I'm not aware of any logic that terminates connections early, either.