commanded / commanded

Use Commanded to build Elixir CQRS/ES applications

Aggregate crashes due to DB connection timeout

slashmili opened this issue

Hi,
I'm using Commanded v1.2.0.

In some rare cases, I see this error message:

GenServer terminating (Protocol.UndefinedError): protocol Enumerable not implemented for {:error, %DBConnection.ConnectionError{message: "connection not available and request was dropped from queue after 1996ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:\n\n 1. Ensuring your database is available and that you can connect to it\n 2. Tracking down slow queries and making sure they are running fast enough\n 3. Increasing the pool_size (albeit it increases resource consumption)\n 4. Allowing requests to wait longer by increasing :queue_target and :queue_interval\n\nSee DBConnection.start_link/2 for more information\n", reason: :queue_timeout, severity: :error}} of type Tuple. This protocol is implemented for the following type(s): Ecto.Adapters.SQL.Stream, Postgrex.Stream, DBConnection.PrepareStream, DBConnection.Stream, Timex.Interval, MerkleMap, Map, File.Stream, Date.Range, MapSet, IO.Stream, GenEvent.Stream, List, HashDict, HashSet, Range, Function, Stream (Most recent call last)

File lib/enum.ex line 1 in Enumerable.impl_for!/1 (elixir)
File lib/enum.ex line 141 in Enumerable.reduce/3 (elixir)
File lib/enum.ex line 3473 in Enum.reduce/3 (elixir)
File lib/commanded/aggregates/aggregate.ex line 181 in Commanded.Aggregates.Aggregate.handle_continue/2 (commanded)
File gen_server.erl line 689 in :gen_server.try_dispatch/4 (stdlib)
File gen_server.erl line 431 in :gen_server.loop/7 (stdlib)
File proc_lib.erl line 226 in :proc_lib.init_p_do_apply/3 (stdlib)

It looks like it is caused by this part of the code: https://github.com/commanded/commanded/blob/v1.2.0/lib/commanded/aggregates/aggregate.ex#L398-L409

It's possible to handle this error in the case statement. I'm wondering what should happen in this case:
should we return the state, or should we actually bring down the aggregate?

    case EventStore.stream_forward(
           application,
           aggregate_uuid,
           aggregate_version + 1,
           @read_event_batch_size
         ) do
      {:error, :stream_not_found} ->
        # aggregate does not exist, return initial state
        state
      {:error, _error} ->
        # open question: return state (as this placeholder does) or crash the GenServer?
        state
      event_stream ->
        rebuild_from_event_stream(event_stream, state)
    end
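To make the two options concrete, here is a minimal sketch written outside of Commanded. The module `RebuildSketch`, the function `rebuild/2`, and the `read_stream` argument are hypothetical stand-ins, not Commanded's API. It assumes the read call returns either an enumerable of events or an `{:error, reason}` tuple, as in the snippet above, and it passes the error back to the caller instead of choosing between returning state and crashing.

```elixir
defmodule RebuildSketch do
  # Hypothetical stand-in for the aggregate's rebuild step; not Commanded's code.
  # `read_stream` is any zero-arity function returning either an enumerable of
  # events or an `{:error, reason}` tuple.
  def rebuild(state, read_stream) when is_function(read_stream, 0) do
    case read_stream.() do
      {:error, :stream_not_found} ->
        # Aggregate does not exist yet: keep the initial state.
        {:ok, state}

      {:error, reason} ->
        # Any other read failure is surfaced to the caller instead of being
        # enumerated, which is what raised Protocol.UndefinedError in the trace.
        {:error, reason}

      event_stream ->
        {:ok, Enum.reduce(event_stream, state, &apply_event/2)}
    end
  end

  # Toy event application that only counts events; a real aggregate would
  # apply each event to its own state here.
  defp apply_event(_event, count), do: count + 1
end
```

With this shape the caller decides what a `{:error, reason}` result means: stop the process with that reason, or keep the old state and report the failure.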

I am seeing this on my dev machine when stress testing a simple Commanded app. To see how Commanded behaves, I spawned 2k processes that each dispatch a command with a different UUID. This overwhelms the DB connection queue. Once it starts dropping connection requests, the following cascade of errors appears (the long string describing how to mitigate it at the DB level is replaced by [...]):

09:14:03.002 [warn]  Failed to read events from stream due to: "connection not available and request was dropped from queue after 230ms. [...]"
 
09:14:03.002 [warn]  Failed to read events from stream id 485590 due to: "[...]"
 
09:14:03.002 [error] EventStore notifications failed to read events due to: "[...]"

09:14:03.015 [warn]  Failed to ack last seen event on stream "$all" named "TestCases.Projections.InventoryCountsProjector" due to: %DBConnection.ConnectionError{message: "[...]", reason: :queue_timeout, severity: :error}

09:14:03.017 [warn]  Failed to ack last seen event on stream "$all" named "TestCases.Projections.ProductsProjector" due to: %DBConnection.ConnectionError{message: "[...]", reason: :queue_timeout, severity: :error}

09:14:03.051 [error] ** (FunctionClauseError) no function clause matching in EventStore.Storage.Appender.handle_response/1
    (eventstore 1.3.1) lib/event_store/storage/appender.ex:155: EventStore.Storage.Appender.handle_response({:error, %DBConnection.ConnectionError{message: "[...]", reason: :queue_timeout, severity: :error}})
    (eventstore 1.3.1) lib/event_store/storage/appender.ex:27: anonymous fn/5 in EventStore.Storage.Appender.append/4
    (elixir 1.12.2) lib/enum.ex:935: anonymous fn/3 in Enum.each/2
    (elixir 1.12.2) lib/enum.ex:3952: anonymous fn/3 in Enum.each/2
    (elixir 1.12.2) lib/stream.ex:1707: anonymous fn/3 in Enumerable.Stream.reduce/3
    (elixir 1.12.2) lib/stream.ex:285: Stream.after_chunk_while/2
    (elixir 1.12.2) lib/stream.ex:1736: Enumerable.Stream.do_done/2
    (elixir 1.12.2) lib/enum.ex:3952: Enum.each/2

10:53:06.233 [error] GenServer {TestCases.App.LocalRegistry, {TestCases.App, TestCases.Inventory, "inventory-a636d44f-4842-4e1e-bff3-49245a709c9b"}} terminating
** (stop) %FunctionClauseError{args: nil, arity: 1, clauses: nil, function: :handle_response, kind: nil, module: EventStore.Storage.Appender}
Last message (from #PID<0.14788.1>): {:execute_command, %Commanded.Aggregates.ExecutionContext{causation_id: "36afe893-bb54-4921-9acd-cb594bbfc5c6", command: %TestCases.Inventory.Commands.ReceiveInventory{count: 73, product_id: "a636d44f-4842-4e1e-bff3-49245a709c9b", warehouse_id: "90182e2a-d9e3-4a1a-90f3-32ef131f4139"}, correlation_id: "938fce6b-f1a5-4d5c-9e1c-cafda96d6332", function: :execute, handler: TestCases.Inventory, lifespan: Commanded.Aggregates.DefaultLifespan, metadata: %{}, retry_attempts: 10, returning: false}}

What are the consequences when the database starts dropping queued connection requests? It looks as if the command is 'successful' but the event is not committed. What is the right way to recover here?

@jdewar If an aggregate process cannot append events to its event stream, an error should be returned to the caller of the command dispatch function. The guarantee is that :ok is only returned from a command dispatch when all events produced by the command have been successfully persisted. In this scenario you could manually retry the failed command or return the error to the initiator (e.g. HTTP request, Web UI).
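The manual retry mentioned above could look roughly like the sketch below. `DispatchRetry` and `with_retry/3` are hypothetical names, not part of Commanded. The sketch assumes the wrapped call returns `:ok` or `{:error, reason}`, and that the command is safe to send more than once.

```elixir
defmodule DispatchRetry do
  # Hypothetical helper, not part of Commanded. `dispatch` is a zero-arity
  # function wrapping the real dispatch call; it is assumed to return
  # :ok or {:error, reason}.
  def with_retry(dispatch, attempts, backoff_ms \\ 100)
      when is_function(dispatch, 0) and attempts > 0 do
    case dispatch.() do
      :ok ->
        # Events were persisted; nothing more to do.
        :ok

      {:error, _reason} when attempts > 1 ->
        # Events were not persisted; wait, then retry with doubled backoff.
        Process.sleep(backoff_ms)
        with_retry(dispatch, attempts - 1, backoff_ms * 2)

      {:error, reason} ->
        # Out of attempts: hand the error back to the initiator.
        {:error, reason}
    end
  end
end
```

A call might look like `DispatchRetry.with_retry(fn -> MyApp.dispatch(command) end, 3)`, where `MyApp` stands for your own Commanded application module. Doubling the backoff gives an overloaded connection pool time to drain rather than adding more load immediately.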