Out of memory exceptions under high load
Sh1ftry opened this issue
Hello,
we have been using the Cassandra plugin and decided to switch to the JDBC plugin with PostgreSQL. However, we have run into an issue that stops us from migrating: our service is unable to start due to out of memory exceptions when it restarts under high load with many events in the journal.
postgres=> select count(*) from journal group by persistence_id;
count
--------
133894
205311
233498
165092
166375
316931
144082
191159
325597
181759
173260
325610
157724
233744
341326
157182
214548
320728
(18 rows)
Plugin configuration:
slick {
  profile = "slick.jdbc.PostgresProfile$"
  db {
    host = "localhost"
    host = ${?POSTGRES_HOST}
    url = "jdbc:postgresql://"${slick.db.host}":5432/postgres?reWriteBatchedInserts=true"
    user = "postgres"
    user = ${?POSTGRES_USER}
    password = "postgres"
    password = ${?POSTGRES_PASSWORD}
    driver = "org.postgresql.Driver"
    numThreads = 5
    numThreads = ${?POSTGRES_THREADS}
    maxConnections = 5
    maxConnections = ${?POSTGRES_MAX_CONNECTIONS}
    minConnections = 1
    minConnections = ${?POSTGRES_MIN_CONNECTIONS}
  }
}

akka.persistence {
  journal.plugin = "jdbc-journal"
  snapshot-store.plugin = "jdbc-snapshot-store"
  at-least-once-delivery {
    warn-after-number-of-unconfirmed-attempts = 3
    max-unconfirmed-messages = 1000
    redeliver-interval = 10s
    redeliver-interval = ${?AT_LEAST_ONCE_DELIVERY_REDELIVER_INTERVAL}
    redelivery-burst-limit = 50
  }
}

jdbc-journal {
  slick = ${slick}
  batchSize = 40
  batchSize = ${?POSTGRES_JOURNAL_BATCH_SIZE}
  parallelism = 6
  parallelism = ${?POSTGRES_JOURNAL_PARALLELISM}
  event-adapters {
    msg-dispatched-adapter = "abc.xyz.MessageDispatchedAdapter"
    msg-processed-adapter = "abc.xyz.MessageProcessedAdapter"
  }
  event-adapter-bindings {
    "xyz.Abc$MessageProcessed" = msg-dispatched-adapter
    "xyz.Abc$MessageDispatched" = msg-processed-adapter
  }
}

jdbc-snapshot-store {
  slick = ${slick}
}

jdbc-read-journal {
  slick = ${slick}
  max-buffer-size = "50"
  max-buffer-size = ${?POSTGRES_READ_JOURNAL_MAX_BUFFER_SIZE}
}
Analysing the heap dump shows over a million byte array allocations:
byte[] 1,030,439 (44.9%) 781,349,704 B (91.5%) n/a
We are using the latest plugin version, which should read events from the journal in batches.
@Sh1ftry The number of events per persistence id seems quite large; do you use snapshots?
I know of one potential cause of this issue:
See the note at the end of https://scala-slick.org/doc/3.3.2/dbio.html#streaming
In akka-persistence-jdbc, we do not set statement parameters and we do not explicitly run the query to retrieve messages in a transaction!
This means that the PostgreSQL JDBC driver will retrieve all events for the persistenceId under the hood, even though Slick tries to create a stream out of it.
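For reference, the note in the Slick docs boils down to something like the following sketch. The table definition, column names and fetch size are illustrative placeholders, not the plugin's actual journal schema; only the combination of statement parameters and a transaction is the documented requirement for PostgreSQL streaming.

object SlickStreamingSketch {
  import slick.jdbc.PostgresProfile.api._
  import slick.jdbc.{ ResultSetConcurrency, ResultSetType }

  // Placeholder table mapping; not the real akka-persistence-jdbc schema.
  class JournalRows(tag: Tag) extends Table[(String, Long, Array[Byte])](tag, "journal") {
    def persistenceId  = column[String]("persistence_id")
    def sequenceNumber = column[Long]("sequence_number")
    def message        = column[Array[Byte]]("message")
    def * = (persistenceId, sequenceNumber, message)
  }
  val journal = TableQuery[JournalRows]

  def streamEvents(db: Database, pid: String) =
    db.stream(
      journal
        .filter(_.persistenceId === pid)
        .sortBy(_.sequenceNumber)
        .result
        // Forward-only, read-only cursor with an explicit fetch size, run
        // inside a transaction: without these the pgjdbc driver loads the
        // whole result set into memory before the stream emits anything.
        .withStatementParameters(
          rsType = ResultSetType.ForwardOnly,
          rsConcurrency = ResultSetConcurrency.ReadOnly,
          fetchSize = 100)
        .transactionally)
}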
We do take snapshots, but only every couple of minutes, so under very high load the number of events is large. Switching to taking snapshots every N events seems to have helped.
In akka-persistence-jdbc, we do not set statement parameters and we do not explicitly run the query to retrieve messages in a transaction!
Is there any reason why you are not doing this? If it has downsides, wouldn't it be useful to add a configuration parameter that, when explicitly set, enables it?
@WellingR Does implementing this require a lot of work? If not, we could possibly create a pull request with the change.
We do take snapshots, but we do it every couple of minutes, so with very high load the number of events is large
Would it be possible for you to snapshot every 2 minutes or every 50 or 100 events (whichever happens first)?
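For illustration, snapshotting every N events follows the standard pattern from the Akka Persistence docs, sketched below. State, Command, Event, ExampleProcessor and snapshotInterval are placeholder names; a periodic timer that also calls saveSnapshot could be combined with this for the "whichever happens first" behaviour.

object SnapshotEveryNEvents {
  import akka.persistence.{ PersistentActor, SnapshotOffer }

  final case class Command(payload: String)
  final case class Event(payload: String)
  final case class State(events: List[String] = Nil) {
    def updated(evt: Event): State = copy(events = evt.payload :: events)
  }

  class ExampleProcessor extends PersistentActor {
    override def persistenceId: String = "example-processor"

    private val snapshotInterval = 100
    private var state = State()

    override def receiveCommand: Receive = {
      case cmd: Command =>
        persist(Event(cmd.payload)) { evt =>
          state = state.updated(evt)
          // Save a snapshot every `snapshotInterval` persisted events so that
          // recovery only has to replay the tail of the journal.
          if (lastSequenceNr % snapshotInterval == 0 && lastSequenceNr != 0)
            saveSnapshot(state)
        }
    }

    override def receiveRecover: Receive = {
      case SnapshotOffer(_, snapshot: State) => state = snapshot
      case evt: Event                        => state = state.updated(evt)
    }
  }
}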
@renatocaval #248 is a PR which fixes a somewhat similar issue; however, there is a big difference between how we currently execute the eventsByX queries and how we retrieve the events from the journal.
For the eventsByX queries we query a number of events with a configured batch size, and we execute separate queries to retrieve the batches.
To retrieve events for the journal we simply run a query and return its complete result at once. While we do use the Slick reactive streams publisher, it appears to fetch all events returned by the database into memory at once, which can cause excessive memory consumption.
I see two possible fixes:
- doing what is described in https://scala-slick.org/doc/3.3.2/dbio.html#streaming (I am unsure whether this method works for all databases, but it is easy to implement, see #351)
- using a method similar to the eventsByX queries for the journal events, so we can retrieve the events in batches and execute multiple queries if not all events are returned in one batch (a sketch follows below)
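A rough illustration of the second option: page through the journal by sequence number with Akka Streams, issuing one bounded query per batch. JournalRow, eventsInBatches and readBatch are hypothetical names standing in for the DAO internals, not the plugin's actual API.

object BatchedReadSketch {
  import akka.NotUsed
  import akka.stream.scaladsl.Source
  import scala.concurrent.{ ExecutionContext, Future }

  // Hypothetical row type; the real DAO has its own row representation.
  final case class JournalRow(persistenceId: String, sequenceNr: Long, message: Array[Byte])

  // `readBatch` stands in for a DAO call that returns at most `batchSize` rows
  // starting at the given sequence number.
  def eventsInBatches(
      persistenceId: String,
      fromSequenceNr: Long,
      toSequenceNr: Long,
      batchSize: Int,
      readBatch: (String, Long, Long, Int) => Future[Vector[JournalRow]])(
      implicit ec: ExecutionContext): Source[JournalRow, NotUsed] =
    Source
      .unfoldAsync[Long, Vector[JournalRow]](fromSequenceNr) { from =>
        if (from > toSequenceNr) Future.successful(None)
        else
          readBatch(persistenceId, from, toSequenceNr, batchSize).map { rows =>
            if (rows.isEmpty) None                      // no more events: complete the stream
            else Some((rows.last.sequenceNr + 1, rows)) // next offset -> emit this batch
          }
      }
      .mapConcat(identity) // flatten each emitted batch into individual rows
}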
Related to slick/slick#1305
Thanks @WellingR, I will get to this soon. On my TODO list.
Note about incompatible changes in #370 that will be released in version 4.0.0.
The APIs for custom DAOs are not guaranteed to be binary backwards compatible between major versions of the plugin. For example, 4.0.0 is not binary backwards compatible with 3.5.x. There may also be source-incompatible changes to the APIs for custom DAOs if new capabilities must be added to the traits.
This change required the addition of a method messagesWithBatch to JournalDao and ReadJournalDao, which means that custom implementations of these will have to implement it.
messagesWithBatch is implemented in the BaseJournalDaoWithReadMessages trait, which is mixed in by BaseByteArrayJournalDao and BaseByteArrayReadJournalDao. That means that custom implementations of these will still be source compatible.