tattle-made / DAU

MCA Tipline for Deepfakes

Identify Optimum PostGres settings for launch

dennyabrain opened this issue

One of the bottlenecks in the infra is the number of concurrent writes our database can support.
One of the scenarios we want to be prepared for is receiving ten lakh (1,000,000) messages over an hour, i.e. roughly 280 writes per second on average. We have strategies in place to scale our web servers vertically or horizontally. That means these web servers would open connections to our Postgres instance. The database client library used in the web server (Ecto) has built-in support for managing connection pools.
The scope of this feature is to find out:

  1. For a given amount of RAM, how many simultaneous connections a Postgres server can handle
  2. Best practices for configuring connection pooling in Ecto (the database library used in the web server)
  3. The ideal combination of Postgres configuration, web server count, and connection pool size to handle our usage scenario

source

With Postgres, each new connection can take up to 1.3 MB of memory.
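
As a quick sanity check, the server's configured connection limit and memory-related settings can be inspected from Elixir via Ecto's SQL adapter. A minimal sketch, assuming a repo module named DAU.Repo (a hypothetical name for this project's Ecto repo):

```elixir
# Inspect Postgres connection and memory settings through the Ecto repo.
# DAU.Repo is a hypothetical repo module name; substitute the real one.
for setting <- ["max_connections", "shared_buffers", "work_mem"] do
  %{rows: [[value]]} = Ecto.Adapters.SQL.query!(DAU.Repo, "SHOW #{setting}", [])
  IO.puts("#{setting}: #{value}")
end
```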

One thing that works in our favour is that we need this high throughput for only one specific write operation: writing incoming WhatsApp messages into the database. Since these are pure inserts, we don't have to worry about any update-collision issues.
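
Because the hot path is insert-only, one option is to batch incoming messages and write them with a single multi-row insert, which cuts round trips and shortens the time each pooled connection is held. A minimal sketch, assuming a hypothetical DAU.Message schema and DAU.Repo:

```elixir
# Batch insert-only writes for incoming WhatsApp messages.
# DAU.Message and DAU.Repo are hypothetical names for illustration.
defmodule DAU.Inbox do
  def store_batch(messages) when is_list(messages) do
    now = DateTime.utc_now() |> DateTime.truncate(:second)

    rows =
      Enum.map(messages, fn msg ->
        %{
          sender: msg.sender,
          payload: msg.payload,
          inserted_at: now,
          updated_at: now
        }
      end)

    # insert_all writes all rows in one statement, holding one
    # pooled connection briefly instead of one checkout per message.
    DAU.Repo.insert_all(DAU.Message, rows)
  end
end
```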

Tradeoffs for client-side pooling (e.g. Ecto) vs. middleware (external) pooling (e.g. PgBouncer, Pgpool)

My understanding is that using something like PgBouncer helps in situations where you have to create and remove database connections quickly, or where you want more client connections than your database instance is actually configured to support.

One example I personally came across was when using Elixir together with Kubernetes. During some of our operations we would spawn many Kubernetes pods at the same time, and all of them would try to get database connections together. At those times it was easy to go over the number of database connections that our Postgres instance actually supported.
From my experience and understanding, the disadvantages of an external pool are:
- additional failure point (including from a security standpoint)
- additional latency
- additional complexity in deployment
- security complications w/ user credentials

Usually, a connection pool on the application side is a good thing for the reasons you detail. An external connection pool only makes sense if
- your application server does not have a connection pool
- you have several (many) instances of the application server, so that you cannot effectively limit the number of database connections with a connection pool in the application server
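
To make the second point concrete: with application-side pooling alone, Postgres sees roughly instances × pool_size connections in total, so the per-instance pool has to be sized against the server-wide limit. A back-of-the-envelope sketch, with all numbers illustrative:

```elixir
# Back-of-the-envelope connection budget; all values are illustrative.
max_connections = 100   # Postgres max_connections setting
reserved = 10           # headroom for superuser/maintenance sessions
instances = 6           # web server instances (e.g. Kubernetes pods)

pool_size = div(max_connections - reserved, instances)
IO.puts("pool_size per instance: #{pool_size}")
# => pool_size per instance: 15
```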

Other relevant info

When applications reach a certain scale, a single database may not be enough to sustain the required throughput. In such scenarios, it is very common to introduce read replicas: all write operations are sent to the primary database and most of the read operations are performed against the replicas. The credentials of the primary and replica databases are typically known upfront by the time the code is compiled.
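
The Ecto documentation sketches this with one primary repo plus read-only replica repos; a condensed version of that pattern (module names are the docs' placeholders, not this project's):

```elixir
defmodule MyApp.Repo do
  use Ecto.Repo,
    otp_app: :my_app,
    adapter: Ecto.Adapters.Postgres

  @replicas [MyApp.Repo.Replica1, MyApp.Repo.Replica2]

  # Writes go to the primary; reads can pick a random replica.
  def replica, do: Enum.random(@replicas)

  for repo <- @replicas do
    defmodule repo do
      use Ecto.Repo,
        otp_app: :my_app,
        adapter: Ecto.Adapters.Postgres,
        read_only: true
    end
  end
end
```

Writes then go through MyApp.Repo, while reads that can tolerate replication lag go through MyApp.Repo.replica().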
`:pool_size` - the size of the pool used by the connection module. Defaults to `10`.
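
For reference, the pool is configured on the repo; a minimal sketch with illustrative values (the app/repo names are hypothetical, and `queue_target`/`queue_interval` are DBConnection's checkout-queue knobs):

```elixir
# config/config.exs — illustrative values, not recommendations.
config :dau, DAU.Repo,
  pool_size: 20,        # connections held open per running instance
  queue_target: 50,     # ms a checkout may wait before the pool is considered slow
  queue_interval: 1000  # ms window over which queue_target is evaluated
```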

Max number of connections Postgres can support

You can often support more concurrent users by reducing the number of database connections and using some form of connection pooling.

See the section "How to Find the Optimal Database Connection Pool Size".
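
For reference, the heuristic commonly cited in such discussions (it originates from the PostgreSQL wiki's pool-sizing guidance) is `pool_size = (core_count * 2) + effective_spindle_count`, i.e. the optimal pool is usually far smaller than the number of concurrent users, since each connection can only make progress on one core at a time.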

Instructions on how to run pgbench for profiling insert operations - https://github.com/tattle-made/feluda/wiki/Optimization#testing-insert-performance