MaibornWolff / database-performance-comparison

Some improvements for PostgreSQL and PostgreSQL-compatible databases

FranckPachot opened this issue · comments

Hi, I was surprised by the results on CockroachDB and YugabyteDB YSQL. It may seem obvious that the Cassandra API is better suited to intensive writes, but this IoT ingest use case has no cross-node transactions: only one table, no secondary indexes. I don't expect a huge difference between the two APIs. Also, the numbers look really slow, especially on a cluster with 3x32 vCPU. I've run the same workload on my laptop, with YugabyteDB 2.7 and only one session, and that's 10000 rows per second inserted. This gives an idea of what the database engine can do, without external components like the network.

I looked at the code and there are a few things that should be done differently on PostgreSQL and PostgreSQL-compatible databases. I've implemented them for unit testing in a Jupyter notebook, where I've commented on the changes. I can also suggest a PR for them.

In summary:

  • You are defining the primary key with the old SERIAL, but Standard SQL sequence generators are recommended in current versions and allow using a large cache, which is always good, especially in distributed databases. The gain is huge when the batch size is not very large. More on this: https://dev.to/yugabyte/uuid-or-cached-sequences-42fi
  • On distributed databases we need more control over the sharding method, so I've added a parameter to specify HASH / RANGE sharding
  • You generate a synthetic primary key. In real-life IoT, the primary key stores the rows as they will be queried (usually hash on device_id and range on timestamp); otherwise a secondary index would be needed, which slows down the data ingest with cross-shard transactions (see the DDL sketch after this list)
  • psycopg2, used here, is an old driver which doesn't support prepared statements and sends queries as text for each call. At least it has a call for COPY, which brings the most important improvement
  • The most important: fast load in PostgreSQL should use COPY rather than INSERT. A list of values is better than single-row commands, but still not optimal, especially without prepared statements, where the backend has to parse it each time (see the COPY sketch after this list)
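
To make the first points concrete, here is a minimal DDL sketch of the cached sequence and the IoT-style primary key. The table and column names (events, device_id, ts, payload) are assumptions for illustration, not the repo's actual schema, and the HASH/ASC clause in the primary key is YugabyteDB YSQL syntax:

```python
import psycopg2

# Connection parameters are placeholders; 5433 is YugabyteDB's default YSQL port.
conn = psycopg2.connect("host=localhost port=5433 dbname=iot user=yugabyte")
with conn.cursor() as cur:
    # Standard SQL sequence with a large cache instead of SERIAL: each
    # session reserves 1000 values at once, avoiding a round-trip to the
    # sequence on every insert.
    cur.execute("CREATE SEQUENCE event_id_seq CACHE 1000")
    # The primary key stores rows as they will be queried: hash on
    # device_id to spread writes across nodes, range (ASC) on the
    # timestamp so per-device time scans need no secondary index.
    cur.execute("""
        CREATE TABLE events (
            id        bigint DEFAULT nextval('event_id_seq'),
            device_id bigint,
            ts        timestamptz,
            payload   text,
            PRIMARY KEY (device_id HASH, ts ASC)
        )
    """)
conn.commit()
```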
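And a sketch of the COPY point, building the payload on the fly with psycopg2's copy_expert rather than going through a file or multi-row INSERTs (same hypothetical events table as above):

```python
import io
import psycopg2

def copy_rows(conn, rows):
    # Build the COPY payload in memory: tab-separated columns, one line
    # per row, so no intermediate file is needed.
    buf = io.StringIO()
    for device_id, ts, payload in rows:
        buf.write(f"{device_id}\t{ts}\t{payload}\n")
    buf.seek(0)
    with conn.cursor() as cur:
        # copy_expert streams the buffer through the COPY protocol in a
        # single command, so the backend parses the statement only once.
        cur.copy_expert("COPY events (device_id, ts, payload) FROM STDIN", buf)
    conn.commit()
```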

The run times in the notebook were on a 4 vCPU Ampere instance on Oracle Cloud free tier, running YugabyteDB 2.7 with replication factor 3, and with only one session, including network latency since the notebook runs in a different region. It gives an idea of the expected throughput per session. Of course the goal is not to give benchmark numbers, but to show how IoT data ingest should be implemented on PostgreSQL-compatible databases. Using COPY is a must. Cached sequences as well. And a primary key designed for IoT. Once done, this should scale high with multiple workers.

Hi @FranckPachot,

first of all, thank you very much for taking the time to analyze the code and for suggesting improvements, and also for all the effort you have put into setting up a detailed Jupyter notebook.
Regarding your suggestions for handling the primary key: good suggestion. In our tests we've seen that SERIAL does not have the best performance, especially in distributed databases, so a sequence generator should help. If you have the time and energy for a PR with these changes, I would appreciate it. If not, no worries, I will find some time to go over your notebook and incorporate the relevant changes.
As I'm not aware of any (widely used) Python Postgres driver besides psycopg2, do you have a suggestion there?
Regarding using COPY: I must confess I had not really thought about using COPY, as I had it filed away as only being useful for loading files into the database, but your approach of just putting together the data for COPY on the fly is very nice. Again, if you have the time for a PR, I would appreciate it.

Thanks again for taking the time. I appreciate it, as we had indeed hoped for better performance from CockroachDB and YugabyteDB YSQL. So we are very interested in improving that performance.

Thank you, I'll try to find some time to put that in a PR.
About drivers, there are many: psycopg3, pg8000, aiopg, asyncpg... I still have to test them on distributed databases.
What I've seen thanks to your tool is that psycopg2 is OK with COPY. But for batch_mode=false I expect much better results from prepared statements (a sketch of a SQL-level workaround follows).
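
Since psycopg2 has no protocol-level prepared-statement support, one way to approximate it is explicit SQL PREPARE/EXECUTE: the statement is parsed and planned once per session, and each call only binds values. A hedged sketch, reusing the hypothetical events table from the earlier examples:

```python
import psycopg2

conn = psycopg2.connect("host=localhost port=5433 dbname=iot user=yugabyte")
rows_to_insert = [
    (1, "2021-06-01T00:00:00+00", "sample payload"),
    (2, "2021-06-01T00:00:01+00", "sample payload"),
]
with conn.cursor() as cur:
    # SQL-level PREPARE: the INSERT is parsed once; each EXECUTE only
    # substitutes the parameter values.
    cur.execute("""
        PREPARE insert_event (bigint, timestamptz, text) AS
        INSERT INTO events (device_id, ts, payload) VALUES ($1, $2, $3)
    """)
    for row in rows_to_insert:
        cur.execute("EXECUTE insert_event (%s, %s, %s)", row)
conn.commit()
```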

Hi @swoehrl-mw, I've put the changes in PR #4

Thanks for the PR. I will look it over and merge it in the next few days.

Hi @FranckPachot, I've merged the PR. Thanks again for suggesting the optimizations and for taking the time to create the PR.
As soon as I find the time, I will rerun the tests for all PostgreSQL-compatible databases with your optimizations and update the results with the improved numbers.

Hi @FranckPachot, I finally got around to rerunning the tests and have updated the README with the new results. With your changes, YugabyteDB roughly doubles its speed.

I see this was still open, so I'm closing it. But if you want to run the benchmark again, you may see huge improvements with the latest version of YugabyteDB.
I'll also submit a PR to enable a few YugabyteDB improvements that make sense for IoT workloads.