ApsaraDB / PolarDB-for-PostgreSQL

A cloud-native database based on PostgreSQL developed by Alibaba Cloud.

Home Page: https://apsaradb.github.io/PolarDB-for-PostgreSQL/zh/


[Question] After testing, it feels much slower than PostgreSQL.

long2ice opened this issue

A very simple usage scenario: I run our integration tests locally. They take more than 60 seconds on PostgreSQL 11, but more than 130 seconds on PolarDB. Both are stand-alone instances deployed with Docker.

The difference shouldn't be this large, should it? Am I using it incorrectly? Are there any relevant parameters that can be tuned?

Hi @long2ice ~ Thanks for opening this issue! 🎉

Please make sure you have provided enough information for subsequent discussion.

We will get back to you as soon as possible. ❤️

@long2ice Hi, thanks for testing PolarDB-PG.

Can you describe your testing workload? Is it mainly DDL, DML, or something else? Also, what is the shared_buffers size of PostgreSQL 11?

@mrdrivingduck Hello, thanks for your quick reply! The testing workload is: create the database, insert a dataset, and then run the tests. Most of the test cases are DML. PostgreSQL's shared_buffers is 128MB; for PolarDB it is 2GB.

@long2ice Could you please add some timing logs around these three phases: [TIME] create database [TIME] insert data [TIME] run tests [TIME]? That way we can see which part is slow.

There could be other reasons. For example, in the PolarDB-PG container there are actually three database nodes running: one primary and two replicas, with synchronous_commit set to on. I'm not sure whether that is the problem.
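A minimal sketch of such phase timing, assuming the three phases are driven from a shell script (the `true` commands are placeholders for the actual test-suite commands):

```shell
#!/bin/sh
# Hypothetical timing wrapper: prints elapsed wall-clock seconds per phase.
phase() {
  name=$1; shift
  start=$(date +%s)
  "$@"                        # run the actual phase command
  end=$(date +%s)
  echo "[TIME] $name: $((end - start))s"
}

# Placeholders -- substitute the real commands of the test suite:
phase "create database" true   # e.g. createdb + migrations
phase "insert data"     true   # e.g. psql -f seed.sql
phase "run tests"       true   # e.g. the integration test runner
```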

polardb:
create database and run migration (DDL): 28s
init data: 22s
run all test: 65s
total: 115s

postgres:
create database and run migration (DDL): fast
init data: 13s
run all test: 55s
total: 68s

@long2ice

Try the following commands in the PolarDB-PG container:

Stop the two replica databases:

pg_ctl -D /var/polardb/replica_datadir1/ stop
pg_ctl -D /var/polardb/replica_datadir2/ stop

Drop the replication slots on the primary:

select pg_drop_replication_slot('replica1');
select pg_drop_replication_slot('replica2');
postgres@e086c61cd078:~$ pg_ctl -D /var/polardb/replica_datadir1/ stop
pg_ctl: PID file "/var/polardb/replica_datadir1/postmaster.pid" does not exist
Is server running?
postgres@e086c61cd078:~$ pg_ctl -D /var/polardb/replica_datadir2/ stop
pg_ctl: PID file "/var/polardb/replica_datadir2/postmaster.pid" does not exist
Is server running?

Maybe they are not running?
I used the polardb/polardb_pg_local_instance Docker image to deploy it.

@long2ice Can you run ps -ef to see if there are three process groups running? If there is only one, that's fine.

postgres@da9f3038df35:~$ ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
postgres       1       0  0 11:26 ?        00:00:00 /bin/bash ./docker-entrypoint.sh postgres
postgres      16       1  1 11:26 ?        00:00:00 /home/postgres/tmp_basedir_polardb_pg_1100_bld/bin/postgres -D /var/polardb/primary_datadir
postgres      17      16  0 11:26 ?        00:00:00 postgres(5432): logger  0
postgres      18      16  0 11:26 ?        00:00:00 postgres(5432): logger  1
postgres      19      16  0 11:26 ?        00:00:00 postgres(5432): logger  2
postgres      20      16  0 11:26 ?        00:00:00 postgres(5432): background flashback log inserter  
postgres      21      16  0 11:26 ?        00:00:00 postgres(5432): background flashback log writer  
postgres      23      16  0 11:26 ?        00:00:00 postgres(5432): polar worker process  
postgres      24      16  0 11:26 ?        00:00:00 postgres(5432): PSS dispatcher  
postgres      25      16  0 11:26 ?        00:00:00 postgres(5432): PSS dispatcher  
postgres      26      16  0 11:26 ?        00:00:00 postgres(5432): polar wal pipeliner  
postgres      28      16  0 11:26 ?        00:00:00 postgres(5432): checkpointer  
postgres      29      16  0 11:26 ?        00:00:00 postgres(5432): background writer  
postgres      30      16  0 11:26 ?        00:00:00 postgres(5432): walwriter  
postgres      31      16  1 11:26 ?        00:00:00 postgres(5432): background logindex writer  
postgres      32      16  0 11:26 ?        00:00:00 postgres(5432): autovacuum launcher  
postgres      33      16  0 11:26 ?        00:00:00 postgres(5432): stats collector  
postgres      34      16  0 11:26 ?        00:00:00 postgres(5432): TimescaleDB Background Worker Launcher  
postgres      35      16  0 11:26 ?        00:00:00 postgres(5432): logical replication launcher  
postgres      36      16  0 11:26 ?        00:00:00 postgres(5432): polar parallel bgwriter  
postgres      37      16  0 11:26 ?        00:00:00 postgres(5432): polar parallel bgwriter  
postgres      38      16  0 11:26 ?        00:00:00 postgres(5432): polar parallel bgwriter  
postgres      39      16  0 11:26 ?        00:00:00 postgres(5432): polar parallel bgwriter  
postgres      40      16  0 11:26 ?        00:00:00 postgres(5432): polar parallel bgwriter  
postgres      64       1  0 11:26 ?        00:00:00 tail -f /dev/null
postgres      65       0  0 11:27 pts/0    00:00:00 bash
postgres      77      65  0 11:27 pts/0    00:00:00 ps -ef

@long2ice It seems only the primary node is running. Can you drop the replication slots on the primary and run the test again?
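If it is unclear whether any slots exist, a standard PostgreSQL catalog query can list them first (a sketch; the slot names 'replica1'/'replica2' are taken from the earlier commands and may differ in your deployment):

```sql
-- List existing replication slots before dropping anything
SELECT slot_name, slot_type, active FROM pg_replication_slots;

-- Drop the slots only if they exist
SELECT pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE slot_name IN ('replica1', 'replica2');
```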

There may be no replication slots.
(screenshot of the slot query output)

Weird. Is your image up to date?

docker pull polardb/polardb_pg_local_instance

By default, there will be three nodes running inside the container.

I recreated the container, and now there are three nodes. Running the tests took 143s.
Then I stopped the replica nodes, removed the slots, and ran the tests again, which took 82s. That's faster, but still slower than PostgreSQL.

Which phase do these (82 - 68) seconds come from?

Looks like create database and migration, just DDL.

@long2ice OK. In a real benchmark scenario, the preparation time (table schema creation, data import) is not counted; we usually care about the TPS (transactions per second) or QPS (queries per second) on CRUD (DML). DDL usually cannot be executed concurrently, so it cannot measure the throughput of a system.

The benefit of shutting down the replicas is that for some DDL, the primary writes a WAL record and must wait until the replicas have read and replayed that record before it can move on. This incurs extra I/O and latency. The reason I put three nodes in our Docker container is so that some cluster-level features can be tried out easily, not for performance benchmarking.
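One way to observe this wait from the primary, sketched with standard PostgreSQL views (assuming the replicas are attached):

```sql
-- Replication progress per standby: a gap between sent_lsn and replay_lsn
-- means standbys have not yet replayed what the primary sent
SELECT application_name, state, sent_lsn, replay_lsn
FROM pg_stat_replication;

-- Whether commits wait for standbys at all
SHOW synchronous_commit;
```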

OK, thanks for your help!

@mrdrivingduck Hello, another finding after testing: the write speed is slow. Is there anything that can be optimized?

We know that writes can be a weak point of PolarDB-PG, especially INSERT. If you are importing data, use PostgreSQL's COPY syntax instead of INSERT.
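For example, a bulk-load sketch using COPY instead of row-by-row INSERT (the table and file names here are hypothetical):

```sql
-- Server-side COPY: one command instead of many INSERT round trips;
-- the file path is read on the database server
COPY items (id, name, price)
FROM '/tmp/items.csv'
WITH (FORMAT csv, HEADER true);

-- Or client-side via psql, reading the file on the client machine:
-- \copy items (id, name, price) FROM 'items.csv' WITH (FORMAT csv, HEADER true)
```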

I know, thanks for your reply!

Hello, happy new week! Sorry to bother you again. I found another strange phenomenon: after I import data into PolarDB, if I run a SELECT query immediately, the result may differ from the expected result. But if I wait several minutes and query again, everything works fine. What could the problem be? What I did was stop the replica nodes, remove the replica slots, and set polar_enable_shared_server = off and polar_enable_shm_aset = off. I found that changing these two options resolves some query-timeout problems.

These two parameters control the shared server capability; changing them requires a database restart. Please set them to 'off' and restart the database to verify whether the inconsistency issue still persists. That said, the data consistency problem does not seem closely related to these two parameters. When you observe the two different query time points, check whether there is any difference in the background processes and whether a replay-like operation is taking place.
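To confirm which values are actually in effect after the restart, a sketch using standard PostgreSQL commands (the two GUCs themselves are PolarDB-specific):

```sql
-- Values currently in effect for the running server
SHOW polar_enable_shared_server;
SHOW polar_enable_shm_aset;

-- Whether a pending change still requires a restart to apply
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name IN ('polar_enable_shared_server', 'polar_enable_shm_aset');
```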

(screenshot showing both parameters set to off)

They are both already off, but the inconsistency issue still persists. So I wonder whether there are other options that have an effect.

Have you restarted the database after completing the parameter settings?

Yes, of course. I use pg_ctl -D /var/polardb/primary_datadir/ restart

At the moment when the data inconsistency arises during the query, could you run the ps -ef command to check the background process activity?

OK. At the moment when the data is consistent, please also execute the ps -ef command to observe the background process activity. Let's compare the two to see whether there is a process that might be blocking data visibility.

OK, that's it.

(screenshot of ps -ef output)

Looks like a transaction connection.

Yes, the issue is likely caused by this user process. Do you know what operation it is executing? This process is probably performing some sort of DML on the queried data (akin to the data not yet being committed). Alternatively, you could use gdb to examine what this process is doing.

My workflow is: create the database, seed the data (execute a large SQL file in a transaction), and run migrations to keep the tables up to date (some DDL). I would expect that once the program finishes, the database actions have also finished. That doesn't seem to be the case.

Update: it looks unrelated to the transaction connection. I tried again with no transaction connection, but it still failed.

Could you send over the operational steps so we can verify them later? Right now I think the most likely possibility is that the workflow considers the data write complete before the transaction is committed.
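A sketch of ruling that out from psql: wrap the seed in an explicit transaction and only read after COMMIT returns (the file and table names are hypothetical):

```sql
BEGIN;
\i seed.sql          -- the large seed file runs inside this transaction
COMMIT;              -- do not start reading until this returns successfully

-- Only after COMMIT should the row count reflect the seeded data
SELECT count(*) FROM items;
```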

@long2ice Do you have minimal reproduction SQL and steps based on a container started from the Docker image? We can deploy a container from the same image, which makes it easier to reproduce.

I can't reproduce it with a minimal program. It seems to happen only when there's a lot of data or DDL.

Finally, we resolved it by turning off the preread-related settings.

Which settings did you change exactly? I'm curious.