2ndQuadrant / pglogical

Logical Replication extension for PostgreSQL 15, 14, 13, 12, 11, 10, 9.6, 9.5, 9.4 (Postgres), providing much faster replication than Slony, Bucardo or Londiste, as well as cross-version upgrades.

Home Page: http://2ndquadrant.com/en/resources/pglogical/

Is this library even maintained anymore?

nitinsh99 opened this issue · comments

Not pointing any fingers, since it's a great open-source tool, but I'm just trying to get confirmation that this library is worth looking into, given the huge issue backlog and the lack of response from any of the maintainers. It makes me a bit nervous to consider this library for our production use. I am also one of the people running into issues while using this library with AWS RDS: the subscriber simply exits with code 1, and all I see in the logs is the following:

```
2023-01-28 01:28:42 UTC:host(31560):pguser@db:[2495]:LOG: could not receive data from client: Connection reset by peer
2023-01-28 01:28:42 UTC:host(31560):pguser@db:[2495]:LOG: unexpected EOF on client connection with an open transaction
2023-01-28 01:28:42 UTC::@:[367]:LOG: worker process: pglogical apply 367023:976906607 (PID 2449) exited with exit code 1
2023-01-28 01:28:52 UTC::@:[370]:LOG: checkpoint starting: time
2023-01-28 01:28:56 UTC::@:[370]:LOG: checkpoint complete: wrote 40 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=3.943 s, sync=0.003 s, total=3.959 s; sync files=26, longest=0.003 s, average=0.001 s; distance=62 kB, estimate=62 kB
```

Multiple people have reported this, and there is nothing in the logs to investigate further.

AFAICS pglogical is in maintenance mode. Regarding your question, it seems the connection between nodes was interrupted. Check your network. You should also check the PostgreSQL parameters related to networking (those whose names start with tcp).
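For reference, the TCP-related settings mentioned above can be inspected on both nodes with a standard catalog query like this (these are stock PostgreSQL parameter names; the replication timeouts are included because they also terminate idle connections between nodes):

```sql
-- Inspect TCP keepalive and replication timeout settings.
-- A setting of 0 generally means "use the operating system default".
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('tcp_keepalives_idle',
               'tcp_keepalives_interval',
               'tcp_keepalives_count',
               'tcp_user_timeout',        -- PostgreSQL 12+
               'wal_sender_timeout',      -- provider side
               'wal_receiver_timeout');   -- subscriber side
```

Comparing these values between the provider and subscriber (and against any intermediate load balancer's idle timeout) can explain unexpected "Connection reset by peer" errors.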

I have spent hours looking for network issues. I ended up making our master and replica db completely public in AWS and still getting the issue. Lack of any useful logs is the main issue I guess.

In the past year I've migrated a handful of tables from a self-managed postgres cluster in EC2 to RDS (aurora) using pglogical, and a few times I ran into errors with symptoms similar to this in the logs. Each time I hit crash loops on the subscriber, the logs looked similar, and the problems ended up being caused either by configuration issues / unsupported use of pglogical (especially around partitioning) or by bugs that appeared with particular combinations of postgres and pglogical versions.

Although the error message just says "Connection reset by peer", this might be unrelated to the network; it could instead be a bug, a configuration issue, or an unsupported use case[1]. pglogical behavior can be very sensitive to both the version of the extension and the version of postgres it is installed on, and in my experience minor/patch version bumps of either can break things. This is often compounded on RDS, since you don't have as much control over the exact versions of the extension and postgres.
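Given that version sensitivity, it's worth recording the exact versions on both nodes before debugging further; these standard catalog queries do that (nothing here is pglogical-specific beyond the extension name):

```sql
-- Exact server version on each node.
SELECT version();
SHOW server_version;

-- Installed pglogical extension version on this node.
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'pglogical';

-- What an ALTER EXTENSION ... UPDATE would install, if anything newer is available.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name = 'pglogical';
```

Run these on both the provider and the subscriber; mismatches (or a recent minor engine bump) are often the first clue.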

Anecdotally, RDS logs sometimes redact or truncate useful error information. In the past I have had success root-causing pglogical errors in RDS by setting up my own testing environment and seeing richer information in the logs there. For instance, I was able to trace down a bug occurring with RDS aurora version 11.9 and pglogical 2.2.2 by reproducing the error (fixed in pglogical in #295 / c34d52f + c78abe9; alternatively, disabling pglogical.batch_insert would have worked around it on 2.2.2).
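As a sketch of that workaround: pglogical.batch_insert is a subscriber-side setting, and on a self-managed instance it could be disabled roughly like this (on RDS you would set it through the instance parameter group instead; depending on the pglogical version, a restart rather than a reload may be required for it to take effect):

```sql
-- On the subscriber: disable batched inserts during table sync/apply.
ALTER SYSTEM SET pglogical.batch_insert = off;
SELECT pg_reload_conf();           -- a restart may be needed on some versions

-- Verify the active value.
SHOW pglogical.batch_insert;
```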

But almost as soon as I had figured that out, a minor engine version update was applied to the cluster I was working with, which pulled in a new extension version (2.4.0) that fixed that issue but caused a separate crash with slightly different observed behavior. This other issue ended up being related to postgres/postgres@84f5c29, which was pulled into minor releases of almost all supported versions of postgres at the time (11.13, 12.8, 13.4). It affected any minor versions of 11, 12, or 13 at least as new as those releases when pglogical 2.4.0 or older was installed on the subscriber (specifically when using the setting pglogical.use_spi = 1). Again, setting up my own test environment was helpful, since there I could see "cannot execute SQL without an outer snapshot or portal" in the logs (the RDS logs redacted this for reasons unknown), which lined up with the release notes comment on this change at https://www.postgresql.org/docs/11/release-11-13.html:

Some extensions may attempt to execute SQL code outside of any Portal. They are responsible for ensuring that an outer snapshot exists before doing so. Previously, not providing a snapshot might work or it might not; now it will consistently fail with “cannot execute SQL without an outer snapshot or portal”.
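If you suspect this particular incompatibility, the subscriber-side setting named above can be checked directly; passing true as the second argument to current_setting makes it return NULL instead of raising an error when the extension's GUC is not defined on that node:

```sql
-- Check whether the subscriber applies changes via SPI,
-- the code path affected by the snapshot change in 11.13 / 12.8 / 13.4.
SELECT current_setting('pglogical.use_spi', true) AS use_spi;
```

A non-null value of on/1 on a subscriber running pglogical 2.4.0 or older against one of those postgres minor versions is the combination described above.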

That incompatibility between postgres and pglogical was fixed in pglogical 2.4.1. I didn't actually test 2.4.1 directly on RDS, since that extension version was not yet available there; instead I found a workaround for my use case that involved setting up another instance with pglogical 2.4.1 in the middle and cascading pglogical replication.
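A cascading setup along those lines might look roughly like the following. The node names and DSNs are made up for illustration; pglogical.create_node and pglogical.create_subscription are the standard pglogical 2.x management functions, and the key idea is that the middle instance acts as both a subscriber (to the source) and a provider (to the final subscriber):

```sql
-- On the middle instance (running pglogical 2.4.1): subscribe to the source.
SELECT pglogical.create_node(
    node_name := 'middle',
    dsn       := 'host=middle.example.com dbname=app'
);
SELECT pglogical.create_subscription(
    subscription_name := 'from_source',
    provider_dsn      := 'host=source.example.com dbname=app'
);

-- On the final subscriber (e.g. RDS): subscribe to the middle instance
-- instead of directly to the source.
SELECT pglogical.create_node(
    node_name := 'subscriber',
    dsn       := 'host=rds.example.com dbname=app'
);
SELECT pglogical.create_subscription(
    subscription_name := 'from_middle',
    provider_dsn      := 'host=middle.example.com dbname=app'
);
```

This keeps the buggy postgres/pglogical combination out of the direct replication path, at the cost of running one extra instance for the duration of the migration.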

It's probably also a good idea to contact AWS support about this, since there may be an RDS-specific issue happening here too. But even if you do, I still recommend attempting to reproduce the issue outside of RDS with exactly the same postgres and pglogical versions and the same configuration on the source db and the subscriber; if the same issue occurs there, check all the logs to see whether they give more information than the RDS logs.

[1] If you are migrating to or from a partitioned schema, then depending on exactly what you are trying to do, it may be possible to use pglogical for the migration, but partitioned tables are generally not explicitly supported, so YMMV: https://www.2ndquadrant.com/en/blog/pg-phriday-pglogical-postgres-10-partitions/