Running CREATE INDEX CONCURRENTLY on subscriber can break subscription (subscriber: postgres 15.6 + pglogical 2.4.4) - error message in logs: could not read block 0 in file ""base/[DB_OID]/[IDX_RELATION_FILENODE]"": read only 0 of 8192 bytes

Question

Running CREATE INDEX CONCURRENTLY on subscriber can break subscription (subscriber: postgres 15.6 + pglogical 2.4.4) - error message in logs: could not read block 0 in file ""base/[DB_OID]/[IDX_RELATION_FILENODE]"": read only 0 of 8192 bytes

aristocrates opened this issue 2 months ago · comments

Nicholas (Nick) Meyer commented 2 months ago

I don't have a minimal repro for this right now, but the symptom I have observed on a subscriber that I am syncing for an upgrade is:

(context) I have modified the subscriber schema to just have primary key indexes and no constraints for the initial data sync - plan is to create the other indexes concurrently on the subscriber once all tables have synced, and then after index creation to create all foreign key constraints (with NOT VALID followed by VALIDATE CONSTRAINT)
But, on some indexes in the desired final schema, running CREATE UNIQUE INDEX CONCURRENTLY [options]; on the subscriber results in the subscription going status: down with a log line similar to the following in the CSV log:

2024-04-02 15:16:17.297 PDT,,"$DBNAME",4712,,6608afb3.1268,3738,"",2024-03-30 17:34:59 PDT,7/184779565,0,ERROR,XX001,"could not read block 0 in file ""base/16400/105657687"": read only 0 of 8192 bytes",,,,,"apply INSERT from remote relation $TABLE_NAME in commit before 45415/83927478, xid 197646281 committed at 1999-12-31 16:12:45.411377-08 (action #2) from node replorigin 1",,,,"pglogical apply 16400:988484226","pglogical apply 16400:988484226",,0

In this particular example, 105657687 is indeed the result of calling pg_relation_filenode($INDEX_NAME) on the currently-INVALID $INDEX_NAME that is being created concurrently at the time the subscription goes down:

$DBNAME=# select pg_relation_filenode($INDEX_NAME); -- select oid from pg_database where datname = $DBNAME; yields 16400
 pg_relation_filenode
----------------------
            105657687
(1 row)

postgres@[host]:/path/to/data/base/16400$ ls -alh 105657687
-rw------- 1 postgres postgres 0 Apr  2 15:16 105657687

The subscriber is on postgres 15.6 and pglogical 2.4.4

A workaround that seems to work fine is to just create the index non-concurrently, in which case pglogical apply on the subscriber blocks on obtaining a RowExclusiveLock until the create index finishes, and the replication slot stays put on the provider.

Based on #344 (comment) (and other guides around using pglogical on DBs with large tables) the expected behavior is for CREATE INDEX CONCURRENTLY to work fine on the subscriber:

You can always create indexes after the initial load. If you are concerned about index build that locks out writes, you can always use CONCURRENTLY.

Nicholas (Nick) Meyer · Answer 1 · Thu Apr 11 2024 07:00:45 GMT+0800 (China Standard Time)

I have not noticed this on all indexes, just on a handful. I would be surprised if this happened every time an update, insert, or delete replicated over while the concurrent index creation was happening, since I do not recall running into this on other DB upgrades between the same postgres versions, but the behavior replicated when I retried index creation on these particular indexes, even after re-initializing the subscriber from scratch on a new machine.