reorg / pg_repack

Reorganize tables in PostgreSQL databases with minimal locks

LOCK TABLE can only be used in transaction blocks

ZhangHien opened this issue

I'm sorry that English is not my first language. I will try my best to describe this issue.

I am using pg_repack 1.4.8 to repack a partitioned table which has 100 partitions in PG 14.10:
pg_repack -Utest_user --dbname=ddl_full0 -k --parent-table=part_range_2_level.t1

The first few partitions were repacked successfully, but I got an ERROR and a WARNING when repacking the partition t1_p_0_p_2:

# ok
INFO: repacking table "part_range_2_level.t1_p_0_p_0"
INFO: repacking table "part_range_2_level.t1_p_0_p_1"
INFO: repacking table "part_range_2_level.t1_p_0_p_10"
NOTICE: Waiting for 1 transactions to finish. First PID: 28374

# ERROR and WARNING
INFO: repacking table "part_range_2_level.t1_p_0_p_2"
NOTICE: Canceled 1 unsafe queries
WARNING:  SET LOCAL can only be used in transaction blocks
WARNING: ERROR:  LOCK TABLE can only be used in transaction blocks

# Ignore the error above and repack other partitions
INFO: repacking table "part_range_2_level.t1_p_0_p_3"
NOTICE: Waiting for 6 transactions to finish. First PID: 10030218
INFO: repacking table "part_range_2_level.t1_p_0_p_4"
......

I found that pg_repack tried to take an ACCESS SHARE lock on partition t1_p_0_p_2 twice. The first attempt failed with a lock timeout and was retried. The second attempt failed with the error LOCK TABLE can only be used in transaction blocks:

# first lock request, fail with timeout
2024-02-23 11:24:50.837 UTC 30096 ddl_full0  55P03 0 ERROR:  canceling statement due to lock timeout
2024-02-23 11:24:50.837 UTC 30096 ddl_full0  55P03 0 STATEMENT:  LOCK TABLE part_range_2_level.t1_p_0_p_2 IN ACCESS SHARE MODE

# second lock request, fail with ERROR: LOCK TABLE can only be used in transaction blocks
2024-02-23 11:24:50.984 UTC 30096 ddl_full0  25P01 0 WARNING:  SET LOCAL can only be used in transaction blocks
2024-02-23 11:24:51.018 UTC 30096 ddl_full0  25P01 0 ERROR:  LOCK TABLE can only be used in transaction blocks
2024-02-23 11:24:51.018 UTC 30096 ddl_full0  25P01 0 STATEMENT:  LOCK TABLE part_range_2_level.t1_p_0_p_2 IN ACCESS SHARE MODE

Maybe the txn was committed or rolled back before the second lock request? The audit log analysis is as follows. After the first lock request failed with timeout, the SERIALIZABLE txn was rolled back. When it tried to lock the table the second time, it was not inside a txn because the txn was already rolled back before.

# The 6th column in audit.log is the SQLSTATE error code: 00000 means success, any other value means an error.
# begin a txn
2024-02-23 11:24:50.450 UTC 30096 ddl_full0  00000 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ BEGIN ISOLATION LEVEL SERIALIZABLE
2024-02-23 11:24:50.551 UTC 30096 ddl_full0  00000 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ SELECT set_config('work_mem', current_setting('maintenance_work_mem'), true)
2024-02-23 11:24:50.631 UTC 30096 ddl_full0  00000 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ DELETE FROM repack.log_27089

# lock table failed with timeout
2024-02-23 11:24:50.670 UTC 30096 ddl_full0  00000 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ SET LOCAL lock_timeout = 100
2024-02-23 11:24:50.837 UTC 30096 ddl_full0  55P03 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ LOCK TABLE part_range_2_level.t1_p_0_p_2 IN ACCESS SHARE MODE

# rollback the txn
2024-02-23 11:24:50.837 UTC 30096 ddl_full0  00000 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ ROLLBACK

# try to lock table again, but this time we are not inside a txn block
2024-02-23 11:24:50.984 UTC 30096 ddl_full0  00000 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ SET LOCAL lock_timeout = 200
2024-02-23 11:24:51.018 UTC 30096 ddl_full0  25P01 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ LOCK TABLE part_range_2_level.t1_p_0_p_2 IN ACCESS SHARE MODE
2024-02-23 11:24:51.018 UTC 30096 ddl_full0  00000 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ RESET lock_timeout
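
To scan an audit log like the one above for failing statements, the error-code column can be pulled out mechanically. The following is a minimal Python sketch, assuming the whitespace-separated layout seen in the excerpt (date, time, time zone, PID, database, SQLSTATE, one more numeric field, then the message). The names `AuditEntry`, `parse_audit_line` and `failed_statements` are made up for illustration and are not part of pg_repack or PostgreSQL.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class AuditEntry(NamedTuple):
    timestamp: str
    pid: int
    database: str
    sqlstate: str
    message: str


def parse_audit_line(line: str) -> Optional[AuditEntry]:
    """Split one audit-log line into fields; return None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Assumed layout: date time tz pid db sqlstate extra message...
    parts = line.split(None, 7)
    if len(parts) < 8:
        return None
    date, time, _tz, pid, database, sqlstate, _extra, message = parts
    return AuditEntry(f"{date} {time}", int(pid), database, sqlstate, message)


def failed_statements(lines: Iterable[str]) -> Iterator[AuditEntry]:
    """Yield entries whose SQLSTATE is not 00000."""
    for line in lines:
        entry = parse_audit_line(line)
        if entry is not None and entry.sqlstate != "00000":
            yield entry


# Lines copied from the excerpt above.
sample = [
    "# lock table failed with timeout",
    "2024-02-23 11:24:50.670 UTC 30096 ddl_full0  00000 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ SET LOCAL lock_timeout = 100",
    "2024-02-23 11:24:50.837 UTC 30096 ddl_full0  55P03 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ LOCK TABLE part_range_2_level.t1_p_0_p_2 IN ACCESS SHARE MODE",
    "2024-02-23 11:24:50.837 UTC 30096 ddl_full0  00000 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ ROLLBACK",
    "2024-02-23 11:24:51.018 UTC 30096 ddl_full0  25P01 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ LOCK TABLE part_range_2_level.t1_p_0_p_2 IN ACCESS SHARE MODE",
]

failed = list(failed_statements(sample))
for entry in failed:
    print(entry.timestamp, entry.sqlstate, entry.message)
```

In this sample the two flagged codes are 55P03 (lock_not_available, the timeout) and 25P01 (no_active_sql_transaction, the LOCK TABLE issued outside a transaction block).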

To fix this issue, maybe we should not roll back the txn on timeout, or we should start a new txn before retrying the lock request? pg_repack already does the latter when locking in ACCESS EXCLUSIVE mode; we just need to do the same when locking in ACCESS SHARE mode in repack_one_table() and lock_access_share().

Consider the following case of locking in ACCESS EXCLUSIVE mode. If a lock request fails, pg_repack rolls back the txn and starts a new txn before the next lock request, so the error LOCK TABLE can only be used in transaction blocks does not occur.

2024-02-23 11:26:51.137 UTC 30096 ddl_full0  00000 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ BEGIN ISOLATION LEVEL READ COMMITTED
2024-02-23 11:26:51.137 UTC 30096 ddl_full0  00000 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ SET LOCAL lock_timeout = 100
2024-02-23 11:26:51.338 UTC 30096 ddl_full0  55P03 2619 LOG:  statement: /*pg_catalog, pg_temp, public*/ LOCK TABLE part_range_2_level.t1_p_0_p_4 IN ACCESS EXCLUSIVE MODE
2024-02-23 11:26:51.338 UTC 30096 ddl_full0  00000 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ ROLLBACK
2024-02-23 11:26:51.338 UTC 30096 ddl_full0  00000 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ BEGIN ISOLATION LEVEL READ COMMITTED
2024-02-23 11:26:51.338 UTC 30096 ddl_full0  00000 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ SET LOCAL lock_timeout = 200
2024-02-23 11:26:51.671 UTC 30096 ddl_full0  55P03 2620 LOG:  statement: /*pg_catalog, pg_temp, public*/ LOCK TABLE part_range_2_level.t1_p_0_p_4 IN ACCESS EXCLUSIVE MODE
2024-02-23 11:26:51.671 UTC 30096 ddl_full0  00000 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ ROLLBACK
2024-02-23 11:27:00.209 UTC 30096 ddl_full0  00000 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ BEGIN ISOLATION LEVEL READ COMMITTED
2024-02-23 11:27:00.209 UTC 30096 ddl_full0  00000 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ SET LOCAL lock_timeout = 1000
2024-02-23 11:27:01.343 UTC 30096 ddl_full0  55P03 2650 LOG:  statement: /*pg_catalog, pg_temp, public*/ LOCK TABLE part_range_2_level.t1_p_0_p_4 IN ACCESS EXCLUSIVE MODE
2024-02-23 11:27:01.343 UTC 30096 ddl_full0  00000 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ ROLLBACK
2024-02-23 11:27:01.343 UTC 30096 ddl_full0  00000 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ BEGIN ISOLATION LEVEL READ COMMITTED
2024-02-23 11:27:01.343 UTC 30096 ddl_full0  00000 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ SET LOCAL lock_timeout = 1000
2024-02-23 11:27:02.476 UTC 30096 ddl_full0  55P03 2653 LOG:  statement: /*pg_catalog, pg_temp, public*/ LOCK TABLE part_range_2_level.t1_p_0_p_4 IN ACCESS EXCLUSIVE MODE
2024-02-23 11:27:02.476 UTC 30096 ddl_full0  00000 0 LOG:  statement: /*pg_catalog, pg_temp, public*/ ROLLBACK
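
The pattern in this log (BEGIN, SET LOCAL, LOCK TABLE, and ROLLBACK on timeout, repeated) can be sketched as a retry loop. This is a toy Python model, not pg_repack's C code: `FakeSession` only tracks whether a transaction block is open, `lock_with_retry` is a hypothetical name, and the timeout schedule (growing per attempt, capped at 1000 ms) is an assumption chosen to be consistent with the values 100, 200 and 1000 seen above.

```python
class LockTimeout(Exception):
    """Stands in for SQLSTATE 55P03 (lock_not_available)."""


class NoActiveTransaction(Exception):
    """Stands in for SQLSTATE 25P01 (no_active_sql_transaction)."""


class FakeSession:
    """Toy connection: tracks only transaction state and LOCK attempts."""

    def __init__(self, lock_free_after: int):
        self.in_txn = False
        self.lock_attempts = 0
        self.lock_free_after = lock_free_after  # LOCK succeeds on this attempt
        self.log = []

    def execute(self, sql: str) -> None:
        self.log.append(sql)
        if sql.startswith("BEGIN"):
            self.in_txn = True
        elif sql == "ROLLBACK":
            self.in_txn = False
        elif sql.startswith("LOCK TABLE"):
            if not self.in_txn:
                raise NoActiveTransaction(
                    "LOCK TABLE can only be used in transaction blocks")
            self.lock_attempts += 1
            if self.lock_attempts < self.lock_free_after:
                raise LockTimeout("canceling statement due to lock timeout")


def lock_with_retry(session, table: str, mode: str, max_tries: int = 10) -> bool:
    """Open a fresh transaction for every attempt, so LOCK TABLE is always in a block."""
    for attempt in range(1, max_tries + 1):
        session.execute("BEGIN ISOLATION LEVEL READ COMMITTED")
        timeout_ms = min(1000, attempt * 100)  # assumed backoff schedule
        session.execute(f"SET LOCAL lock_timeout = {timeout_ms}")
        try:
            session.execute(f"LOCK TABLE {table} IN {mode} MODE")
        except LockTimeout:
            session.execute("ROLLBACK")  # the next iteration issues BEGIN again
            continue
        session.execute("RESET lock_timeout")
        return True  # transaction stays open and holds the lock
    return False


session = FakeSession(lock_free_after=3)
locked = lock_with_retry(session, "part_range_2_level.t1_p_0_p_4", "ACCESS EXCLUSIVE")
print(locked)
print("\n".join(session.log))
```

Because every ROLLBACK is followed by a new BEGIN, the toy session never raises `NoActiveTransaction`, which mirrors why 25P01 does not appear in the ACCESS EXCLUSIVE log.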

I am sorry that I cannot provide a script that reliably reproduces the issue: I only got this error once and cannot reproduce it myself. When the error occurred during repacking, I was running concurrent DDL/DML on the same table with pgbench, and that concurrent DDL/DML could be why the first lock request in pg_repack failed.

The DDL to create the partitioned table and its partitions is as follows. This issue may not be related to partitioning, so a non-partitioned table might work just as well.

-- partitioned table
CREATE TABLE IF NOT EXISTS t1 (
    id INT PRIMARY KEY,  a TEXT
) PARTITION BY RANGE (id);
-- sub-partitions (both loops run 0..10 inclusive, so this creates 11 x 11 = 121 leaf partitions)
DO $$
BEGIN
    FOR i IN 0..10
    LOOP
        EXECUTE 'CREATE TABLE IF NOT EXISTS t1_p_' || i || ' PARTITION OF t1 FOR VALUES FROM (' || i * 10000 || ') TO (' || (i + 1) * 10000 || ')' || ' PARTITION BY RANGE (id);';
        FOR j IN 0..10
        LOOP
            EXECUTE 'CREATE TABLE IF NOT EXISTS t1_p_' || i || '_p_' || j ||' PARTITION OF t1_p_' || i || ' FOR VALUES FROM (' || i * 10000 + j * 1000 || ') TO (' || i * 10000 + (j + 1) * 1000 || ') WITH (toast_tuple_target = 128);';
        END LOOP;
    END LOOP;
END $$;
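
To see which partitions the DO block above produces, the names and bounds can be enumerated outside the database. This is a small Python sketch that mirrors the two loops and the bound arithmetic of the DDL; the function name `leaf_partitions` is made up for illustration.

```python
def leaf_partitions(count: int = 11):
    """Yield (name, lower, upper) for each leaf partition created by the DO block."""
    for i in range(count):      # FOR i IN 0..10 is inclusive: 11 values
        for j in range(count):  # FOR j IN 0..10 is inclusive: 11 values
            lower = i * 10000 + j * 1000
            upper = i * 10000 + (j + 1) * 1000
            yield f"t1_p_{i}_p_{j}", lower, upper


parts = list(leaf_partitions())
print(len(parts))
for name, lower, upper in parts[:3]:
    print(name, lower, upper)
```

Since PL/pgSQL integer FOR loops include both ends of the range, the block creates 11 x 11 leaf partitions, which is why a name such as t1_p_0_p_10 shows up in the pg_repack output. For j = 10 the computed lower bound equals the upper bound of the parent t1_p_i.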

Thank you for reporting the issue. I've reproduced it with a debugger.

> After the first lock request failed with timeout, the SERIALIZABLE txn was rolled back. When it tried to lock the table the second time, it was not inside a txn because the txn was already rolled back before.

This matches my analysis. In lock_access_share() we call kill_ddl() unconditionally and then execute LOCK TABLE %s IN ACCESS SHARE MODE. However, there is a window between the two calls in which another DDL statement that acquires an ACCESS EXCLUSIVE lock (AEL) on the table can start. In that case the LOCK TABLE times out, we roll back and try again, and the retry fails with the error LOCK TABLE can only be used in transaction blocks because the transaction has already been rolled back.
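
The failure sequence described here can be illustrated with a toy model. The Python sketch below is not a transcription of pg_repack's C code and says nothing about how the fix was implemented: `ToySession` only tracks transaction state and forces the first LOCK TABLE to time out, and `lock_access_share_sketch` is a hypothetical stand-in for a retry loop that rolls back on timeout without opening a new transaction.

```python
class SqlError(Exception):
    def __init__(self, sqlstate: str, message: str):
        super().__init__(message)
        self.sqlstate = sqlstate


class ToySession:
    """Tracks only transaction state; the first LOCK TABLE always times out."""

    def __init__(self):
        self.in_txn = False
        self.timed_out_once = False
        self.warnings = []

    def execute(self, sql: str) -> None:
        if sql.startswith("BEGIN"):
            self.in_txn = True
        elif sql == "ROLLBACK":
            self.in_txn = False
        elif sql.startswith("SET LOCAL"):
            if not self.in_txn:
                self.warnings.append(
                    "SET LOCAL can only be used in transaction blocks")
        elif sql.startswith("LOCK TABLE"):
            if not self.in_txn:
                raise SqlError(
                    "25P01", "LOCK TABLE can only be used in transaction blocks")
            if not self.timed_out_once:
                self.timed_out_once = True  # simulate the conflicting DDL
                raise SqlError("55P03", "canceling statement due to lock timeout")


def lock_access_share_sketch(session, table: str, max_tries: int = 3) -> str:
    """Retry loop that does NOT reopen the transaction after ROLLBACK."""
    for attempt in range(1, max_tries + 1):
        session.execute(f"SET LOCAL lock_timeout = {min(1000, attempt * 100)}")
        try:
            session.execute(f"LOCK TABLE {table} IN ACCESS SHARE MODE")
            return "locked"
        except SqlError as err:
            if err.sqlstate != "55P03":
                return f"failed: {err.sqlstate} {err}"
            session.execute("ROLLBACK")  # ends the caller's transaction
    return "gave up"


session = ToySession()
session.execute("BEGIN ISOLATION LEVEL SERIALIZABLE")  # opened once by the caller
result = lock_access_share_sketch(session, "part_range_2_level.t1_p_0_p_2")
print(result)
print(session.warnings)
```

In this model the second attempt runs outside a transaction block, so SET LOCAL produces the warning and LOCK TABLE fails with 25P01, the same pair of messages as in the report. Issuing BEGIN again after the ROLLBACK, as the reporter suggested, would keep the retry inside a transaction block.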

The issue was resolved by PR #401. I'm closing the issue; please feel free to reopen it if the problem persists.