Regression test failed for Toast index test.
japinli opened this issue
Japin Li commented
Hi,
When I tried to use OrioleDB on patches14 on Ubuntu 20.04, I encountered the following errors:
PATH="/tmp_install/home/japin/Codes/OrioleDB/build/orioledb/bin:/home/japin/Codes/oriole-extension:$PATH" LD_LIBRARY_PATH="/tmp_install/home/japin/Codes/OrioleDB/build/orioledb/lib:$LD_LIBRARY_PATH" PGCTLTIMEOUT=900 \
python3 -W ignore::DeprecationWarning -m unittest -v t/toast_index_test.py
test_checkpoint (t.toast_index_test.ToastIndexTest) ... 3.052 s ok
test_checkpoint_at_start (t.toast_index_test.ToastIndexTest) ...
Base directory: /tmp/toast_index_tgsn_loisqrj6
903.974 s ERROR
test_checkpoint_in_middle (t.toast_index_test.ToastIndexTest) ...
Base directory: /tmp/toast_index_tgsn_2ighpbt3
4.484 s ERROR
test_no_checkpoint (t.toast_index_test.ToastIndexTest) ...
Base directory: /tmp/toast_index_tgsn_5kf84hla
4.467 s ERROR
======================================================================
ERROR: test_checkpoint_at_start (t.toast_index_test.ToastIndexTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/japin/.local/lib/python3.8/site-packages/testgres/node.py", line 717, in start
exit_status, out, error = execute_utility(_params, self.utils_log_file, verbose=True)
File "/home/japin/.local/lib/python3.8/site-packages/testgres/utils.py", line 66, in execute_utility
exit_status, out, error = tconf.os_ops.exec_command(args, verbose=True)
File "/home/japin/.local/lib/python3.8/site-packages/testgres/operations/local_ops.py", line 94, in exec_command
raise ExecUtilException(message='Utility exited with non-zero code. Error `{}`'.format(error),
testgres.exceptions.ExecUtilException: Utility exited with non-zero code. Error `b'pg_ctl: server did not start in time\n'`
Command: ['/home/japin/Codes/OrioleDB/build/orioledb/bin/pg_ctl', '-D', '/tmp/toast_index_tgsn_loisqrj6/data', '-l', '/tmp/toast_index_tgsn_loisqrj6/logs/postgresql.log', '-w', 'start']
Exit code: 1
----
b'waiting for server to start....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... stopped waiting\n'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/japin/Codes/oriole-extension/t/toast_index_test.py", line 113, in test_checkpoint_at_start
node.start()
File "/home/japin/.local/lib/python3.8/site-packages/testgres/node.py", line 723, in start
raise_from(StartNodeException(msg, files), e)
File "<string>", line 3, in raise_from
testgres.exceptions.StartNodeException: Cannot start node
/tmp/toast_index_tgsn_loisqrj6/data/postgresql.conf
----
b"\nfsync = off\nmax_worker_processes = 10\nlog_statement = 'all'\nlisten_addresses = '127.0.0.1'\nport = 20066\n\nmax_replication_slots = 10\nmax_wal_senders = 10\n\nhot_standby = on\nwal_keep_size = 320\nwal_level = 'replica'\nshared_preload_libraries = orioledb\n\n"
/tmp/toast_index_tgsn_loisqrj6/data/postgresql.auto.conf
----
b'# Do not edit this file manually!\n# It will be overwritten by the ALTER SYSTEM command.\n'
/tmp/toast_index_tgsn_loisqrj6/data/pg_hba.conf
----
b'local all all trust\nhost all all 127.0.0.1/32 trust\nhost all all ::1/128 trust\nlocal replication all trust\nhost replication all 127.0.0.1/32 trust\nhost replication all ::1/128 trust\nlocal\treplication\tall\t\t\ttrust\nhost\treplication\tall\t127.0.0.1/32\ttrust\nhost\treplication\tall\t::1/128\t\ttrust\nhost\treplication\tall\t127.0.0.0/24\t\ttrust\nhost\tall\tall\t127.0.0.0/24\t\ttrust\n'
/tmp/toast_index_tgsn_loisqrj6/logs/postgresql.log
----
b'2023-11-14 03:46:52.706 GMT [570341] LOG: received immediate shutdown request\n2023-11-14 03:46:52.717 GMT [570341] LOG: database system is shut down\n2023-11-14 03:46:52.872 GMT [570379] LOG: OrioleDB public beta 3 started\n2023-11-14 03:46:52.874 GMT [570379] LOG: starting PostgreSQL 14.9 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, 64-bit\n2023-11-14 03:46:52.874 GMT [570379] LOG: listening on IPv4 address "127.0.0.1", port 20066\n2023-11-14 03:46:52.874 GMT [570379] LOG: listening on Unix socket "/tmp/.s.PGSQL.20066"\n2023-11-14 03:46:52.876 GMT [570380] LOG: database system was interrupted; last known up at 2023-11-14 03:46:52 GMT\n2023-11-14 03:46:52.877 GMT [570380] LOG: database system was not properly shut down; automatic recovery in progress\n2023-11-14 03:46:52.877 GMT [570381] LOG: orioledb background writer started\n2023-11-14 03:46:52.877 GMT [570380] LOG: orioledb recovery started.\n2023-11-14 03:46:52.879 GMT [570382] LOG: orioledb recovery worker 3 started.\n2023-11-14 03:46:52.880 GMT [570383] LOG: orioledb recovery worker 2 started.\n2023-11-14 03:46:52.880 GMT [570385] LOG: orioledb recovery worker 0 started.\n2023-11-14 03:46:52.881 GMT [570386] LOG: orioledb recovery worker 5 started.\n2023-11-14 03:46:52.881 GMT [570384] LOG: orioledb recovery worker 1 started.\n2023-11-14 03:46:52.882 GMT [570387] LOG: orioledb recovery worker 4 started.\n2023-11-14 03:46:52.883 GMT [570380] LOG: redo starts at 0/17F99F8\n2023-11-14 03:52:17.783 GMT [571106] FATAL: the database system is starting up\n2023-11-14 03:52:20.915 GMT [571114] FATAL: the database system is starting up\n2023-11-14 03:53:25.134 GMT [571555] FATAL: the database system is starting up\n'
======================================================================
ERROR: test_checkpoint_in_middle (t.toast_index_test.ToastIndexTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/japin/.local/lib/python3.8/site-packages/testgres/node.py", line 717, in start
exit_status, out, error = execute_utility(_params, self.utils_log_file, verbose=True)
File "/home/japin/.local/lib/python3.8/site-packages/testgres/utils.py", line 66, in execute_utility
exit_status, out, error = tconf.os_ops.exec_command(args, verbose=True)
File "/home/japin/.local/lib/python3.8/site-packages/testgres/operations/local_ops.py", line 94, in exec_command
raise ExecUtilException(message='Utility exited with non-zero code. Error `{}`'.format(error),
testgres.exceptions.ExecUtilException: Utility exited with non-zero code. Error `b'pg_ctl: could not start server\nExamine the log output.\n'`
Command: ['/home/japin/Codes/OrioleDB/build/orioledb/bin/pg_ctl', '-D', '/tmp/toast_index_tgsn_2ighpbt3/data', '-l', '/tmp/toast_index_tgsn_2ighpbt3/logs/postgresql.log', '-w', 'start']
Exit code: 1
----
b'waiting for server to start.... stopped waiting\n'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/japin/Codes/oriole-extension/t/toast_index_test.py", line 118, in test_checkpoint_in_middle
self.init_table(False)
File "/home/japin/Codes/oriole-extension/t/toast_index_test.py", line 15, in init_table
node.start()
File "/home/japin/.local/lib/python3.8/site-packages/testgres/node.py", line 723, in start
raise_from(StartNodeException(msg, files), e)
File "<string>", line 3, in raise_from
testgres.exceptions.StartNodeException: Cannot start node
/tmp/toast_index_tgsn_2ighpbt3/data/postgresql.conf
----
b"\nfsync = off\nmax_worker_processes = 10\nlog_statement = 'all'\nlisten_addresses = '127.0.0.1'\nport = 20066\n\nmax_replication_slots = 10\nmax_wal_senders = 10\n\nhot_standby = on\nwal_keep_size = 320\nwal_level = 'replica'\nshared_preload_libraries = orioledb\n\n"
/tmp/toast_index_tgsn_2ighpbt3/data/postgresql.auto.conf
----
b'# Do not edit this file manually!\n# It will be overwritten by the ALTER SYSTEM command.\n'
/tmp/toast_index_tgsn_2ighpbt3/data/pg_hba.conf
----
b'local all all trust\nhost all all 127.0.0.1/32 trust\nhost all all ::1/128 trust\nlocal replication all trust\nhost replication all 127.0.0.1/32 trust\nhost replication all ::1/128 trust\nlocal\treplication\tall\t\t\ttrust\nhost\treplication\tall\t127.0.0.1/32\ttrust\nhost\treplication\tall\t::1/128\t\ttrust\nhost\treplication\tall\t127.0.0.0/24\t\ttrust\nhost\tall\tall\t127.0.0.0/24\t\ttrust\n'
/tmp/toast_index_tgsn_2ighpbt3/logs/postgresql.log
----
b'2023-11-14 04:01:58.413 GMT [572689] LOG: OrioleDB public beta 3 started\n2023-11-14 04:01:58.415 GMT [572689] LOG: starting PostgreSQL 14.9 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, 64-bit\n2023-11-14 04:01:58.415 GMT [572689] LOG: could not bind IPv4 address "127.0.0.1": Address already in use\n2023-11-14 04:01:58.415 GMT [572689] HINT: Is another postmaster already running on port 20066? If not, wait a few seconds and retry.\n2023-11-14 04:01:58.415 GMT [572689] WARNING: could not create listen socket for "127.0.0.1"\n2023-11-14 04:01:58.415 GMT [572689] FATAL: could not create any TCP/IP sockets\n2023-11-14 04:01:58.440 GMT [572689] LOG: database system is shut down\n'
======================================================================
ERROR: test_no_checkpoint (t.toast_index_test.ToastIndexTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/japin/.local/lib/python3.8/site-packages/testgres/node.py", line 717, in start
exit_status, out, error = execute_utility(_params, self.utils_log_file, verbose=True)
File "/home/japin/.local/lib/python3.8/site-packages/testgres/utils.py", line 66, in execute_utility
exit_status, out, error = tconf.os_ops.exec_command(args, verbose=True)
File "/home/japin/.local/lib/python3.8/site-packages/testgres/operations/local_ops.py", line 94, in exec_command
raise ExecUtilException(message='Utility exited with non-zero code. Error `{}`'.format(error),
testgres.exceptions.ExecUtilException: Utility exited with non-zero code. Error `b'pg_ctl: could not start server\nExamine the log output.\n'`
Command: ['/home/japin/Codes/OrioleDB/build/orioledb/bin/pg_ctl', '-D', '/tmp/toast_index_tgsn_5kf84hla/data', '-l', '/tmp/toast_index_tgsn_5kf84hla/logs/postgresql.log', '-w', 'start']
Exit code: 1
----
b'waiting for server to start.... stopped waiting\n'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/japin/Codes/oriole-extension/t/toast_index_test.py", line 74, in test_no_checkpoint
self.init_table(False)
File "/home/japin/Codes/oriole-extension/t/toast_index_test.py", line 15, in init_table
node.start()
File "/home/japin/.local/lib/python3.8/site-packages/testgres/node.py", line 723, in start
raise_from(StartNodeException(msg, files), e)
File "<string>", line 3, in raise_from
testgres.exceptions.StartNodeException: Cannot start node
/tmp/toast_index_tgsn_5kf84hla/data/postgresql.conf
----
b"\nfsync = off\nmax_worker_processes = 10\nlog_statement = 'all'\nlisten_addresses = '127.0.0.1'\nport = 20066\n\nmax_replication_slots = 10\nmax_wal_senders = 10\n\nhot_standby = on\nwal_keep_size = 320\nwal_level = 'replica'\nshared_preload_libraries = orioledb\n\n"
/tmp/toast_index_tgsn_5kf84hla/data/postgresql.auto.conf
----
b'# Do not edit this file manually!\n# It will be overwritten by the ALTER SYSTEM command.\n'
/tmp/toast_index_tgsn_5kf84hla/data/pg_hba.conf
----
b'local all all trust\nhost all all 127.0.0.1/32 trust\nhost all all ::1/128 trust\nlocal replication all trust\nhost replication all 127.0.0.1/32 trust\nhost replication all ::1/128 trust\nlocal\treplication\tall\t\t\ttrust\nhost\treplication\tall\t127.0.0.1/32\ttrust\nhost\treplication\tall\t::1/128\t\ttrust\nhost\treplication\tall\t127.0.0.0/24\t\ttrust\nhost\tall\tall\t127.0.0.0/24\t\ttrust\n'
/tmp/toast_index_tgsn_5kf84hla/logs/postgresql.log
----
b'2023-11-14 04:02:02.879 GMT [572716] LOG: OrioleDB public beta 3 started\n2023-11-14 04:02:02.897 GMT [572716] LOG: starting PostgreSQL 14.9 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, 64-bit\n2023-11-14 04:02:02.897 GMT [572716] LOG: could not bind IPv4 address "127.0.0.1": Address already in use\n2023-11-14 04:02:02.897 GMT [572716] HINT: Is another postmaster already running on port 20066? If not, wait a few seconds and retry.\n2023-11-14 04:02:02.897 GMT [572716] WARNING: could not create listen socket for "127.0.0.1"\n2023-11-14 04:02:02.897 GMT [572716] FATAL: could not create any TCP/IP sockets\n2023-11-14 04:02:02.922 GMT [572716] LOG: database system is shut down\n'
----------------------------------------------------------------------
Ran 4 tests in 915.983s
FAILED (errors=3)
make: *** [Makefile:199: t/toast_index_test.py] Error 1
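A note on the two later failures: both logs report "Address already in use" for port 20066, and all three postgresql.conf dumps above set port = 20066, so these two errors are presumably a side effect of the first test's postmaster still holding the port while stuck in recovery. A minimal sketch for checking whether a port is still bound before rerunning the tests is shown below. The helper name `port_in_use` is hypothetical and is not part of testgres or the OrioleDB test suite.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding host:port fails, i.e. something already holds it.

    Hypothetical helper for illustration only; it is not part of testgres.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically EADDRINUSE when another process is bound to the port.
            return True
    return False


if __name__ == "__main__":
    # 20066 is the port shown in the postgresql.conf dumps above.
    print("port 20066 in use:", port_in_use(20066))
```

If this reports the port as busy after a failed run, a leftover postmaster from an earlier test is a likely cause, and only the first error would need to be investigated as a recovery problem.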
It seems the database has crashed and cannot complete recovery:
$ pg_controldata -D /tmp/toast_index_tgsn_loisqrj6/data/ | grep state
Database cluster state: in crash recovery
Is this a bug in TOAST index handling under OrioleDB?
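For anyone reproducing this, the cluster state check above can be scripted so that a test harness can detect a node stuck in crash recovery. The following is only a sketch: the function names `parse_cluster_state` and `cluster_state` are hypothetical, and it assumes `pg_controldata` is on `PATH` and prints English (non-localized) labels.

```python
import subprocess
from typing import Optional


def parse_cluster_state(controldata_output: str) -> Optional[str]:
    """Extract the 'Database cluster state' value from pg_controldata output."""
    for line in controldata_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Database cluster state":
            return value.strip()
    return None


def cluster_state(data_dir: str, pg_controldata: str = "pg_controldata") -> Optional[str]:
    """Run `pg_controldata -D data_dir` and return the reported cluster state."""
    result = subprocess.run(
        [pg_controldata, "-D", data_dir],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if pg_controldata fails
    )
    return parse_cluster_state(result.stdout)
```

For the data directory from the first failing test, `cluster_state("/tmp/toast_index_tgsn_loisqrj6/data")` would be expected to return the same "in crash recovery" string that the `grep` above shows.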
Japin Li commented
I found that two recovery workers are constantly spinning in find_page() loops.
top - 17:44:08 up 15 days, 7:12, 1 user, load average: 1.29, 1.98, 2.26
Tasks: 134 total, 3 running, 130 sleeping, 1 stopped, 0 zombie
%Cpu(s): 98.7 us, 1.2 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.2 si, 0.0 st
MiB Mem : 7945.5 total, 4343.3 free, 481.3 used, 3120.9 buff/cache
MiB Swap: 4096.0 total, 4093.7 free, 2.3 used. 7110.2 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
642278 japin 20 0 303056 11020 9180 R 97.4 0.1 13:20.12 postgres: orioledb recovery worker 1
642277 japin 20 0 303056 10684 8732 R 82.1 0.1 11:48.59 postgres: orioledb recovery worker 2
642274 japin 20 0 303644 10464 8444 S 18.5 0.1 2:50.71 postgres: startup recovering 00000001000000000+
642275 japin 20 0 302800 6060 4508 S 0.3 0.1 0:00.09 postgres: orioledb background writer
642276 japin 20 0 302840 6304 4772 S 0.3 0.1 0:01.03 postgres: orioledb recovery worker 3
642280 japin 20 0 302840 6304 4772 S 0.3 0.1 0:01.07 postgres: orioledb recovery worker 4
642281 japin 20 0 302840 6304 4772 S 0.3 0.1 0:01.02 postgres: orioledb recovery worker 5
642272 japin 20 0 302816 5844 4292 S 0.0 0.1 0:00.00 postgres: checkpointer
642273 japin 20 0 302520 4852 3484 S 0.0 0.1 0:00.01 postgres: background writer
642279 japin 20 0 303056 10516 8672 S 0.0 0.1 0:01.07 postgres: orioledb recovery worker 0
Here is the backtrace:
(gdb) bt
#0 find_page (context=0x7fffca171e10, key=0x7fffca171df0, keyType=BTreeKeyNonLeafKey, targetLevel=0) at src/btree/find.c:139
#1 0x00007f60c4921193 in refind_page (context=0x7fffca171e10, key=0x7fffca171df0, keyType=BTreeKeyNonLeafKey, level=0, _blkno=2149, _pageChangeCount=0) at src/btree/find.c:744
#2 0x00007f60c494ddeb in modify_undo_callback (location=2425000, baseItem=0x55eefaf4a480, oxid=57, abort=true, changeCountsValid=true) at src/btree/undo.c:424
#3 0x00007f60c49b884d in walk_undo_range (location=2425000, toLoc=2305843009213693952, buf=0x7fffca1777a0, oxid=57, abort_val=true, onCommitLocation=0x7fffca177748, changeCountsValid=true) at src/transam/undo.c:537
#4 0x00007f60c49b8e8e in walk_undo_stack (oxid=57, toLocation=0x0, abortTrx=true, changeCountsValid=true) at src/transam/undo.c:644
#5 0x00007f60c49b8fee in apply_undo_stack (oxid=57, toLocation=0x0, changeCountsValid=true) at src/transam/undo.c:684
#6 0x00007f60c498eac4 in recovery_finish_current_oxid (csn=2, ptr=23152733, worker_id=2, sync=false) at src/recovery/recovery.c:1195
#7 0x00007f60c4995871 in recovery_queue_process (queue=0x55eefaeec400, id=2) at src/recovery/worker.c:407
#8 0x00007f60c499504b in recovery_worker_main (main_arg=2) at src/recovery/worker.c:212
#9 0x000055eef9c78962 in StartBackgroundWorker () at /home/japin/Codes/OrioleDB/build/../src/backend/postmaster/bgworker.c:858
#10 0x000055eef9c83752 in do_start_bgworker (rw=0x55eefaee7e00) at /home/japin/Codes/OrioleDB/build/../src/backend/postmaster/postmaster.c:5839
#11 0x000055eef9c83b8d in maybe_start_bgworkers () at /home/japin/Codes/OrioleDB/build/../src/backend/postmaster/postmaster.c:6071
#12 0x000055eef9c82ab6 in sigusr1_handler (postgres_signal_arg=10) at /home/japin/Codes/OrioleDB/build/../src/backend/postmaster/postmaster.c:5220
#13 <signal handler called>
#14 0x00007f60c6b30f7a in __GI___select (nfds=7, readfds=0x7fffca178a20, writefds=0x0, exceptfds=0x0, timeout=0x7fffca178990) at ../sysdeps/unix/sysv/linux/select.c:41
#15 0x000055eef9c7e1ee in ServerLoop () at /home/japin/Codes/OrioleDB/build/../src/backend/postmaster/postmaster.c:1777
#16 0x000055eef9c7db0b in PostmasterMain (argc=3, argv=0x55eefaea02a0) at /home/japin/Codes/OrioleDB/build/../src/backend/postmaster/postmaster.c:1485
#17 0x000055eef9b76ce0 in main (argc=3, argv=0x55eefaea02a0) at /home/japin/Codes/OrioleDB/build/../src/backend/main/main.c:202
It seems find_page() could not find a valid btree page, but why doesn't it exit the loop?
I tried to kill the postmaster; however, the recovery worker 1, recovery worker 2, and startup processes could not be terminated.
$ kill 642271
$ ps -ef | grep postgres
japin 642271 1 0 17:28 ? 00:00:00 /home/japin/Codes/OrioleDB/build/orioledb/bin/postgres -D /tmp/toast_index_tgsn_kih3we61/data
japin 642272 642271 0 17:28 ? 00:00:00 postgres: checkpointer
japin 642274 642271 18 17:28 ? 00:04:25 postgres: startup recovering 000000010000000000000001
japin 642277 642271 77 17:28 ? 00:18:12 postgres: orioledb recovery worker 2
japin 642278 642271 84 17:28 ? 00:19:52 postgres: orioledb recovery worker 1
japin 645187 492271 0 17:52 pts/13 00:00:00 grep --color=auto postgres
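To make it easier to see which children survive the `kill`, the `ps -ef` output above can be filtered by parent PID. This is a rough sketch that assumes the standard procps column order (UID PID PPID C STIME TTY TIME CMD); the helper name `children_of` is hypothetical.

```python
from typing import List, Tuple


def children_of(ps_output: str, parent_pid: int) -> List[Tuple[int, str]]:
    """Return (pid, command) pairs for `ps -ef` rows whose PPID is parent_pid."""
    children = []
    for line in ps_output.splitlines():
        # Split into at most 8 fields so the command keeps its spaces.
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        pid, ppid, command = fields[1], fields[2], fields[7]
        if not (pid.isdigit() and ppid.isdigit()):
            continue  # skips the header row and malformed lines
        if int(ppid) == parent_pid:
            children.append((int(pid), command))
    return children
```

Applied to the listing above with `parent_pid=642271`, this selects the checkpointer, the startup process, and recovery workers 2 and 1. It skips the postmaster itself (parent PID 1) and the `grep` process, whose parent PID is different.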