orioledb / orioledb

OrioleDB – building a modern cloud-native storage engine (... and solving some PostgreSQL wicked problems)  🇺🇦

Home Page:https://orioledb.com

Regression test failed for Toast index test.

japinli opened this issue

Hi,

When I try to use OrioleDB on patches14 on Ubuntu 20.04, I encounter the following errors:

PATH="/tmp_install/home/japin/Codes/OrioleDB/build/orioledb/bin:/home/japin/Codes/oriole-extension:$PATH" LD_LIBRARY_PATH="/tmp_install/home/japin/Codes/OrioleDB/build/orioledb/lib:$LD_LIBRARY_PATH"  PGCTLTIMEOUT=900 \
python3 -W ignore::DeprecationWarning -m unittest -v t/toast_index_test.py
test_checkpoint (t.toast_index_test.ToastIndexTest) ... 3.052 s ok
test_checkpoint_at_start (t.toast_index_test.ToastIndexTest) ...
Base directory: /tmp/toast_index_tgsn_loisqrj6
903.974 s ERROR
test_checkpoint_in_middle (t.toast_index_test.ToastIndexTest) ...
Base directory: /tmp/toast_index_tgsn_2ighpbt3
4.484 s ERROR
test_no_checkpoint (t.toast_index_test.ToastIndexTest) ...
Base directory: /tmp/toast_index_tgsn_5kf84hla
4.467 s ERROR

======================================================================
ERROR: test_checkpoint_at_start (t.toast_index_test.ToastIndexTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/japin/.local/lib/python3.8/site-packages/testgres/node.py", line 717, in start
    exit_status, out, error = execute_utility(_params, self.utils_log_file, verbose=True)
  File "/home/japin/.local/lib/python3.8/site-packages/testgres/utils.py", line 66, in execute_utility
    exit_status, out, error = tconf.os_ops.exec_command(args, verbose=True)
  File "/home/japin/.local/lib/python3.8/site-packages/testgres/operations/local_ops.py", line 94, in exec_command
    raise ExecUtilException(message='Utility exited with non-zero code. Error `{}`'.format(error),
testgres.exceptions.ExecUtilException: Utility exited with non-zero code. Error `b'pg_ctl: server did not start in time\n'`
Command: ['/home/japin/Codes/OrioleDB/build/orioledb/bin/pg_ctl', '-D', '/tmp/toast_index_tgsn_loisqrj6/data', '-l', '/tmp/toast_index_tgsn_loisqrj6/logs/postgresql.log', '-w', 'start']
Exit code: 1
----
b'waiting for server to startstopped waiting\n'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/japin/Codes/oriole-extension/t/toast_index_test.py", line 113, in test_checkpoint_at_start
    node.start()
  File "/home/japin/.local/lib/python3.8/site-packages/testgres/node.py", line 723, in start
    raise_from(StartNodeException(msg, files), e)
  File "<string>", line 3, in raise_from
testgres.exceptions.StartNodeException: Cannot start node
/tmp/toast_index_tgsn_loisqrj6/data/postgresql.conf
----
b"\nfsync = off\nmax_worker_processes = 10\nlog_statement = 'all'\nlisten_addresses = '127.0.0.1'\nport = 20066\n\nmax_replication_slots = 10\nmax_wal_senders = 10\n\nhot_standby = on\nwal_keep_size = 320\nwal_level = 'replica'\nshared_preload_libraries = orioledb\n\n"

/tmp/toast_index_tgsn_loisqrj6/data/postgresql.auto.conf
----
b'# Do not edit this file manually!\n# It will be overwritten by the ALTER SYSTEM command.\n'

/tmp/toast_index_tgsn_loisqrj6/data/pg_hba.conf
----
b'local   all             all                                     trust\nhost    all             all             127.0.0.1/32            trust\nhost    all             all             ::1/128                 trust\nlocal   replication     all                                     trust\nhost    replication     all             127.0.0.1/32            trust\nhost    replication     all             ::1/128                 trust\nlocal\treplication\tall\t\t\ttrust\nhost\treplication\tall\t127.0.0.1/32\ttrust\nhost\treplication\tall\t::1/128\t\ttrust\nhost\treplication\tall\t127.0.0.0/24\t\ttrust\nhost\tall\tall\t127.0.0.0/24\t\ttrust\n'

/tmp/toast_index_tgsn_loisqrj6/logs/postgresql.log
----
b'2023-11-14 03:46:52.706 GMT [570341] LOG:  received immediate shutdown request\n2023-11-14 03:46:52.717 GMT [570341] LOG:  database system is shut down\n2023-11-14 03:46:52.872 GMT [570379] LOG:  OrioleDB public beta 3 started\n2023-11-14 03:46:52.874 GMT [570379] LOG:  starting PostgreSQL 14.9 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, 64-bit\n2023-11-14 03:46:52.874 GMT [570379] LOG:  listening on IPv4 address "127.0.0.1", port 20066\n2023-11-14 03:46:52.874 GMT [570379] LOG:  listening on Unix socket "/tmp/.s.PGSQL.20066"\n2023-11-14 03:46:52.876 GMT [570380] LOG:  database system was interrupted; last known up at 2023-11-14 03:46:52 GMT\n2023-11-14 03:46:52.877 GMT [570380] LOG:  database system was not properly shut down; automatic recovery in progress\n2023-11-14 03:46:52.877 GMT [570381] LOG:  orioledb background writer started\n2023-11-14 03:46:52.877 GMT [570380] LOG:  orioledb recovery started.\n2023-11-14 03:46:52.879 GMT [570382] LOG:  orioledb recovery worker 3 started.\n2023-11-14 03:46:52.880 GMT [570383] LOG:  orioledb recovery worker 2 started.\n2023-11-14 03:46:52.880 GMT [570385] LOG:  orioledb recovery worker 0 started.\n2023-11-14 03:46:52.881 GMT [570386] LOG:  orioledb recovery worker 5 started.\n2023-11-14 03:46:52.881 GMT [570384] LOG:  orioledb recovery worker 1 started.\n2023-11-14 03:46:52.882 GMT [570387] LOG:  orioledb recovery worker 4 started.\n2023-11-14 03:46:52.883 GMT [570380] LOG:  redo starts at 0/17F99F8\n2023-11-14 03:52:17.783 GMT [571106] FATAL:  the database system is starting up\n2023-11-14 03:52:20.915 GMT [571114] FATAL:  the database system is starting up\n2023-11-14 03:53:25.134 GMT [571555] FATAL:  the database system is starting up\n'


======================================================================
ERROR: test_checkpoint_in_middle (t.toast_index_test.ToastIndexTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/japin/.local/lib/python3.8/site-packages/testgres/node.py", line 717, in start
    exit_status, out, error = execute_utility(_params, self.utils_log_file, verbose=True)
  File "/home/japin/.local/lib/python3.8/site-packages/testgres/utils.py", line 66, in execute_utility
    exit_status, out, error = tconf.os_ops.exec_command(args, verbose=True)
  File "/home/japin/.local/lib/python3.8/site-packages/testgres/operations/local_ops.py", line 94, in exec_command
    raise ExecUtilException(message='Utility exited with non-zero code. Error `{}`'.format(error),
testgres.exceptions.ExecUtilException: Utility exited with non-zero code. Error `b'pg_ctl: could not start server\nExamine the log output.\n'`
Command: ['/home/japin/Codes/OrioleDB/build/orioledb/bin/pg_ctl', '-D', '/tmp/toast_index_tgsn_2ighpbt3/data', '-l', '/tmp/toast_index_tgsn_2ighpbt3/logs/postgresql.log', '-w', 'start']
Exit code: 1
----
b'waiting for server to start.... stopped waiting\n'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/japin/Codes/oriole-extension/t/toast_index_test.py", line 118, in test_checkpoint_in_middle
    self.init_table(False)
  File "/home/japin/Codes/oriole-extension/t/toast_index_test.py", line 15, in init_table
    node.start()
  File "/home/japin/.local/lib/python3.8/site-packages/testgres/node.py", line 723, in start
    raise_from(StartNodeException(msg, files), e)
  File "<string>", line 3, in raise_from
testgres.exceptions.StartNodeException: Cannot start node
/tmp/toast_index_tgsn_2ighpbt3/data/postgresql.conf
----
b"\nfsync = off\nmax_worker_processes = 10\nlog_statement = 'all'\nlisten_addresses = '127.0.0.1'\nport = 20066\n\nmax_replication_slots = 10\nmax_wal_senders = 10\n\nhot_standby = on\nwal_keep_size = 320\nwal_level = 'replica'\nshared_preload_libraries = orioledb\n\n"

/tmp/toast_index_tgsn_2ighpbt3/data/postgresql.auto.conf
----
b'# Do not edit this file manually!\n# It will be overwritten by the ALTER SYSTEM command.\n'

/tmp/toast_index_tgsn_2ighpbt3/data/pg_hba.conf
----
b'local   all             all                                     trust\nhost    all             all             127.0.0.1/32            trust\nhost    all             all             ::1/128                 trust\nlocal   replication     all                                     trust\nhost    replication     all             127.0.0.1/32            trust\nhost    replication     all             ::1/128                 trust\nlocal\treplication\tall\t\t\ttrust\nhost\treplication\tall\t127.0.0.1/32\ttrust\nhost\treplication\tall\t::1/128\t\ttrust\nhost\treplication\tall\t127.0.0.0/24\t\ttrust\nhost\tall\tall\t127.0.0.0/24\t\ttrust\n'

/tmp/toast_index_tgsn_2ighpbt3/logs/postgresql.log
----
b'2023-11-14 04:01:58.413 GMT [572689] LOG:  OrioleDB public beta 3 started\n2023-11-14 04:01:58.415 GMT [572689] LOG:  starting PostgreSQL 14.9 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, 64-bit\n2023-11-14 04:01:58.415 GMT [572689] LOG:  could not bind IPv4 address "127.0.0.1": Address already in use\n2023-11-14 04:01:58.415 GMT [572689] HINT:  Is another postmaster already running on port 20066? If not, wait a few seconds and retry.\n2023-11-14 04:01:58.415 GMT [572689] WARNING:  could not create listen socket for "127.0.0.1"\n2023-11-14 04:01:58.415 GMT [572689] FATAL:  could not create any TCP/IP sockets\n2023-11-14 04:01:58.440 GMT [572689] LOG:  database system is shut down\n'


======================================================================
ERROR: test_no_checkpoint (t.toast_index_test.ToastIndexTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/japin/.local/lib/python3.8/site-packages/testgres/node.py", line 717, in start
    exit_status, out, error = execute_utility(_params, self.utils_log_file, verbose=True)
  File "/home/japin/.local/lib/python3.8/site-packages/testgres/utils.py", line 66, in execute_utility
    exit_status, out, error = tconf.os_ops.exec_command(args, verbose=True)
  File "/home/japin/.local/lib/python3.8/site-packages/testgres/operations/local_ops.py", line 94, in exec_command
    raise ExecUtilException(message='Utility exited with non-zero code. Error `{}`'.format(error),
testgres.exceptions.ExecUtilException: Utility exited with non-zero code. Error `b'pg_ctl: could not start server\nExamine the log output.\n'`
Command: ['/home/japin/Codes/OrioleDB/build/orioledb/bin/pg_ctl', '-D', '/tmp/toast_index_tgsn_5kf84hla/data', '-l', '/tmp/toast_index_tgsn_5kf84hla/logs/postgresql.log', '-w', 'start']
Exit code: 1
----
b'waiting for server to start.... stopped waiting\n'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/japin/Codes/oriole-extension/t/toast_index_test.py", line 74, in test_no_checkpoint
    self.init_table(False)
  File "/home/japin/Codes/oriole-extension/t/toast_index_test.py", line 15, in init_table
    node.start()
  File "/home/japin/.local/lib/python3.8/site-packages/testgres/node.py", line 723, in start
    raise_from(StartNodeException(msg, files), e)
  File "<string>", line 3, in raise_from
testgres.exceptions.StartNodeException: Cannot start node
/tmp/toast_index_tgsn_5kf84hla/data/postgresql.conf
----
b"\nfsync = off\nmax_worker_processes = 10\nlog_statement = 'all'\nlisten_addresses = '127.0.0.1'\nport = 20066\n\nmax_replication_slots = 10\nmax_wal_senders = 10\n\nhot_standby = on\nwal_keep_size = 320\nwal_level = 'replica'\nshared_preload_libraries = orioledb\n\n"

/tmp/toast_index_tgsn_5kf84hla/data/postgresql.auto.conf
----
b'# Do not edit this file manually!\n# It will be overwritten by the ALTER SYSTEM command.\n'

/tmp/toast_index_tgsn_5kf84hla/data/pg_hba.conf
----
b'local   all             all                                     trust\nhost    all             all             127.0.0.1/32            trust\nhost    all             all             ::1/128                 trust\nlocal   replication     all                                     trust\nhost    replication     all             127.0.0.1/32            trust\nhost    replication     all             ::1/128                 trust\nlocal\treplication\tall\t\t\ttrust\nhost\treplication\tall\t127.0.0.1/32\ttrust\nhost\treplication\tall\t::1/128\t\ttrust\nhost\treplication\tall\t127.0.0.0/24\t\ttrust\nhost\tall\tall\t127.0.0.0/24\t\ttrust\n'

/tmp/toast_index_tgsn_5kf84hla/logs/postgresql.log
----
b'2023-11-14 04:02:02.879 GMT [572716] LOG:  OrioleDB public beta 3 started\n2023-11-14 04:02:02.897 GMT [572716] LOG:  starting PostgreSQL 14.9 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0, 64-bit\n2023-11-14 04:02:02.897 GMT [572716] LOG:  could not bind IPv4 address "127.0.0.1": Address already in use\n2023-11-14 04:02:02.897 GMT [572716] HINT:  Is another postmaster already running on port 20066? If not, wait a few seconds and retry.\n2023-11-14 04:02:02.897 GMT [572716] WARNING:  could not create listen socket for "127.0.0.1"\n2023-11-14 04:02:02.897 GMT [572716] FATAL:  could not create any TCP/IP sockets\n2023-11-14 04:02:02.922 GMT [572716] LOG:  database system is shut down\n'


----------------------------------------------------------------------
Ran 4 tests in 915.983s

FAILED (errors=3)
make: *** [Makefile:199: t/toast_index_test.py] Error 1
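A note on the second and third failures: both logs report "Address already in use" for port 20066, the same port configured for the first node, whose postmaster was still stuck in recovery. So these two errors look like a consequence of the first one rather than separate problems. Below is a minimal sketch of how to check this before starting a node; `port_is_free` is a hypothetical helper, not part of testgres or the OrioleDB test suite.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port.

    Hypothetical helper for illustration. It is conservative: it does not
    set SO_REUSEADDR, so it may also report a port as busy in cases where
    a server that sets SO_REUSEADDR could still bind to it.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another process already holds the port.
            return False
        return True
```

For example, `port_is_free(20066)` returning `False` after the first test would indicate that the old postmaster is still holding the port.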

It seems the database crashed and cannot complete recovery:

$ pg_controldata -D /tmp/toast_index_tgsn_loisqrj6/data/ | grep state
Database cluster state:               in crash recovery
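The same check can be scripted when a test node fails to start. This is a rough sketch, assuming `pg_controldata` is on `PATH` and prints English messages (hence `LC_ALL=C`); the function names are hypothetical and not part of testgres.

```python
import os
import subprocess
from typing import Optional


def parse_cluster_state(controldata_output: str) -> Optional[str]:
    """Extract the 'Database cluster state' value from pg_controldata output."""
    for line in controldata_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Database cluster state":
            return value.strip()
    return None


def cluster_state(data_dir: str, pg_controldata: str = "pg_controldata") -> Optional[str]:
    """Run pg_controldata on data_dir and return the reported cluster state."""
    result = subprocess.run(
        [pg_controldata, "-D", data_dir],
        capture_output=True,
        text=True,
        check=True,
        # Force untranslated output so the key matches on any locale.
        env={**os.environ, "LC_ALL": "C"},
    )
    return parse_cluster_state(result.stdout)
```

For the data directory above, `cluster_state(...)` would be expected to return the string shown in the `grep` output.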

Is this a bug in TOAST index handling in OrioleDB?

I found that two recovery workers are always running, looping in find_page().

top - 17:44:08 up 15 days,  7:12,  1 user,  load average: 1.29, 1.98, 2.26
Tasks: 134 total,   3 running, 130 sleeping,   1 stopped,   0 zombie
%Cpu(s): 98.7 us,  1.2 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem :   7945.5 total,   4343.3 free,    481.3 used,   3120.9 buff/cache
MiB Swap:   4096.0 total,   4093.7 free,      2.3 used.   7110.2 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 642278 japin     20   0  303056  11020   9180 R  97.4   0.1  13:20.12 postgres: orioledb recovery worker 1
 642277 japin     20   0  303056  10684   8732 R  82.1   0.1  11:48.59 postgres: orioledb recovery worker 2
 642274 japin     20   0  303644  10464   8444 S  18.5   0.1   2:50.71 postgres: startup recovering 00000001000000000+
 642275 japin     20   0  302800   6060   4508 S   0.3   0.1   0:00.09 postgres: orioledb background writer
 642276 japin     20   0  302840   6304   4772 S   0.3   0.1   0:01.03 postgres: orioledb recovery worker 3
 642280 japin     20   0  302840   6304   4772 S   0.3   0.1   0:01.07 postgres: orioledb recovery worker 4
 642281 japin     20   0  302840   6304   4772 S   0.3   0.1   0:01.02 postgres: orioledb recovery worker 5
 642272 japin     20   0  302816   5844   4292 S   0.0   0.1   0:00.00 postgres: checkpointer
 642273 japin     20   0  302520   4852   3484 S   0.0   0.1   0:00.01 postgres: background writer
 642279 japin     20   0  303056  10516   8672 S   0.0   0.1   0:01.07 postgres: orioledb recovery worker 0

Here is the backtrace:

(gdb) bt
#0  find_page (context=0x7fffca171e10, key=0x7fffca171df0, keyType=BTreeKeyNonLeafKey, targetLevel=0) at src/btree/find.c:139
#1  0x00007f60c4921193 in refind_page (context=0x7fffca171e10, key=0x7fffca171df0, keyType=BTreeKeyNonLeafKey, level=0, _blkno=2149, _pageChangeCount=0) at src/btree/find.c:744
#2  0x00007f60c494ddeb in modify_undo_callback (location=2425000, baseItem=0x55eefaf4a480, oxid=57, abort=true, changeCountsValid=true) at src/btree/undo.c:424
#3  0x00007f60c49b884d in walk_undo_range (location=2425000, toLoc=2305843009213693952, buf=0x7fffca1777a0, oxid=57, abort_val=true, onCommitLocation=0x7fffca177748, changeCountsValid=true) at src/transam/undo.c:537
#4  0x00007f60c49b8e8e in walk_undo_stack (oxid=57, toLocation=0x0, abortTrx=true, changeCountsValid=true) at src/transam/undo.c:644
#5  0x00007f60c49b8fee in apply_undo_stack (oxid=57, toLocation=0x0, changeCountsValid=true) at src/transam/undo.c:684
#6  0x00007f60c498eac4 in recovery_finish_current_oxid (csn=2, ptr=23152733, worker_id=2, sync=false) at src/recovery/recovery.c:1195
#7  0x00007f60c4995871 in recovery_queue_process (queue=0x55eefaeec400, id=2) at src/recovery/worker.c:407
#8  0x00007f60c499504b in recovery_worker_main (main_arg=2) at src/recovery/worker.c:212
#9  0x000055eef9c78962 in StartBackgroundWorker () at /home/japin/Codes/OrioleDB/build/../src/backend/postmaster/bgworker.c:858
#10 0x000055eef9c83752 in do_start_bgworker (rw=0x55eefaee7e00) at /home/japin/Codes/OrioleDB/build/../src/backend/postmaster/postmaster.c:5839
#11 0x000055eef9c83b8d in maybe_start_bgworkers () at /home/japin/Codes/OrioleDB/build/../src/backend/postmaster/postmaster.c:6071
#12 0x000055eef9c82ab6 in sigusr1_handler (postgres_signal_arg=10) at /home/japin/Codes/OrioleDB/build/../src/backend/postmaster/postmaster.c:5220
#13 <signal handler called>
#14 0x00007f60c6b30f7a in __GI___select (nfds=7, readfds=0x7fffca178a20, writefds=0x0, exceptfds=0x0, timeout=0x7fffca178990) at ../sysdeps/unix/sysv/linux/select.c:41
#15 0x000055eef9c7e1ee in ServerLoop () at /home/japin/Codes/OrioleDB/build/../src/backend/postmaster/postmaster.c:1777
#16 0x000055eef9c7db0b in PostmasterMain (argc=3, argv=0x55eefaea02a0) at /home/japin/Codes/OrioleDB/build/../src/backend/postmaster/postmaster.c:1485
#17 0x000055eef9b76ce0 in main (argc=3, argv=0x55eefaea02a0) at /home/japin/Codes/OrioleDB/build/../src/backend/main/main.c:202

It seems find_page() cannot find a valid btree page. But why doesn't it exit the loop?

I tried to kill the postmaster; however, recovery workers 1 and 2 and the startup process could not be terminated.

$ kill 642271
$ ps -ef | grep postgres
japin     642271       1  0 17:28 ?        00:00:00 /home/japin/Codes/OrioleDB/build/orioledb/bin/postgres -D /tmp/toast_index_tgsn_kih3we61/data
japin     642272  642271  0 17:28 ?        00:00:00 postgres: checkpointer
japin     642274  642271 18 17:28 ?        00:04:25 postgres: startup recovering 000000010000000000000001
japin     642277  642271 77 17:28 ?        00:18:12 postgres: orioledb recovery worker 2
japin     642278  642271 84 17:28 ?        00:19:52 postgres: orioledb recovery worker 1
japin     645187  492271  0 17:52 pts/13   00:00:00 grep --color=auto postgres
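To spot such leftover processes between test runs, the `ps -ef` output can be filtered by parent PID. This is a small sketch assuming the column layout shown above (UID, PID, PPID, C, STIME, TTY, TIME, CMD); `children_of` is a hypothetical helper.

```python
from typing import List, Tuple


def children_of(ps_output: str, parent_pid: int) -> List[Tuple[int, str]]:
    """Return (pid, command) for every `ps -ef` row whose PPID is parent_pid."""
    children = []
    for line in ps_output.splitlines():
        # Split into at most 8 fields so the command keeps its spaces.
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        # Skip the header row and anything else without numeric PID/PPID.
        if not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        if int(fields[2]) == parent_pid:
            children.append((int(fields[1]), fields[7]))
    return children
```

Applied to the listing above with parent PID 642271, this would return the checkpointer, the startup process, and recovery workers 2 and 1, but not the `grep` process, whose parent is a different PID.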