[Bug]: cn crashed by fatal "wait latest commit ts failed" during stability test on distributed mode
aressu1985 opened this issue
Is there an existing issue for the same bug?
- I have checked the existing issues.
Branch Name
1.2-dev
Commit ID
Other Environment Information
- Hardware parameters:
3*CN: 16C 64G
1*DN: 16C 64G
3*LOG: 4C 16G
2*PROXY: 3C 6G
- OS type:
- Others:
Actual Behavior
During the stability test in distributed mode, a CN crashed with the following fatal error:
{"level":"FATAL","time":"2024/06/05 22:09:44.081179 +0000","name":"cn-service.txn","caller":"client/client.go:434","msg":"wait latest commit ts failed","uuid":"65393636-3165-6662-6631-633163326338","error":"waiter is paused","stacktrace":"github.com/matrixorigin/matrixone/pkg/txn/client.(*txnClient).SyncLatestCommitTS\n\t/go/src/github.com/matrixorigin/matrixone/pkg/txn/client/client.go:434\ngithub.com/matrixorigin/matrixone/pkg/sql/compile.(*sqlExecutor).maybeWaitCommittedLogApplied\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/sql_executor.go:154\ngithub.com/matrixorigin/matrixone/pkg/sql/compile.(*sqlExecutor).ExecTxn\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/sql_executor.go:144\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(*sqlStore).Allocate\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/store_sql.go:160\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(*allocator).doAllocate\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/allocator.go:164\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(*allocator).run\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/allocator.go:151\ngithub.com/matrixorigin/matrixone/pkg/common/stopper.(*Stopper).doRunCancelableTask.func1\n\t/go/src/github.com/matrixorigin/matrixone/pkg/common/stopper/stopper.go:277"}
Expected Behavior
No response
Steps to Reproduce
1. Run a MO cluster with the configuration given in this issue.
2. Run TPC-H 10G loop test processes in one independent tenant.
3. Run TPC-C long-running test processes with 10 warehouses and 10 terminals in one independent tenant, prepare mode.
4. Run sysbench mixed-case (insert/delete/update/select) long-running test processes with 75 terminals in one independent tenant, non-prepare mode.
5. Run another sysbench mixed-case (insert/delete/update/select) long-running test process with 75 terminals in one independent tenant, non-prepare mode.
Additional information
No response
[2024-06-06 06:09:44.081 FATAL]
06-07-2024-17-35-08_files_list.zip
[2024-06-06 06:14:54 FATAL]
06-07-2024-17-37-50_files_list.zip
[2024-06-06 06:18:44 FATAL]
06-07-2024-17-38-47_files_list.zip
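Both the scheduling wait time and the execution time of flush are higher than expected; reproducing.
Added logging for flush tasks that take seconds to complete, mainly to observe two things: 1. task scheduling delay 2. IO time spent collecting deletes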
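The IO time spent collecting deletes is too long; fix in progress.
In the daily runs no duration exceeds 10s, and pull logtail has not yet shown any excessively long durations.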
fixed
testing
fixed