matrixorigin / matrixone

Hyperconverged cloud-edge native database

Home Page:https://docs.matrixorigin.cn/en

[Bug]: cn crashed by fatal "wait latest commit ts failed" during stability test on distributed mode

aressu1985 opened this issue

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Branch Name

1.2-dev

Commit ID

e6b2868

Other Environment Information

- Hardware parameters:
3*CN: 16C 64G
1*DN: 16C 64G
3*LOG: 4C 16G
2*PROXY: 3C 6G
- OS type:
- Others:

Actual Behavior

During a stability test in distributed mode, a CN crashed with the following fatal error:
{"level":"FATAL","time":"2024/06/05 22:09:44.081179 +0000","name":"cn-service.txn","caller":"client/client.go:434","msg":"wait latest commit ts failed","uuid":"65393636-3165-6662-6631-633163326338","error":"waiter is paused","stacktrace":"github.com/matrixorigin/matrixone/pkg/txn/client.(*txnClient).SyncLatestCommitTS\n\t/go/src/github.com/matrixorigin/matrixone/pkg/txn/client/client.go:434\ngithub.com/matrixorigin/matrixone/pkg/sql/compile.(*sqlExecutor).maybeWaitCommittedLogApplied\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/sql_executor.go:154\ngithub.com/matrixorigin/matrixone/pkg/sql/compile.(*sqlExecutor).ExecTxn\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/sql_executor.go:144\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(*sqlStore).Allocate\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/store_sql.go:160\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(*allocator).doAllocate\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/allocator.go:164\ngithub.com/matrixorigin/matrixone/pkg/incrservice.(*allocator).run\n\t/go/src/github.com/matrixorigin/matrixone/pkg/incrservice/allocator.go:151\ngithub.com/matrixorigin/matrixone/pkg/common/stopper.(*Stopper).doRunCancelableTask.func1\n\t/go/src/github.com/matrixorigin/matrixone/pkg/common/stopper/stopper.go:277"}
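For triage, the fatal entry above can be unpacked into its error and call chain. The following is a minimal sketch, assuming the entry is a single JSON object whose `stacktrace` field alternates function lines with tab-indented `file:line` lines, as it does here. The names `logEntry`, `parseEntry`, and `frames` are hypothetical helpers, not part of MatrixOne, and the sample line is a shortened copy of the log.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry mirrors only the fields of the log line that matter for triage.
// It is a hypothetical helper type, not a MatrixOne type.
type logEntry struct {
	Level      string `json:"level"`
	Caller     string `json:"caller"`
	Msg        string `json:"msg"`
	Error      string `json:"error"`
	Stacktrace string `json:"stacktrace"`
}

// parseEntry decodes one JSON log line.
func parseEntry(raw string) (logEntry, error) {
	var e logEntry
	err := json.Unmarshal([]byte(raw), &e)
	return e, err
}

// frames returns the function lines of a stack trace, skipping the
// tab-indented file:line lines that follow each function.
func frames(stack string) []string {
	var out []string
	for _, line := range strings.Split(stack, "\n") {
		if line == "" || strings.HasPrefix(line, "\t") {
			continue
		}
		out = append(out, line)
	}
	return out
}

func main() {
	// Shortened copy of the FATAL entry; the raw string keeps \n and \t as
	// JSON escapes, which json.Unmarshal turns into real newlines and tabs.
	raw := `{"level":"FATAL","caller":"client/client.go:434","msg":"wait latest commit ts failed","error":"waiter is paused","stacktrace":"github.com/matrixorigin/matrixone/pkg/txn/client.(*txnClient).SyncLatestCommitTS\n\t/go/src/github.com/matrixorigin/matrixone/pkg/txn/client/client.go:434\ngithub.com/matrixorigin/matrixone/pkg/sql/compile.(*sqlExecutor).maybeWaitCommittedLogApplied\n\t/go/src/github.com/matrixorigin/matrixone/pkg/sql/compile/sql_executor.go:154"}`

	e, err := parseEntry(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s %q (error: %s)\n", e.Level, e.Msg, e.Error)
	for i, f := range frames(e.Stacktrace) {
		fmt.Printf("  #%d %s\n", i, f)
	}
}
```

Read top to bottom, the frames in the full entry show the fatal being raised in `SyncLatestCommitTS`, reached from the `incrservice` allocator through `sqlExecutor.ExecTxn`.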

mo-log:
https://shanghai.idc.matrixorigin.cn:30001/explore?panes=%7B%22Jyy%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-e6b2868-20240605224953%5C%22%7D%20%7C%3D%20%60FATAL%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221717623352647%22,%22to%22:%221717626935791%22%7D%7D%7D&schemaVersion=1&orgId=1

Expected Behavior

No response

Steps to Reproduce

1. Run an MO cluster with the config in this issue.
2. Run tpch 10G loop test processes in one independent tenant.
3. Run tpcc (10 warehouses, 10 terminals) long-running test processes in one independent tenant, prepare mode.
4. Run sysbench mixed-case (insert/delete/update/select) long-running test processes with 75 terminals in one independent tenant, non-prepare mode.
5. Run another sysbench mixed-case (insert/delete/update/select) long-running test process with 75 terminals in one independent tenant, non-prepare mode.

Additional information

No response

[2024-06-06 06:09:44.081 FATAL]
06-07-2024-17-35-08_files_list.zip

[2024-06-06 06:14:54 FATAL]
06-07-2024-17-37-50_files_list.zip

[2024-06-06 06:18:44 FATAL]
06-07-2024-17-38-47_files_list.zip

Both the scheduling wait time and the execution time of flush are longer than expected; reproducing.

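Added logging for flush tasks that take on the order of seconds, mainly to observe two things: 1. task scheduling delay 2. the I/O time spent collecting deletes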

The I/O time for collecting deletes is too long; fix in progress.

Before the PR:
image

After the PR:
image

Flush time has dropped significantly.

In the daily test, durations all stay under 10s, and pull logtail has not yet shown any excessively long durations.

fixed

testing

fixed