matrixorigin / matrixone

Hyperconverged cloud-edge native database

Home Page:https://docs.matrixorigin.cn/en

[Bug]: logservice crashed by "no space left on device" during regression on TKE

aressu1985 opened this issue · comments

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Branch Name

1.2-dev

Commit ID

b6ac22a

Other Environment Information

- Hardware parameters:
3*CN: 16C 64G
1*DN: 16C 64G
3*LOG: 4C 16G
3*PROXY: 3C 7G
- OS type:
- Others:

Actual Behavior

job link:
https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9467464435/job/26105458866

During the benchmark regression on TKE, the log service crashed with "no space left on device":
/usr/local/go/src/runtime/proc.go:271"}
{"level":"WARN","time":"2024/06/12 02:26:28.616659 +0000","caller":"fileservice/disk_cache.go:343","msg":"write disk cache error","error":"mkdir /var/lib/matrixone/data/etl-cache/fullsys/logs/2024/06/12: no space left on device"}
{"level":"INFO","time":"2024/06/12 02:26:28.616741 +0000","caller":"motrace/syncer.go:89","msg":"Wait signal done."}
panic: write /var/lib/matrixone/data/logservice-data/00000000-0000-0000-0000-000000000000/nightly-regression-dis-log-0/06166173447481204388/tandb/node-0-131072/000005.idxtmp: no space left on device

goroutine 1 gp=0xc0000081c0 m=7 mp=0xc000506008 [running]:
panic({0x3e772c0?, 0xc00841e830?})
/usr/local/go/src/runtime/panic.go:779 +0x158 fp=0xc00bb8cbb0 sp=0xc00bb8cb00 pc=0x443ab8
go.uber.org/zap/zapcore.CheckWriteAction.OnWrite(0x0?, 0x0?, {0x0?, 0x0?, 0xc00268a060?})
/go/pkg/mod/go.uber.org/zap@v1.24.0/zapcore/entry.go:198 +0x54 fp=0xc00bb8cbd0 sp=0xc00bb8cbb0 pc=0x6d0034
go.uber.org/zap/zapcore.(*CheckWriteAction).OnWrite(0x0?, 0x0?, {0x0?, 0x0?, 0x477a42b?})
<autogenerated>:1 +0x2d fp=0xc00bb8cc08 sp=0xc00bb8cbd0 pc=0x6dcaad
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc0026c81a0, {0x0, 0x0, 0x0})
/go/pkg/mod/go.uber.org/zap@v1.24.0/zapcore/entry.go:264 +0x24e fp=0xc00bb8cd98 sp=0xc00bb8cc08 pc=0x6d03ae
go.uber.org/zap.(*SugaredLogger).log(0xc000114008, 0x4, {0x476c49d?, 0x7f8ae870a108?}, {0xc00841e2a0?, 0xc000ddfaf0?, 0xc00bb8ce10?}, {0x0, 0x0, 0x0})
/go/pkg/mod/go.uber.org/zap@v1.24.0/sugar.go:295 +0xec fp=0xc00bb8cdd8 sp=0xc00bb8cd98 pc=0x86750c
go.uber.org/zap.(*SugaredLogger).Panicf(...)
/go/pkg/mod/go.uber.org/zap@v1.24.0/sugar.go:189
github.com/matrixorigin/matrixone/pkg/logutil.DragonboatAdaptLogger.Panicf(...)
/go/src/github.com/matrixorigin/matrixone/pkg/logutil/dragonboat.go:65
github.com/matrixorigin/matrixone/pkg/logutil.(*DragonboatAdaptLogger).Panicf(0xc000ddfad0?, {0x476c49d?, 0x418525?}, {0xc00841e2a0?, 0x3ee2340?, 0x20001?})
<autogenerated>:1 +0x55 fp=0xc00bb8ce38 sp=0xc00bb8cdd8 pc=0xa3a6f5
github.com/lni/dragonboat/v4/logger.(*dragonboatLogger).Panicf(0xc003990af0?, {0x476c49d, 0x3}, {0xc00841e2a0, 0x1, 0x1})
/go/pkg/mod/github.com/matrixorigin/dragonboat/v4@v4.0.0-20240312080931-1b40809d7cea/logger/logger.go:132 +0x51 fp=0xc00bb8ce78 sp=0xc00bb8ce38 pc=0xa300d1
github.com/lni/dragonboat/v4.panicNow(...)
/go/pkg/mod/github.com/matrixorigin/dragonboat/v4@v4.0.0-20240312080931-1b40809d7cea/nodehost.go:2230
github.com/lni/dragonboat/v4.(*NodeHost).startShard(0xc000566408, 0x0, 0x0, 0xc00bb8d648, {0x20000, 0x0, 0x1, 0x1, 0xa, 0x1, ...}, ...)
/go/pkg/mod/github.com/matrixorigin/dragonboat/v4@v4.0.0-20240312080931-1b40809d7cea/nodehost.go:1649 +0xd88 fp=0xc00bb8d5c8 sp=0xc00bb8ce78 pc=0x1615388
github.com/lni/dragonboat/v4.(*NodeHost).StartReplica(0xc00265e808?, 0xc00297d790?, 0xbe?, 0xb0?, {0x20000, 0x0, 0x1, 0x1, 0xa, 0x1, ...})
/go/pkg/mod/github.com/matrixorigin/dragonboat/v4@v4.0.0-20240312080931-1b40809d7cea/nodehost.go:508 +0xe5 fp=0xc00bb8d6c0 sp=0xc00bb8d5c8 pc=0x160e585
github.com/matrixorigin/matrixone/pkg/logservice.(*store).startHAKeeperReplica(0xc0047fce08, 0x20000, 0x4?, 0x8?)

The "Volume Space Usage" of logservice also started increasing continuously from a certain point:
image

This may have been caused by a bug in log record truncation.

mo-log:
https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22pIs%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-branch-reg-b6ac22a%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221718114400000%22,%22to%22:%221718161199000%22%7D%7D%7D&schemaVersion=1&orgId=1

Expected Behavior

No response

Steps to Reproduce

not sure

Additional information

No response

The obj was most likely produced by insert. Still looking for a way to reproduce it.

This has not been reproduced. Lowering the priority for now and delaying to 1.2.2.

Still not reproduced.