[Bug]: [date 6.6]tke regression: tpcc 500ware 1000threads test cn oom
heni02 opened this issue · comments
Is there an existing issue for the same bug?
- I have checked the existing issues.
Branch Name
main
Commit ID
Other Environment Information
- Hardware parameters:
- OS type:
- Others:
Actual Behavior
job:https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9402783044/job/25908076699
Expected Behavior
No response
Steps to Reproduce
tke regression tpcc 500ware 1000threads test
Additional information
No response
Profile data from mo's built-in profiler, captured at two points in time: 2024-06-07 04:46:34 and 2024-06-07 04:47:03 (before and after the OOM).
hn_download_tmp.zip
The date 6.7 regression run also reproduced this. cc @reusee
job:
https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9418697195/job/25953862790
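Judging from this metric, the cn's Go heap usage is very high. The OOM happened after it reached 54.2G.
This is probably still the problem of mpool allocating too much memory.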
https://grafana.ci.matrixorigin.cn/goto/noMbrI8SR?orgId=1
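The heap profile also shows high Go heap usage. The highest peak reached 91.9GB and the second-highest was 47.2GB. Adding off-heap memory to that amount puts it over the limit.
The next optimization step is to refactor mpool.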
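For reference, the check described above (Go heap plus off-heap memory compared against the limit) can be sketched in Go roughly as follows. This is only an illustration: `exceedsLimit`, the 64 GiB limit, and the 8 GiB off-heap figure are hypothetical values chosen for the example, not the actual cn configuration or measurements from this job.

```go
package main

import (
	"fmt"
	"runtime"
)

// exceedsLimit reports whether Go heap usage plus off-heap usage
// is larger than the given memory limit (all values in bytes).
// Hypothetical helper, not part of MatrixOne.
func exceedsLimit(goHeapBytes, offHeapBytes, limitBytes uint64) bool {
	return goHeapBytes+offHeapBytes > limitBytes
}

func main() {
	// Read the Go runtime's own view of its memory usage.
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	// Assumed values for illustration only.
	const hypotheticalLimit = 64 << 30  // 64 GiB memory limit
	const hypotheticalOffHeap = 8 << 30 // 8 GiB allocated outside the Go heap

	fmt.Printf("HeapInuse=%d MiB, Sys=%d MiB\n", m.HeapInuse>>20, m.Sys>>20)
	fmt.Println("over limit:", exceedsLimit(m.HeapInuse, hypotheticalOffHeap, hypotheticalLimit))
}
```

`runtime.MemStats` only covers memory managed by the Go runtime, so any off-heap allocations would have to be tracked separately and added in, as in the sketch.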
The mpool refactoring is in progress.
It is fairly certain that this is a goroutine leak. The specific leak path can only be seen once the goroutine profile is enabled in pyroscope.
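As a minimal sketch of what a goroutine profile shows, the standard `runtime/pprof` package can dump all goroutine stacks from inside a process. This uses plain Go tooling rather than pyroscope, and `dumpGoroutines` and `leak` are hypothetical helpers written for this example, not code from the cn.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"runtime/pprof"
)

// dumpGoroutines returns the current goroutine count and a text dump of
// goroutine stacks. With debug=1, identical stacks are grouped together
// with a count, which makes a leaking call path stand out.
func dumpGoroutines() (int, string, error) {
	p := pprof.Lookup("goroutine")
	if p == nil {
		return 0, "", fmt.Errorf("goroutine profile not available")
	}
	var buf bytes.Buffer
	if err := p.WriteTo(&buf, 1); err != nil {
		return 0, "", err
	}
	return runtime.NumGoroutine(), buf.String(), nil
}

// leak starts n goroutines that block forever on a channel nobody
// writes to, simulating a goroutine leak.
func leak(n int) {
	for i := 0; i < n; i++ {
		ch := make(chan struct{})
		go func() { <-ch }()
	}
}

func main() {
	before, _, err := dumpGoroutines()
	if err != nil {
		panic(err)
	}
	leak(100)
	after, dump, err := dumpGoroutines()
	if err != nil {
		panic(err)
	}
	fmt.Printf("goroutines before=%d after=%d\n", before, after)
	fmt.Println(dump) // the leaked stacks appear grouped under one entry
}
```

Comparing two such dumps taken some time apart is one way to spot a stack whose count only grows.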
Optimization is ongoing.
Optimization is ongoing.
Optimized.