microsoft / pai

Resource scheduling and cluster management for AI

Home Page:https://openpai.readthedocs.io

Memory cgroup out of memory happens after training a job for a few days

HaoLiuHust opened this issue · comments

Organization Name:

Short summary about the issue/question:

When training a face recognition job on PAI using PyTorch DDP, after 3~4 days a process in the job gets killed. According to "dmesg -T", the cause is "Memory cgroup out of memory". But looking at the job metrics, memory usage did not exceed the limit; it seems file cache was also counted in the memory usage. Have you met a similar problem?
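One way to see this accounting directly (an illustrative sketch, not part of OpenPAI; the path /sys/fs/cgroup/memory/memory.stat is an assumption about the container's cgroup v1 setup) is to parse the cgroup's memory.stat, where both "rss" and "cache" are charged against the limit:

```python
# Sketch: parse cgroup v1 memory.stat text and compare rss vs cache.
# Inside a container you would read /sys/fs/cgroup/memory/memory.stat;
# here a sample string stands in for that file.

def parse_memory_stat(text):
    """Parse 'key value' lines from a cgroup v1 memory.stat into a dict."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[key] = int(value)
    return stats

SAMPLE = """cache 8589934592
rss 4294967296
mapped_file 1048576
"""

stats = parse_memory_stat(SAMPLE)
# Page cache ('cache') counts toward the cgroup limit alongside 'rss',
# so the OOM killer can fire even when rss alone is under the limit.
charged = stats["cache"] + stats["rss"]
print(charged)  # 12884901888 bytes = 12 GiB
```

In this sample, rss is only 4 GiB, but the charged total is 12 GiB because of 8 GiB of page cache, which matches the symptom described above.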

Brief what process you are following:

How to reproduce it:

OpenPAI Environment:

  • OpenPAI version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Hardware (e.g. core number, memory size, storage size, GPU type etc.):
  • Others:

Anything else we need to know:

What I encountered is that one of the processes in the container was killed by the OS because of OOM. However, the actual memory usage did not exceed the limit; there was some cache from reading the dataset. Did you encounter the same problem in your production env?

Yes, according to the doc, cache usage is counted within the container's cgroup, which can be restricted in size by the limit.
You can also refer to this question: https://serverfault.com/questions/516074/why-are-applications-in-a-memory-limited-lxc-container-writing-large-files-to-di/516088#516088?s=74b50a84893c4eb0a1532ea0d8532457 — hope it helps.
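One mitigation that follows from the linked answer (a sketch under assumptions, not something OpenPAI or PyTorch does for you): on Linux, a data-loading path can hint the kernel to evict a file's pages from the cache right after reading them, via posix_fadvise, so cache does not accumulate against the cgroup limit:

```python
import os
import tempfile

# Sketch: read a file, then advise the kernel that its cached pages are
# no longer needed (POSIX_FADV_DONTNEED) so they can be evicted instead
# of accumulating against the cgroup's memory limit.

def read_and_drop_cache(path):
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        while True:
            chunk = os.read(fd, 1 << 20)  # read in 1 MiB chunks
            if not chunk:
                break
            chunks.append(chunk)
        data = b"".join(chunks)
        # Ask the kernel to drop this file's page-cache entries
        # (offset=0, length=0 covers the whole file).
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return data
    finally:
        os.close(fd)

# Demo on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 4096)
    tmp_path = f.name
data = read_and_drop_cache(tmp_path)
os.unlink(tmp_path)
print(len(data))  # 4096
```

The hint is advisory: dirty pages must be written back before they can be dropped, so it works best for read-mostly dataset files.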

Thanks. I have set /proc/sys/vm/dirty_background_ratio and some other kernel params, but OOM still occurred. What did you do in your production env to solve this?
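For context, the dirty-page knobs mentioned above are usually tuned via sysctl; the values below are illustrative only, not a verified fix for this issue:

```
# illustrative sysctl fragment, e.g. /etc/sysctl.d/99-writeback.conf
vm.dirty_background_ratio = 5    # start background writeback earlier
vm.dirty_ratio = 10              # throttle writers sooner to cap dirty pages
```

Note that these only limit dirty (unwritten) pages. Clean page cache is still charged to the cgroup until the kernel reclaims it, which may be why tuning them alone did not prevent the OOM here.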