Memory cgroup out of memory happens when a training job runs for a few days
HaoLiuHust opened this issue · comments
Organization Name:
Short summary about the issue/question:
When training a face recognition job on PAI using PyTorch DDP, after 3~4 days one process in the job gets killed. According to `dmesg -T`, the cause is "Memory cgroup out of memory". But looking at the job metrics, memory usage did not exceed the limit; it seems the file cache was also counted toward memory usage. Have you met a similar problem?
Brief what process you are following:
How to reproduce it:
OpenPAI Environment:
- OpenPAI version:
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. `uname -a`):
- Hardware (e.g. core number, memory size, storage size, GPU type etc.):
- Others:
Anything else we need to know:
What I encountered is that one of the processes in the container gets killed by the OS because of OOM. However, the actual memory usage did not exceed the limit; there is just some page cache from reading the dataset. Did you encounter the same problem in your production env?
Yes, according to the doc, the cache usage is counted within the container's cgroup, which can be restricted in size by the limit.

You can also refer to this question: https://serverfault.com/questions/516074/why-are-applications-in-a-memory-limited-lxc-container-writing-large-files-to-di/516088#516088?s=74b50a84893c4eb0a1532ea0d8532457. Hope it helps.
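You can confirm this yourself by looking at the cgroup's `memory.stat`, which breaks the charge down into `cache` (page cache) and `rss` (anonymous memory). A minimal sketch, assuming cgroup v1 with the memory controller mounted at the usual path (adjust the path for your node):

```python
def parse_memory_stat(text):
    """Parse the key/value lines of a cgroup v1 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def cache_vs_rss(stat_path="/sys/fs/cgroup/memory/memory.stat"):
    """Return (cache, rss) in bytes for the given cgroup.

    Both numbers are charged against memory.limit_in_bytes, which is
    why a job can be OOM-killed even though its RSS alone is under
    the limit.
    """
    with open(stat_path) as f:
        stats = parse_memory_stat(f.read())
    return stats.get("cache", 0), stats.get("rss", 0)
```

If `cache` is large and close to `limit - rss`, the kill is being triggered by cached dataset pages rather than by the training process itself.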
Thanks. I have set /proc/sys/vm/dirty_background_ratio and some other kernel params, but OOM still occurred. What did you do in your production env to solve this?
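One application-side mitigation (independent of the sysctls above) is to tell the kernel it may drop the cached pages for dataset files you have finished reading, via `posix_fadvise(POSIX_FADV_DONTNEED)`. A minimal sketch; the function name and chunk size are my own, not from PAI or the thread:

```python
import os


def read_then_drop_cache(path, chunk=1 << 20):
    """Read a file sequentially, then hint the kernel to evict its
    page cache so it stops being charged to the memory cgroup.

    Returns the number of bytes read.
    """
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            data = os.read(fd, chunk)
            if not data:
                break
            total += len(data)
        # POSIX_FADV_DONTNEED asks the kernel to drop cached pages for
        # this file (only clean pages are evicted immediately). The call
        # is Linux-specific, hence the hasattr guard.
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total
```

Calling this (or the raw `fadvise`) after each epoch's pass over the dataset keeps the cgroup's cache charge from growing across days of training.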
https://github.com/linchpiner/cgroup-memory-manager
Found a workaround.
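For reference, the general shape of such a watchdog is: watch the cgroup's usage and, when it nears the limit, force the kernel to reclaim cached pages. A minimal sketch using cgroup v1's `memory.force_empty` knob; this is my own simplification under assumptions, not necessarily the exact mechanism the linked tool uses:

```python
import os


def maybe_reclaim(cgroup_dir, high_watermark=0.9):
    """If the cgroup's usage exceeds high_watermark * limit, write to
    memory.force_empty, asking the kernel to reclaim as many pages
    (mostly clean page cache) as it can.

    Returns True if a reclaim was requested. A real watchdog would
    call this periodically for each training job's cgroup.
    """
    def read_int(name):
        with open(os.path.join(cgroup_dir, name)) as f:
            return int(f.read())

    usage = read_int("memory.usage_in_bytes")
    limit = read_int("memory.limit_in_bytes")
    if usage > high_watermark * limit:
        with open(os.path.join(cgroup_dir, "memory.force_empty"), "w") as f:
            f.write("0")
        return True
    return False
```

Note that `memory.force_empty` exists only in cgroup v1; on cgroup v2 the equivalent lever is writing to `memory.reclaim` (on kernels that support it).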