microsoft / pai

Resource scheduling and cluster management for AI

Home Page:https://openpai.readthedocs.io

Memory cgroup out of memory happens after training a job for a few days

HaoLiuHust opened this issue · comments

Organization Name:

Short summary about the issue/question:

When training a face recognition job on PAI using PyTorch DDP, after 3~4 days a process in the job gets killed. According to "dmesg -T", the cause is "Memory cgroup out of memory". But looking at the job metrics, memory usage did not exceed the limit; it seems file cache was also counted in the memory usage. Have you met a similar problem?
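One way to see this accounting directly (an illustrative sketch, not part of OpenPAI; the path /sys/fs/cgroup/memory/memory.stat is an assumption about the container's cgroup v1 setup) is to parse the cgroup's memory.stat, where both "rss" and "cache" are charged against the limit:

```python
# Sketch: parse cgroup v1 memory.stat text and compare rss vs cache.
# Inside a container you would read /sys/fs/cgroup/memory/memory.stat;
# here a sample string stands in for that file.

def parse_memory_stat(text):
    """Parse 'key value' lines from a cgroup v1 memory.stat into a dict."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value.strip().isdigit():
            stats[key] = int(value)
    return stats

SAMPLE = """cache 8589934592
rss 4294967296
mapped_file 1048576
"""

stats = parse_memory_stat(SAMPLE)
# Page cache ('cache') counts toward the cgroup limit alongside 'rss',
# so the OOM killer can fire even when rss alone is under the limit.
charged = stats["cache"] + stats["rss"]
print(charged)  # 12884901888 bytes = 12 GiB
```

In this sample, rss is only 4 GiB, but the charged total is 12 GiB because of 8 GiB of page cache, which matches the symptom described above.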

Brief what process you are following:

How to reproduce it:

OpenPAI Environment:

  • OpenPAI version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Hardware (e.g. core number, memory size, storage size, GPU type etc.):
  • Others:

Anything else we need to know:

What I encountered is that one of the processes in the container was killed by the OS because of OOM. However, the actual memory usage did not exceed the limit; there was some cache from reading the dataset. Did you encounter the same problem in your production env?

Yes, according to the doc, cache usage is counted within the container's cgroup, which can be restricted in size by the limit.
You can also refer to this question: https://serverfault.com/questions/516074/why-are-applications-in-a-memory-limited-lxc-container-writing-large-files-to-di/516088#516088?s=74b50a84893c4eb0a1532ea0d8532457 — hope it helps.
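One mitigation that follows from the linked answer (a sketch under assumptions, not something OpenPAI or PyTorch does for you): on Linux, a data-loading path can hint the kernel to evict a file's pages from the cache right after reading them, via posix_fadvise, so cache does not accumulate against the cgroup limit:

```python
import os
import tempfile

# Sketch: read a file, then advise the kernel that its cached pages are
# no longer needed (POSIX_FADV_DONTNEED) so they can be evicted instead
# of accumulating against the cgroup's memory limit.

def read_and_drop_cache(path):
    fd = os.open(path, os.O_RDONLY)
    try:
        chunks = []
        while True:
            chunk = os.read(fd, 1 << 20)  # read in 1 MiB chunks
            if not chunk:
                break
            chunks.append(chunk)
        data = b"".join(chunks)
        # Ask the kernel to drop this file's page-cache entries
        # (offset=0, length=0 covers the whole file).
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return data
    finally:
        os.close(fd)

# Demo on a throwaway file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 4096)
    tmp_path = f.name
data = read_and_drop_cache(tmp_path)
os.unlink(tmp_path)
print(len(data))  # 4096
```

The hint is advisory: dirty pages must be written back before they can be dropped, so it works best for read-mostly dataset files.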

Thanks. I have set /proc/sys/vm/dirty_background_ratio and some other kernel params, but OOM still occurred. What did you do in your production env to solve this?
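For context, the dirty-page knobs mentioned above are usually tuned via sysctl; the values below are illustrative only, not a verified fix for this issue:

```
# illustrative sysctl fragment, e.g. /etc/sysctl.d/99-writeback.conf
vm.dirty_background_ratio = 5    # start background writeback earlier
vm.dirty_ratio = 10              # throttle writers sooner to cap dirty pages
```

Note that these only limit dirty (unwritten) pages. Clean page cache is still charged to the cgroup until the kernel reclaims it, which may be why tuning them alone did not prevent the OOM here.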