microsoft / MaskFlownet

[CVPR 2020, Oral] MaskFlownet: Asymmetric Feature Matching with Learnable Occlusion Mask

Home Page: https://arxiv.org/abs/2003.10955


RAM leak when training, at the `batch_queue.get()` call. Where do you release resources once a batch is trained?

vivasvan1 opened this issue

I have noticed that your training loop leaks small amounts of RAM. Any idea what may be causing this?

time taken= 9.865329265594482 | steps= 1 | cpu= 51.8 | ram= 34.50078675328186 | gpu= [3101]
[5613]
time taken= 0.934636116027832 | steps= 2 | cpu= 27.0 | ram= 29.34866251942084 | gpu= [5613]
[3045]
time taken= 0.8695635795593262 | steps= 3 | cpu= 29.4 | ram= 29.217970957706278 | gpu= [3045]
[3021]
time taken= 0.8483304977416992 | steps= 4 | cpu= 29.8 | ram= 29.033316428574086 | gpu= [3021]
[2997]
time taken= 0.8630681037902832 | steps= 5 | cpu= 30.2 | ram= 28.87988403913803 | gpu= [2997]
[2997]
time taken= 0.8645083904266357 | steps= 6 | cpu= 29.4 | ram= 28.714746447210654 | gpu= [2997]
[2997]
time taken= 0.864253044128418 | steps= 7 | cpu= 29.3 | ram= 28.573093657739385 | gpu= [2997]
[2997]
time taken= 0.8693573474884033 | steps= 8 | cpu= 29.3 | ram= 28.389703885656044 | gpu= [2997]
[2997]
time taken= 0.8704898357391357 | steps= 9 | cpu= 29.4 | ram= 28.298690976454438 | gpu= [2997]
[2997]
time taken= 0.8670341968536377 | steps= 10 | cpu= 29.5 | ram= 28.13385097442091 | gpu= [2997]
[2997]
time taken= 0.8750414848327637 | steps= 11 | cpu= 29.5 | ram= 27.959884882309396 | gpu= [2997]
[2997]
time taken= 0.8624210357666016 | steps= 12 | cpu= 29.9 | ram= 27.784356443255188 | gpu= [2997]
[2997]
time taken= 0.8561670780181885 | steps= 13 | cpu= 29.8 | ram= 27.644241201568796 | gpu= [2997]
[2997]
time taken= 0.8609695434570312 | steps= 14 | cpu= 29.7 | ram= 27.51883186047002 | gpu= [2997]
[2997]
time taken= 0.8462607860565186 | steps= 15 | cpu= 29.7 | ram= 27.36641623650461 | gpu= [2997]
[2997]
time taken= 0.8624782562255859 | steps= 16 | cpu= 29.2 | ram= 27.23760941078441 | gpu= [2997]
[2997]
time taken= 0.8649694919586182 | steps= 17 | cpu= 29.4 | ram= 27.113514425050127 | gpu= [2997]
[2997]
time taken= 0.8661544322967529 | steps= 18 | cpu= 29.3 | ram= 27.004993310427178 | gpu= [2997]
[2997]
time taken= 0.8687705993652344 | steps= 19 | cpu= 29.8 | ram= 26.82090916192486 | gpu= [2997]
[2997]
time taken= 0.8823645114898682 | steps= 20 | cpu= 29.6 | ram= 26.688630454109777 | gpu= [2997]
[2997]
time taken= 0.8795809745788574 | steps= 21 | cpu= 29.4 | ram= 26.517987449146226 | gpu= [2997]
[2997]
time taken= 0.8857841491699219 | steps= 22 | cpu= 29.1 | ram= 26.40289455770082 | gpu= [2997]
[2997]
time taken= 0.8605339527130127 | steps= 23 | cpu= 29.5 | ram= 26.274509317663572 | gpu= [2997]
[2997]
time taken= 0.8524265289306641 | steps= 24 | cpu= 29.8 | ram= 26.16445065525575 | gpu= [2997]
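
(Note that the ram column is the percentage of memory still *available*, so its steady decrease of roughly 0.1-0.15 points per step is the leak.) For context, the numbers above were printed per step with something like the following; this is a reconstruction of my logging, not code from the repo:

```python
import psutil
from timeit import default_timer

for step in range(1, 25):
    t0 = default_timer()
    # ... train one batch here ...
    vm = psutil.virtual_memory()
    print("time taken=", default_timer() - t0,
          "| steps=", step,
          "| cpu=", psutil.cpu_percent(),
          "| ram=", vm.available * 100 / vm.total)  # percent of RAM still free
```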

Can you check whether this happens only on my PC, or with your code as well?

Also, is there any way to train in MXNet without loading the full dataset into memory?
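
For example, something like the following Gluon-style lazy dataset is what I have in mind; the class name, directory layout, and preprocessing here are made up, this is only a sketch:

```python
import os
import numpy as np
import mxnet as mx
from mxnet.gluon.data import Dataset, DataLoader

class LazyImageDataset(Dataset):
    """Reads one sample from disk per __getitem__ instead of
    preloading the whole dataset into RAM."""

    def __init__(self, image_dir):
        self.paths = sorted(os.path.join(image_dir, f)
                            for f in os.listdir(image_dir))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # the image is read and decoded only when this sample is requested
        return mx.image.imread(self.paths[idx]).astype(np.float32)

loader = DataLoader(LazyImageDataset("data/train"),
                    batch_size=8, shuffle=True, num_workers=4)
```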

Using pdb, I have found that after every run of

batch = batch_queue.get()

an extra 0.10-0.15% of RAM is consumed, and it never seems to be released.

(Pdb) print("| ram=",psutil.virtual_memory().available * 100 / psutil.virtual_memory().total)
 **ram= 28.66921519345203** 
(Pdb) n
> /home/mask/maskflownet/MaskFlownet/main.py(572)<module>()
-> loading_time.update(default_timer() - t0)
(Pdb) print("| ram=",psutil.virtual_memory().available * 100 / psutil.virtual_memory().total)
**ram= 28.542640291935687**
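
To narrow down where the unreleased memory is allocated, tracemalloc can be wrapped around the suspicious call; a minimal sketch, assuming `batch_queue` is the queue used in main.py:

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

batch = batch_queue.get()  # the call that appears to leak

after = tracemalloc.take_snapshot()
# print the source lines with the largest net growth in allocations;
# note that tracemalloc only sees Python-level allocations, not
# MXNet's native buffers
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)
```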

I cannot find why this is happening, but I am sure of it. Can you help me fix this, please?

Hi vivasvan1, thanks for pointing out this problem.

We import `Queue` from the Python `queue` package directly, without any modification. I searched on Google and found that other people have encountered the same problem, so maybe this is not a problem with our code but with the Python `queue` package.
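
As a possible workaround on the consumer side (untested, just a sketch), you could try making sure nothing keeps a reference to the batch alive across iterations, and force a collection after each step:

```python
import gc
import mxnet as mx

batch = batch_queue.get()
# ... forward / backward / optimizer step on batch ...
mx.nd.waitall()  # let async MXNet ops finish so the engine releases the batch
del batch        # drop the last Python reference to the arrays
gc.collect()     # break any leftover reference cycles
```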