microsoft / MaskFlownet

[CVPR 2020, Oral] MaskFlownet: Asymmetric Feature Matching with Learnable Occlusion Mask

Home Page: https://arxiv.org/abs/2003.10955


RAM leak when training, at the `batch_queue.get()` call. Where do you release resources once a batch is trained?

vivasvan1 opened this issue

I have noticed that your training loop leaks small amounts of RAM. Any idea what may be causing this?

time taken= 9.865329265594482 | steps= 1 | cpu= 51.8 | ram= 34.50078675328186 | gpu= [3101]
[5613]
time taken= 0.934636116027832 | steps= 2 | cpu= 27.0 | ram= 29.34866251942084 | gpu= [5613]
[3045]
time taken= 0.8695635795593262 | steps= 3 | cpu= 29.4 | ram= 29.217970957706278 | gpu= [3045]
[3021]
time taken= 0.8483304977416992 | steps= 4 | cpu= 29.8 | ram= 29.033316428574086 | gpu= [3021]
[2997]
time taken= 0.8630681037902832 | steps= 5 | cpu= 30.2 | ram= 28.87988403913803 | gpu= [2997]
[2997]
time taken= 0.8645083904266357 | steps= 6 | cpu= 29.4 | ram= 28.714746447210654 | gpu= [2997]
[2997]
time taken= 0.864253044128418 | steps= 7 | cpu= 29.3 | ram= 28.573093657739385 | gpu= [2997]
[2997]
time taken= 0.8693573474884033 | steps= 8 | cpu= 29.3 | ram= 28.389703885656044 | gpu= [2997]
[2997]
time taken= 0.8704898357391357 | steps= 9 | cpu= 29.4 | ram= 28.298690976454438 | gpu= [2997]
[2997]
time taken= 0.8670341968536377 | steps= 10 | cpu= 29.5 | ram= 28.13385097442091 | gpu= [2997]
[2997]
time taken= 0.8750414848327637 | steps= 11 | cpu= 29.5 | ram= 27.959884882309396 | gpu= [2997]
[2997]
time taken= 0.8624210357666016 | steps= 12 | cpu= 29.9 | ram= 27.784356443255188 | gpu= [2997]
[2997]
time taken= 0.8561670780181885 | steps= 13 | cpu= 29.8 | ram= 27.644241201568796 | gpu= [2997]
[2997]
time taken= 0.8609695434570312 | steps= 14 | cpu= 29.7 | ram= 27.51883186047002 | gpu= [2997]
[2997]
time taken= 0.8462607860565186 | steps= 15 | cpu= 29.7 | ram= 27.36641623650461 | gpu= [2997]
[2997]
time taken= 0.8624782562255859 | steps= 16 | cpu= 29.2 | ram= 27.23760941078441 | gpu= [2997]
[2997]
time taken= 0.8649694919586182 | steps= 17 | cpu= 29.4 | ram= 27.113514425050127 | gpu= [2997]
[2997]
time taken= 0.8661544322967529 | steps= 18 | cpu= 29.3 | ram= 27.004993310427178 | gpu= [2997]
[2997]
time taken= 0.8687705993652344 | steps= 19 | cpu= 29.8 | ram= 26.82090916192486 | gpu= [2997]
[2997]
time taken= 0.8823645114898682 | steps= 20 | cpu= 29.6 | ram= 26.688630454109777 | gpu= [2997]
[2997]
time taken= 0.8795809745788574 | steps= 21 | cpu= 29.4 | ram= 26.517987449146226 | gpu= [2997]
[2997]
time taken= 0.8857841491699219 | steps= 22 | cpu= 29.1 | ram= 26.40289455770082 | gpu= [2997]
[2997]
time taken= 0.8605339527130127 | steps= 23 | cpu= 29.5 | ram= 26.274509317663572 | gpu= [2997]
[2997]
time taken= 0.8524265289306641 | steps= 24 | cpu= 29.8 | ram= 26.16445065525575 | gpu= [2997]
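
(Note that the ram column is the percentage of memory still *available*, so its steady decrease of roughly 0.1-0.15 points per step is the leak.) For context, the numbers above were printed per step with something like the following; this is a reconstruction of my logging, not code from the repo:

```python
import psutil
from timeit import default_timer

for step in range(1, 25):
    t0 = default_timer()
    # ... train one batch here ...
    vm = psutil.virtual_memory()
    print("time taken=", default_timer() - t0,
          "| steps=", step,
          "| cpu=", psutil.cpu_percent(),
          "| ram=", vm.available * 100 / vm.total)  # percent of RAM still free
```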

Can you check whether this happens only on my PC, or with your code as well?

Also, is there any way to train in MXNet without loading the full dataset into memory?
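
For example, something like the following Gluon-style lazy dataset is what I have in mind; the class name, directory layout, and preprocessing here are made up, this is only a sketch:

```python
import os
import numpy as np
import mxnet as mx
from mxnet.gluon.data import Dataset, DataLoader

class LazyImageDataset(Dataset):
    """Reads one sample from disk per __getitem__ instead of
    preloading the whole dataset into RAM."""

    def __init__(self, image_dir):
        self.paths = sorted(os.path.join(image_dir, f)
                            for f in os.listdir(image_dir))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # the image is read and decoded only when this sample is requested
        return mx.image.imread(self.paths[idx]).astype(np.float32)

loader = DataLoader(LazyImageDataset("data/train"),
                    batch_size=8, shuffle=True, num_workers=4)
```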

Using pdb, I have found that after every run of

batch = batch_queue.get()

an extra 0.10-0.15% of RAM is consumed, and it never seems to be released.

(Pdb) print("| ram=",psutil.virtual_memory().available * 100 / psutil.virtual_memory().total)
 **ram= 28.66921519345203** 
(Pdb) n
> /home/mask/maskflownet/MaskFlownet/main.py(572)<module>()
-> loading_time.update(default_timer() - t0)
(Pdb) print("| ram=",psutil.virtual_memory().available * 100 / psutil.virtual_memory().total)
**ram= 28.542640291935687**
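
To narrow down where the unreleased memory is allocated, tracemalloc can be wrapped around the suspicious call; a minimal sketch, assuming `batch_queue` is the queue used in main.py:

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

batch = batch_queue.get()  # the call that appears to leak

after = tracemalloc.take_snapshot()
# print the source lines with the largest net growth in allocations;
# note that tracemalloc only sees Python-level allocations, not
# MXNet's native buffers
for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)
```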

I cannot find why this is happening, but I am sure of it. Can you help me fix this, please?

Hi vivasvan1, thanks for pointing out this problem.

We import `Queue` from the Python `queue` package directly, without any modification. I searched on Google and found that other people have encountered the same problem, so maybe this is not a problem with our code but with the Python `queue` package.
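
As a possible workaround on the consumer side (untested, just a sketch), you could try making sure nothing keeps a reference to the batch alive across iterations, and force a collection after each step:

```python
import gc
import mxnet as mx

batch = batch_queue.get()
# ... forward / backward / optimizer step on batch ...
mx.nd.waitall()  # let async MXNet ops finish so the engine releases the batch
del batch        # drop the last Python reference to the arrays
gc.collect()     # break any leftover reference cycles
```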