davisking / dlib

A toolkit for making real world machine learning and data analysis applications in C++

Home Page: http://dlib.net


cnn face detector example OOMs when processing images in series

sandeen opened this issue · comments

Expected Behavior

The cnn face detector example should process multiple images in a row without running out of memory.
I understand that some images may not fit into the GPU. However, if I am able to process images individually, I would like to be able to process them in a loop as well.
(tl;dr: can the example be modified to free GPU resources between images?)

Current Behavior

With the two provided test images, a.jpg and b.jpg (below), I can successfully process either one, individually.
I can also process b.jpg and a.jpg together, in that specific order.
If I try to process a.jpg and b.jpg in that order, I run out of GPU memory.
Is there a call to make in between cnn_face_detector() calls to free GPU memory, or something like that?

Steps to Reproduce

Modify the cnn face detector example to change upsampling from 1 to 0, and remove the window display and hit_enter_to_continue() calls.
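In outline, the modified example reduces to a loop like the sketch below (not the exact script; `count_faces` is a name invented here for illustration, while `dlib.cnn_face_detection_model_v1` and `dlib.load_rgb_image` in the trailing comment are the real dlib Python bindings):

```python
def count_faces(detector, images, upsample=0):
    """Run the detector over each (name, image) pair in turn,
    printing and returning the per-image detection counts.

    `detector` is any callable taking (image, upsample) and returning a
    sequence of detections, e.g. dlib.cnn_face_detection_model_v1(...).
    """
    counts = []
    for name, img in images:
        print("Processing file: {}".format(name))
        dets = detector(img, upsample)
        print("Number of faces detected: {}".format(len(dets)))
        counts.append(len(dets))
    return counts

# With a real CUDA-enabled dlib build and the model file from dlib.net:
#   import dlib, sys
#   cnn = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
#   count_faces(cnn, ((f, dlib.load_rgb_image(f)) for f in sys.argv[1:]))
```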
Run the cnn example on file a, file b, files (b & a), then files (a & b):

[root@host test-dlib]# ./dlib-cnn.py mmod_human_face_detector.dat a.jpg
Processing file: a.jpg
Number of faces detected: 0
[root@host test-dlib]# ./dlib-cnn.py mmod_human_face_detector.dat b.jpg
Processing file: b.jpg
Number of faces detected: 0
[root@host test-dlib]# ./dlib-cnn.py mmod_human_face_detector.dat b.jpg a.jpg
Processing file: b.jpg
Number of faces detected: 0
Processing file: a.jpg
Number of faces detected: 0
[root@host test-dlib]# ./dlib-cnn.py mmod_human_face_detector.dat a.jpg b.jpg
Processing file: a.jpg
Number of faces detected: 0
Processing file: b.jpg
Traceback (most recent call last):
  File "/root/test-dlib/./dlib-cnn.py", line 55, in <module>
    dets = cnn_face_detector(img, 0)
RuntimeError: Error while calling cudaMalloc(&data, n) in file /root/rpmbuild/BUILD/dlib-19.23/dlib/cuda/cuda_data_ptr.cpp:58. code: 2, reason: out of memory
  • Version: 19.23
  • Where did you get dlib: tarball from dlib.net, rebuilt locally into an RPM with Fedora spec file
  • Platform: Fedora release 33 (Thirty Three), x86_64
  • Compiler: gcc version 10.3.1 20210422 (Red Hat 10.3.1-1) (GCC)
  • CUDA: 11.4.2

GPU:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| 35%   27C    P0    N/A /  30W |      0MiB /  2000MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

test images:

a.jpg (attached image)
b.jpg (attached image)

my test script:

dlib-cnn.py.txt

Thank you for the reply. The thing that I'm wondering about is that the two images work just fine one at a time.
So to be clear - are you saying that b.jpg alone essentially exhausts the GPU memory when I do this?

# ./dlib-cnn.py mmod_human_face_detector.dat a.jpg b.jpg

b.jpg file size is only 170k, dimensions 1242x1243, and the GPU has 2G. I'm just trying to get a sense of what I should expect when processing multiple (thousands) of images this way, i.e. how big is too big. I didn't expect this one to be too big.

Thanks!

One more datapoint: when I process a.jpg, nvidia-smi shows GPU use topping out at about 1.5 GiB:

| 35%   34C    P0    N/A /  30W |   1587MiB /  2000MiB |    100%      Default |

When I process b.jpg, max consumption is indeed slightly more:

| 35%   33C    P0    N/A /  30W |   1711MiB /  2000MiB |     97%      Default |

I just expected that if b.jpg fit into the GPU, then processing a.jpg first would not affect subsequent processing of b.
shrug

Anyway, I'll try catching the out-of-memory exception, downsampling the image, and retrying. I won't need details on the face rectangles; I just want a yes/no on "is there a face", so it should be straightforward.
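That retry idea can be sketched roughly as follows (`has_face` and `shrink` are hypothetical names written here for illustration; the `"out of memory"` string match is based on the RuntimeError text in the traceback above):

```python
def has_face(detector, img, shrink, max_tries=3):
    """Yes/no face check that downsamples and retries when dlib raises
    the out-of-memory RuntimeError from cudaMalloc."""
    for _ in range(max_tries):
        try:
            return len(detector(img, 0)) > 0
        except RuntimeError as err:
            # Re-raise anything that isn't a GPU OOM.
            if "out of memory" not in str(err):
                raise
            # Halve and retry, e.g. shrink=lambda im: im[::2, ::2]
            # for a numpy image array.
            img = shrink(img)
    raise RuntimeError(
        "image still too big after {} downsamples".format(max_tries))
```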

I appreciate the info you've provided! And thanks for all your work on dlib.

It's the reallocations that are failing. Freeing and reallocating memory in CUDA can be like that: freeing memory doesn't always release it back immediately. I'm not sure whether there is some process-level memory pooling in the CUDA runtime, or whether the memory paging system on the GPU just isn't that good and it gets fragmented, or what.
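One workaround consistent with the b-then-a ordering succeeding in this thread: process images in decreasing size order, so the largest CUDA buffers are allocated up front and later, smaller images fit into memory that is already allocated or less fragmented. A sketch, with a hypothetical `largest_first` helper (assumes images with a numpy-style `.shape`, as `dlib.load_rgb_image` returns):

```python
def largest_first(paths, load):
    """Load each image and return (path, image) pairs ordered by pixel
    area, largest first, so the biggest GPU allocations happen before
    the CUDA memory pool can fragment."""
    imgs = [(p, load(p)) for p in paths]
    imgs.sort(key=lambda pair: pair[1].shape[0] * pair[1].shape[1],
              reverse=True)
    return imgs
```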