Cuda Problem happens within test_vilib

Question

DongDongXA opened this issue 4 years ago · comments

I compiled and ran this project on a Jetson AGX Xavier developer kit. When I ran the test called test_vilib, Image Pyramid, SubframePool, and PyramidPool all reported success, but the FAST detector showed no result and the test program paused there, so I decided to find out where it pauses.
I then found that the FAST_CPU detector is completely fine, and that the test pauses inside a member function called copyGridToHost, which belongs to DetectorBaseGPU, the parent class of FAST_GPU. Finally, I found that the test halts just after it successfully runs
CUDA_API_CALL(cudaMemcpyAsync(h_feature_grid_,
                              d_feature_grid_,
                              feature_grid_bytes_,
                              cudaMemcpyDeviceToHost,
                              stream_));
The test halts while executing CUDA_API_CALL(cudaStreamSynchronize(stream_)).
Since I am not very familiar with the CUDA API, I have not tried deleting this synchronization call; I just need to know how to fix this bug in this project.
I have read the paper; your results are very exciting, and I am really looking forward to seeing them on my own machine!
Thanks in advance :)
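
A blocking wait that never returns, as described above, can be turned into a visible stall by polling with a deadline instead. The following is a minimal C++ sketch of that idea only. The helper name wait_with_timeout and its parameters are hypothetical and are not part of vilib; the CUDA runtime call is mentioned only in a comment, so the block compiles without the CUDA toolkit.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not part of vilib): polls `is_done` until it returns
// true or until `timeout` has elapsed. Returns true if the work finished in
// time and false if it was still pending at the deadline.
//
// In CUDA host code the predicate could wrap the runtime call, for example:
//   [&] { return cudaStreamQuery(stream_) != cudaErrorNotReady; }
// cudaStreamQuery returns cudaErrorNotReady while work on the stream is
// still pending, so the predicate becomes true on completion or on an error.
inline bool wait_with_timeout(const std::function<bool()>& is_done,
                              std::chrono::milliseconds timeout,
                              std::chrono::milliseconds poll_interval =
                                  std::chrono::milliseconds(1)) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (!is_done()) {
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;  // still pending: report the stall instead of blocking
    }
    std::this_thread::sleep_for(poll_interval);
  }
  return true;
}
```

Used in place of an unconditional synchronize during debugging, a false return value would indicate that the stream never drained within the chosen timeout, which helps separate a stuck kernel from a failing API call. It does not fix the underlying problem.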

Balazs Nagy · Answer 1 · Mon Apr 13 2020 21:31:02 GMT+0800 (China Standard Time)

Hi, thanks for the feedback.

Just as a first verification, please run ls -1 test/images/euroc/images/752_480/ | wc -l from the visual_lib folder. You should get a number N, the number of images you extracted from the Euroc machine hall dataset. (3682 >)
Could you also provide the output of gcc --version and nvcc --version?
Could you recompile the library with a modification to the Makefile and tell us what you observe (apart from it being slower)? In https://github.com/uzh-rpg/rpg_cuda_thesis/blob/master/visual_lib/Makefile#L6, set RELEASE_MODE=0, then run make clean and make test -j4.

Thanks.

DongDong · Answer 2 · Mon Apr 13 2020 21:56:17 GMT+0800 (China Standard Time)

Hi, thank you for replying so quickly.

Since the Xavier's flash storage is so limited, I used V1_02_medium.bag instead, which is the smallest dataset among the Euroc datasets. I checked your file I/O interface and I do not think this affects the test program's performance. When I ran the ls command, it showed 1710.

gcc --version:
gcc (Ubuntu/Linaro 7.5.0-3ubuntu1~18.04) 7.5.0

nvcc --version:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sun_Sep_30_21:09:22_CDT_2018
Cuda compilation tools, release 10.0, V10.0.166

uname -a:
Linux nvidia-desktop 4.9.140-tegra #1 SMP PREEMPT Wed Mar 13 00:30:11 PDT 2019 aarch64 aarch64 aarch64 GNU/Linux
I forgot to mention: whether RELEASE_MODE is 0 or 1, the results are exactly the same. I also checked the CPU detector's result and it is OK.

Balazs Nagy · Answer 3 · Tue Apr 14 2020 02:45:53 GMT+0800 (China Standard Time)

Unfortunately, I couldn't reproduce the issue on our boards, but I had an idea that might solve your issue on the Xavier. Would you mind trying the following patch file, please?

Just unzip the zip file somewhere in the repository, then run git apply issue_04.patch, then make test.

issue_04.zip

Thank you in advance!

DongDong · Answer 4 · Tue Apr 14 2020 09:49:04 GMT+0800 (China Standard Time)

It is really weird: your code runs well on a Jetson TX2 with the same gcc and CUDA versions, even without this patch. I will verify the patch on the Xavier later, during your morning.

DongDong · Answer 5 · Tue Apr 14 2020 18:17:39 GMT+0800 (China Standard Time)

After applying this patch, it still does not work on the Xavier; it hangs at the same CUDA_API_CALL(cudaStreamSynchronize(stream_)) in copyGridToHost.
I am confident that the CUDA environment is fine, since I have deployed TensorRT on this Xavier machine.

Balazs Nagy · Answer 6 · Thu Apr 16 2020 05:11:06 GMT+0800 (China Standard Time)

Sorry for the slight delay on our side.
We've prepared a fix branch that ought to fix issues with 7.x devices. We could reproduce the issue today with a GPU of compute capability (CC) 7.5, but since our paper used 5.x and 6.x devices, we didn't catch this one.
We'll merge this fix branch (fix/volta_turing) into master, but we would be glad if you could confirm that the issue has been resolved on your side as well.
You may now ignore the previous patch.

DongDong · Answer 7 · Thu Apr 16 2020 21:07:20 GMT+0800 (China Standard Time)

I have tested the fix branch on the Xavier and it works well; thank you for your efforts. I also have some questions about this project's performance as reflected in the test results, and I am trying to find the answers in your paper and code. I will collect these questions over the next few days and open another issue. Thank you very much.