una-dinosauria / Rayuela.jl

Code for my PhD thesis. Library of quantization-based methods for fast similarity search in high dimensions. Presented at ECCV 18.

LSQ training got stuck

dryman opened this issue

Hi, we're trying to reproduce the ECCV'18 paper.

The trainer got stuck at this stage:

Running CUDA LSQ training... 
**********************************************************************************************
Training LSQ GPU with 7 codebooks, 4 perturbations, 4 icm iterations and random order = true
**********************************************************************************************
Doing fast bin codebook update... done in 0.129 seconds.
 -2 1.913506e+04 
Creating 100000 random states... done in 0.15 seconds
^^^ stuck on this stage for 3 hours ^^^^^^

We checked GPU utilization while it was stuck and it was at zero.
Is this expected?

Nope, that is pretty weird. Could you please post the command that you ran?
Also, if you interrupt with Ctrl+C while Julia is stuck there, what does the stack trace say?

My command was:
include("demos_train_query_base.jl")
Interrupting Julia with Ctrl+C didn't work. After killing it with kill, I got this:

in expression starting at /usr/local/google/home/fchern/.julia/environments/v0.7/dev/Rayuela/demos/demos_train_query_base.jl:170
clock_gettime at linux-vdso.so.1 (unknown line)
__clock_gettime at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7fd1d47110dd)
unknown function (ip: 0x7fd1d47c6996)
unknown function (ip: 0x7fd1d46f979b)
unknown function (ip: 0x7fd1d46e9947)
unknown function (ip: 0x7fd1d46ea73c)
unknown function (ip: 0x7fd1d45f78dd)
unknown function (ip: 0x7fd1d45f9297)
cuMemcpyHtoD_v2 at /usr/lib/x86_64-linux-gnu/libcuda.so (unknown line)
macro expansion at /usr/local/google/home/fchern/.julia/packages/CUDAdrv/LC5XS/src/base.jl:146 [inlined]
#upload!#10 at /usr/local/google/home/fchern/.julia/packages/CUDAdrv/LC5XS/src/memory.jl:230
upload! at /usr/local/google/home/fchern/.julia/packages/CUDAdrv/LC5XS/src/memory.jl:229 [inlined]
upload! at /usr/local/google/home/fchern/.julia/packages/CUDAdrv/LC5XS/src/memory.jl:229 [inlined]
unsafe_copyto! at /usr/local/google/home/fchern/.julia/packages/CuArrays/f4Eke/src/array.jl:76 [inlined]
copyto! at /usr/local/google/home/fchern/.julia/packages/GPUArrays/AkOwl/src/abstractarray.jl:116
convert at /usr/local/google/home/fchern/.julia/packages/CuArrays/f4Eke/src/array.jl:99 [inlined]
convert at /usr/local/google/home/fchern/.julia/packages/CuArrays/f4Eke/src/array.jl:105 [inlined]
Type at /usr/local/google/home/fchern/.julia/packages/GPUArrays/AkOwl/src/construction.jl:36 [inlined]
encode_icm_cuda_single at /usr/local/google/home/fchern/.julia/environments/v0.7/dev/Rayuela/src/LSQ_GPU.jl:72
encode_icm_cuda at /usr/local/google/home/fchern/.julia/environments/v0.7/dev/Rayuela/src/LSQ_GPU.jl:231
macro expansion at ./printf.jl:159 [inlined]
train_lsq_cuda at /usr/local/google/home/fchern/.julia/environments/v0.7/dev/Rayuela/src/LSQ_GPU.jl:297
experiment_lsq_cuda at /usr/local/google/home/fchern/.julia/environments/v0.7/dev/Rayuela/src/LSQ_GPU.jl:345
unknown function (ip: 0x7fd1dc83b100)
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1829
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2182
run_demos at /usr/local/google/home/fchern/.julia/environments/v0.7/dev/Rayuela/demos/demos_train_query_base.jl:72
top-level scope at /usr/local/google/home/fchern/.julia/environments/v0.7/dev/Rayuela/demos/demos_train_query_base.jl:171 [inlined]
top-level scope at ./none:0
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1829
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:825
jl_parse_eval_all at /buildworker/worker/package_linux64/build/src/ast.c:841
jl_load at /buildworker/worker/package_linux64/build/src/toplevel.c:865
include at ./boot.jl:317 [inlined]
include_relative at ./loading.jl:1038
include at ./sysimg.jl:29
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2182
include at ./client.jl:398
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2182
do_call at /buildworker/worker/package_linux64/build/src/interpreter.c:324
eval_value at /buildworker/worker/package_linux64/build/src/interpreter.c:428
eval_stmt_value at /buildworker/worker/package_linux64/build/src/interpreter.c:363 [inlined]
eval_body at /buildworker/worker/package_linux64/build/src/interpreter.c:686
jl_interpret_toplevel_thunk_callback at /buildworker/worker/package_linux64/build/src/interpreter.c:799
unknown function (ip: 0xfffffffffffffffe)
unknown function (ip: 0x7fd18828828f)
unknown function (ip: 0xffffffffffffffff)
jl_interpret_toplevel_thunk at /buildworker/worker/package_linux64/build/src/interpreter.c:808
jl_toplevel_eval_flex at /buildworker/worker/package_linux64/build/src/toplevel.c:831
jl_toplevel_eval_in at /buildworker/worker/package_linux64/build/src/builtins.c:633
eval at ./boot.jl:319
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2182
eval_user_input at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v0.7/REPL/src/REPL.jl:85
run_backend at /usr/local/google/home/fchern/.julia/packages/Revise/EuQoV/src/Revise.jl:771
#58 at ./task.jl:262
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1829
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2182
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1538 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:268
unknown function (ip: 0xffffffffffffffff)
unknown function (ip: 0xffffffffffffffff)
Allocations: 1415364916 (Pool: 1414649159; Big: 715757); GC: 20296
InterruptException
atexit hook threw an error: OutOfMemoryError()
signal (11): Segmentation fault
in expression starting at /usr/local/google/home/fchern/.julia/environments/v0.7/dev/Rayuela/demos/demos_train_query_base.jl:170
throw_internal at /buildworker/worker/package_linux64/build/src/task.c:563
jl_rethrow at /buildworker/worker/package_linux64/build/src/task.c:584
unknown function (ip: 0xffffffffffffffff)

Closing, since this seems to have been caused by the GPU running out of memory (OOM) and the host machine not killing Julia; xref #40 (comment)
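
For anyone hitting the same hang: the trace above stops inside a host-to-device copy, so one rough sanity check is to estimate how many bytes an array needs before it is uploaded, and to split it into column batches that fit a chosen budget. The sketch below is plain Julia and makes no GPU calls. The helper names (matrix_bytes, column_batches), the dimension of 128 and the 16 MiB budget are assumptions for illustration only; they are not part of Rayuela.jl or CuArrays, and the arrays that Rayuela actually uploads may be shaped differently.

```julia
# Hypothetical helpers for illustration; not part of Rayuela.jl or CuArrays.

# Bytes needed to hold a d-by-n matrix with element type T.
matrix_bytes(::Type{T}, d::Integer, n::Integer) where {T} = sizeof(T) * d * n

# Split columns 1:n into contiguous ranges such that each d-by-len chunk
# of element type T fits within `budget_bytes`.
function column_batches(::Type{T}, d::Integer, n::Integer, budget_bytes::Integer) where {T}
    n > 0 || return UnitRange{Int}[]
    per_column = sizeof(T) * d
    per_column <= budget_bytes || error("a single column does not fit in the budget")
    cols = min(n, budget_bytes ÷ per_column)   # columns per batch
    return [i:min(i + cols - 1, n) for i in 1:cols:n]
end

# 100000 vectors (the count from the log above); the dimension 128 is assumed.
total   = matrix_bytes(Float32, 128, 100_000)
batches = column_batches(Float32, 128, 100_000, 16 * 2^20)  # assumed 16 MiB budget

println(total, " bytes total, split into ", length(batches), " batches")
```

Comparing such an estimate against the free memory reported by nvidia-smi before starting a run can help tell an oversized upload apart from a genuine hang.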