OWL_ViT Original error: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 427182260224 bytes.
phamnhuvu-dev opened this issue
Phạm Như Vũ commented
I have encountered this issue when running the following command:
python -m scenic.projects.owl_vit.main \
--alsologtostderr=true \
--workdir=/tmp/training \
--config=scenic/projects/owl_vit/configs/clip_b32.py
Is there any way to resolve this issue?
Exception:
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/workspaces/owl_vit/scenic/projects/owl_vit/notebooks/scenic/projects/owl_vit/main.py", line 48, in <module>
app.run(main=main)
File "/workspaces/owl_vit/scenic/projects/owl_vit/notebooks/scenic/app.py", line 68, in run
app.run(functools.partial(_run_main, main=main))
File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/workspaces/owl_vit/scenic/projects/owl_vit/notebooks/scenic/app.py", line 104, in _run_main
main(rng=rng, config=FLAGS.config, workdir=FLAGS.workdir, writer=writer)
File "/workspaces/owl_vit/scenic/projects/owl_vit/notebooks/scenic/projects/owl_vit/main.py", line 38, in main
trainer.train(
File "/workspaces/owl_vit/scenic/projects/owl_vit/notebooks/scenic/projects/owl_vit/trainer.py", line 394, in train
train_state, t_metrics = train_step_pmapped(train_state, train_batch)
jaxlib.xla_extension.XlaRuntimeError: UNKNOWN: Failed to determine best cudnn convolution algorithm for:
%cudnn-conv.4 = (f32[3,196608,32,32]{3,2,1,0}, u8[0]{0}) custom-call(f32[3,256,768,768]{3,2,1,0} %transpose.4908, f32[196608,1,737,737]{3,2,1,0} %pad.55), window={size=737x737}, dim_labels=bf01_oi01->bf01, feature_group_count=256, custom_call_target="__cudnn$convForward", metadata={op_name="pmap(train_step)/jit(main)/vmap(transpose(jvp(TextZeroShotDetectionModule)))/TextZeroShotDetectionModule.image_embedder/backbone/clip/clip.encode_image/visual/conv1/conv_general_dilated[window_strides=(1, 1) padding=((0, 0), (0, 0)) lhs_dilation=(1, 1) rhs_dilation=(32, 32) dimension_numbers=ConvDimensionNumbers(lhs_spec=(3, 0, 1, 2), rhs_spec=(3, 0, 1, 2), out_spec=(2, 3, 0, 1)) feature_group_count=256 batch_group_count=1 precision=None preferred_element_type=None]" source_file="/usr/local/lib/python3.10/dist-packages/flax/linen/linear.py" source_line=541}, backend_config={"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}
Original error: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 427182260224 bytes.
To ignore this failure and try to use a fallback algorithm (which may have suboptimal performance), use XLA_FLAGS=--xla_gpu_strict_conv_algorithm_picker=false. Please also file a bug for the root cause of failing autotuning.
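For scale, the allocation that failed in the trace is far larger than any single GPU's memory, which points at a memory-requirement problem rather than a transient one. A quick conversion (plain Python, numbers taken from the error message):

```python
# Convert the failed allocation from the error message into GiB.
failed_alloc_bytes = 427_182_260_224
print(f"{failed_alloc_bytes / 2**30:.1f} GiB")  # roughly 398 GiB requested
```

No current GPU has that much VRAM, so the fallback-algorithm flag alone cannot fix this; the workload itself has to shrink (e.g. a smaller batch size, as in the follow-up comment).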
Phạm Như Vũ commented
I reduced the batch size to 2 (config.batch_size = 2) and set XLA_PYTHON_CLIENT_PREALLOCATE=false.
Training now uses ~19 GB of VRAM.
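For anyone hitting the same error, the pieces above can be combined into one relaunch. This is a sketch, not a verified recipe: the training command is the one from the issue, XLA_PYTHON_CLIENT_PREALLOCATE=false is the setting the commenter used to stop JAX from preallocating most of the GPU's memory up front, and the XLA_FLAGS value is the fallback the error message itself suggests (it may pick a slower convolution algorithm). The batch-size change still has to be made in the config file (config.batch_size = 2 in clip_b32.py, or a copy of it).

```shell
# Let JAX allocate GPU memory on demand instead of preallocating ~90% up front.
export XLA_PYTHON_CLIENT_PREALLOCATE=false

# Allow XLA to fall back to a (possibly suboptimal) cuDNN conv algorithm
# instead of failing autotuning outright, as the error message suggests.
export XLA_FLAGS="--xla_gpu_strict_conv_algorithm_picker=false"

# Relaunch training (command from the issue; batch size is reduced in the
# config file itself, e.g. config.batch_size = 2):
# python -m scenic.projects.owl_vit.main \
#   --alsologtostderr=true \
#   --workdir=/tmp/training \
#   --config=scenic/projects/owl_vit/configs/clip_b32.py
```

Both variables must be set before the Python process starts (or at least before JAX initializes its GPU backend), which is why they are exported in the launching shell rather than set inside the script.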