OWL_ViT Original error: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 427182260224 bytes.
phamnhuvu-dev opened this issue
Phạm Như Vũ commented
I have encountered this issue when running the following command:
python -m scenic.projects.owl_vit.main \
--alsologtostderr=true \
--workdir=/tmp/training \
--config=scenic/projects/owl_vit/configs/clip_b32.py
Is there any way to resolve this issue?
Exception:
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/workspaces/owl_vit/scenic/projects/owl_vit/notebooks/scenic/projects/owl_vit/main.py", line 48, in <module>
app.run(main=main)
File "/workspaces/owl_vit/scenic/projects/owl_vit/notebooks/scenic/app.py", line 68, in run
app.run(functools.partial(_run_main, main=main))
File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/workspaces/owl_vit/scenic/projects/owl_vit/notebooks/scenic/app.py", line 104, in _run_main
main(rng=rng, config=FLAGS.config, workdir=FLAGS.workdir, writer=writer)
File "/workspaces/owl_vit/scenic/projects/owl_vit/notebooks/scenic/projects/owl_vit/main.py", line 38, in main
trainer.train(
File "/workspaces/owl_vit/scenic/projects/owl_vit/notebooks/scenic/projects/owl_vit/trainer.py", line 394, in train
train_state, t_metrics = train_step_pmapped(train_state, train_batch)
jaxlib.xla_extension.XlaRuntimeError: UNKNOWN: Failed to determine best cudnn convolution algorithm for:
%cudnn-conv.4 = (f32[3,196608,32,32]{3,2,1,0}, u8[0]{0}) custom-call(f32[3,256,768,768]{3,2,1,0} %transpose.4908, f32[196608,1,737,737]{3,2,1,0} %pad.55), window={size=737x737}, dim_labels=bf01_oi01->bf01, feature_group_count=256, custom_call_target="__cudnn$convForward", metadata={op_name="pmap(train_step)/jit(main)/vmap(transpose(jvp(TextZeroShotDetectionModule)))/TextZeroShotDetectionModule.image_embedder/backbone/clip/clip.encode_image/visual/conv1/conv_general_dilated[window_strides=(1, 1) padding=((0, 0), (0, 0)) lhs_dilation=(1, 1) rhs_dilation=(32, 32) dimension_numbers=ConvDimensionNumbers(lhs_spec=(3, 0, 1, 2), rhs_spec=(3, 0, 1, 2), out_spec=(2, 3, 0, 1)) feature_group_count=256 batch_group_count=1 precision=None preferred_element_type=None]" source_file="/usr/local/lib/python3.10/dist-packages/flax/linen/linear.py" source_line=541}, backend_config={"conv_result_scale":1,"activation_mode":"kNone","side_input_scale":0,"leakyrelu_alpha":0}
Original error: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 427182260224 bytes.
To ignore this failure and try to use a fallback algorithm (which may have suboptimal performance), use XLA_FLAGS=--xla_gpu_strict_conv_algorithm_picker=false. Please also file a bug for the root cause of failing autotuning.
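For scale, the allocation that failed in the trace is far larger than any single GPU's memory, which points at a memory-requirement problem rather than a transient one. A quick conversion (plain Python, numbers taken from the error message):

```python
# Convert the failed allocation from the error message into GiB.
failed_alloc_bytes = 427_182_260_224
print(f"{failed_alloc_bytes / 2**30:.1f} GiB")  # roughly 398 GiB requested
```

No current GPU has that much VRAM, so the fallback-algorithm flag alone cannot fix this; the workload itself has to shrink (e.g. a smaller batch size, as in the follow-up comment).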
Phạm Như Vũ commented
I reduced the batch size to 2 (config.batch_size = 2) and set XLA_PYTHON_CLIENT_PREALLOCATE=false.
Training now uses ~19 GB of VRAM.
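For anyone hitting the same error, the pieces above can be combined into one relaunch. This is a sketch, not a verified recipe: the training command is the one from the issue, XLA_PYTHON_CLIENT_PREALLOCATE=false is the setting the commenter used to stop JAX from preallocating most of the GPU's memory up front, and the XLA_FLAGS value is the fallback the error message itself suggests (it may pick a slower convolution algorithm). The batch-size change still has to be made in the config file (config.batch_size = 2 in clip_b32.py, or a copy of it).

```shell
# Let JAX allocate GPU memory on demand instead of preallocating ~90% up front.
export XLA_PYTHON_CLIENT_PREALLOCATE=false

# Allow XLA to fall back to a (possibly suboptimal) cuDNN conv algorithm
# instead of failing autotuning outright, as the error message suggests.
export XLA_FLAGS="--xla_gpu_strict_conv_algorithm_picker=false"

# Relaunch training (command from the issue; batch size is reduced in the
# config file itself, e.g. config.batch_size = 2):
# python -m scenic.projects.owl_vit.main \
#   --alsologtostderr=true \
#   --workdir=/tmp/training \
#   --config=scenic/projects/owl_vit/configs/clip_b32.py
```

Both variables must be set before the Python process starts (or at least before JAX initializes its GPU backend), which is why they are exported in the launching shell rather than set inside the script.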