microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Home Page: https://aka.ms/GeneralAI


LayoutLMv3 document layout train error

yuumiin opened this issue

Describe
Model I am using: LayoutLMv3

After running object_detection/train_net.py, training fails with a CUDA error:
RuntimeError: CUDA out of memory

After adjusting IMS_PER_BATCH in cascade_layoutlmv3.yaml, the error changed to:
TypeError: save_for_backward can only save variables, but argument 6 is of type bool

How can I solve this?

It is hard to locate the cause of these errors and debug without the full stack traces.
Have you set IMS_PER_BATCH to a multiple of the number of GPUs you are using? For example, if you are using 8 GPUs, you could set IMS_PER_BATCH to 8, 16, etc.
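
A hedged sketch of overriding that value in code rather than editing the YAML (the keys follow detectron2 conventions; the config path is a placeholder, and LayoutLMv3's YAML adds custom keys beyond the detectron2 defaults, so extra setup may be needed before the merge):

from detectron2.config import get_cfg

cfg = get_cfg()
# Illustrative only: LayoutLMv3's config defines custom keys that a bare
# default cfg does not know about, so this merge may need those keys
# registered first.
cfg.merge_from_file("cascade_layoutlmv3.yaml")
cfg.SOLVER.IMS_PER_BATCH = 16  # total batch across GPUs, e.g. 2 images/GPU on 8 GPUs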

@HYPJUDY
[image: screenshot of the error]
I also encountered this problem. How can I solve it?

Same here, interesting.
I am using PyTorch 1.8.0 and I suspect there are issues with this version.
This problem happens inside PyTorch's gradient checkpointing function.
In PyTorch 1.8.0, the checkpointing function saves every positional input for backward, and save_for_backward only accepts torch.Tensors. This is problematic when the checkpointed function (layer_module.forward in this case) takes non-tensor inputs such as bool flags. Below is the source code of CheckpointFunction in PyTorch 1.8.0:

class CheckpointFunction(torch.autograd.Function):

    @staticmethod
    def forward(ctx, run_function, preserve_rng_state, *args):
        check_backward_validity(args)
        ctx.run_function = run_function
        ctx.preserve_rng_state = preserve_rng_state
        ctx.had_autocast_in_fwd = torch.is_autocast_enabled()
        if preserve_rng_state:
            ctx.fwd_cpu_state = torch.get_rng_state()
            # Don't eagerly initialize the cuda context by accident.
            # (If the user intends that the context is initialized later, within their
            # run_function, we SHOULD actually stash the cuda state here.  Unfortunately,
            # we have no way to anticipate this will happen before we run the function.)
            ctx.had_cuda_in_fwd = False
            if torch.cuda._initialized:
                ctx.had_cuda_in_fwd = True
                ctx.fwd_gpu_devices, ctx.fwd_gpu_states = get_device_states(*args)
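        # The reported TypeError originates at the next line: in 1.8.0 *all*
        # positional args are stashed via save_for_backward, which rejects
        # non-tensor values such as bools.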
        ctx.save_for_backward(*args)
        with torch.no_grad():
            outputs = run_function(*args)
        return outputs

However, this issue is fixed in PyTorch 1.10.0, as also explained in this blog: the newer CheckpointFunction only passes tensor arguments to save_for_backward and stashes non-tensor arguments on ctx directly. The difference lies in L27-L41 of that source.
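
A paraphrased sketch of the >= 1.10 forward pass (not the exact upstream code; the backward method is omitted for brevity):

import torch

class PatchedCheckpointFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, run_function, *args):
        ctx.run_function = run_function
        ctx.inputs = []
        ctx.tensor_indices = []
        tensor_inputs = []
        for i, arg in enumerate(args):
            if torch.is_tensor(arg):
                tensor_inputs.append(arg)
                ctx.tensor_indices.append(i)
                ctx.inputs.append(None)  # placeholder, restored during backward
            else:
                ctx.inputs.append(arg)   # bool flags etc. bypass save_for_backward
        ctx.save_for_backward(*tensor_inputs)
        with torch.no_grad():
            outputs = run_function(*args)
        return outputs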

So updating your PyTorch version (to 1.10.0 or later) should fix the issue.
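
If you must stay on PyTorch 1.8.0, a hedged workaround is to capture the non-tensor arguments in a closure so that checkpoint() only ever sees tensors (layer_module, hidden_states, attention_mask, and output_attentions below are illustrative names, not the project's actual API):

import torch
from torch.utils.checkpoint import checkpoint

def run_layer_checkpointed(layer_module, hidden_states, attention_mask, output_attentions):
    def custom_forward(hidden, mask):
        # the bool flag is baked into the closure instead of being passed
        # through checkpoint(), so save_for_backward only receives tensors
        return layer_module(hidden, mask, output_attentions=output_attentions)
    return checkpoint(custom_forward, hidden_states, attention_mask)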

For the original CUDA out-of-memory error, capping the allocator's split size can reduce memory fragmentation:

Windows: set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512

Linux: export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
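
The same setting can also be applied from Python, provided it is set before the first CUDA allocation in the process:

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

import torch  # import torch (and make any CUDA calls) only after the variable is set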