microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Home Page:https://aka.ms/GeneralAI

LayoutLMv3 LayoutLMv3ForSequenceClassification not working

cwei-bgl opened this issue · comments

Hi,

Thanks for releasing and sharing LayoutLMv3.

I am trying out LayoutLMv3. LayoutLMv3ForTokenClassification works as expected, but LayoutLMv3ForSequenceClassification does not train. I noticed a difference between the v2 and v3 implementations of the sequence classification head, shown below.

v2: three pieces of information are fed into the classifier layer.

sequence_output = torch.cat(
    [cls_final_output, pooled_initial_image_embeddings, pooled_final_image_embeddings],
    dim=1,
)

v3: only the first CLS token is fed into the classifier layer. The image CLS token (the 513th token) is not taken advantage of either.

sequence_output = outputs[0][:, 0, :]
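For context, here is a minimal, self-contained sketch contrasting the two heads (the shapes and layer names are illustrative; the actual implementations wrap the pooled features in richer classification heads):

import torch
import torch.nn as nn

batch, seq_len, hidden_size, num_labels = 2, 512, 768, 16
sequence_output = torch.randn(batch, seq_len, hidden_size)  # stand-in for outputs[0]
pooled_initial_image_embeddings = torch.randn(batch, hidden_size)
pooled_final_image_embeddings = torch.randn(batch, hidden_size)

# v2-style head: text CLS concatenated with the two pooled image features,
# so the classifier input is 3 * hidden_size wide.
classifier_v2 = nn.Linear(hidden_size * 3, num_labels)
logits_v2 = classifier_v2(torch.cat(
    [sequence_output[:, 0, :], pooled_initial_image_embeddings, pooled_final_image_embeddings],
    dim=1,
))

# v3-style head: only the first (text) CLS token, following RoBERTa.
classifier_v3 = nn.Linear(hidden_size, num_labels)
logits_v3 = classifier_v3(sequence_output[:, 0, :])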

According to the paper, it seems that LayoutLMv3ForSequenceClassification was used and evaluated on the RVL-CDIP dataset for classification. Could you please confirm whether LayoutLMv3ForSequenceClassification was tested, or whether something is missing in the implementation?

Thanks, Cheng

Yes, we use the first CLS token for sequence classification, following RoBERTa.
We have also tried LayoutLMv2's method of classification with the image CLS token, but we didn't observe an improvement, so we just used the simplest method.
Referring to ViT, we kept the image CLS token, but it is not used in LayoutLMv3.
In our experiments, LayoutLMv3ForSequenceClassification is used for the document image classification task on RVL-CDIP.

Thanks @HYPJUDY for the prompt reply and confirmation. That is great. I will check whether there is any step I did wrong.

Hi @HYPJUDY , I have just done some experiments. With a simple dataset, LayoutLMv3ForSequenceClassification was able to train, but the training performance and the final result were not as good as LayoutLMForSequenceClassification's, and it was much slower to train. With a slightly more complicated dataset, LayoutLMv3ForSequenceClassification was unable to train at all, whereas LayoutLMForSequenceClassification trained well. Do you have any suggestions?

By the way, is there a chance the pre-training objective source code can be shared? I am trying to implement pre-training (so that the pre-trained model is better suited to my domain). I am currently reading the referenced papers and trying to write the objectives myself. If the source code could be shared, that would be wonderful.
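What I have so far is roughly the following sketch of the masked language modeling part (simple random masking at the paper's 30% ratio rather than span masking, and no handling of special tokens; the function and helper names are my own):

import torch
import torch.nn.functional as F

def mlm_loss(input_ids, mask_token_id, forward_fn, mask_ratio=0.3):
    # Mask a random subset of tokens and train the model to recover them.
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mask_ratio
    labels[~masked] = -100                 # ignore unmasked positions in the loss
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id
    logits = forward_fn(corrupted)         # (batch, seq_len, vocab_size)
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )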

Thanks, Cheng

Hi @cwei-bgl.
Did you use images when using LayoutLMForSequenceClassification? Without images, LayoutLM may be faster than LayoutLMv3.
When you say "unable to train", do you mean that the losses diverge or is there something else going on? You may need to tune parameters (e.g., learning rate, batch size) for different datasets.
Regarding the pre-training code, please see #733.

Thanks for the prompt reply, @HYPJUDY. You are right. When I ran LayoutLMForSequenceClassification, I fed it without images. For LayoutLMv3ForSequenceClassification, I tried both with and without images. LayoutLM is faster, as it does not have visual embeddings. I compared their performance epoch by epoch.

The loss did not diverge, but it did not improve either. The predicted labels were constant, e.g., always [8, 8, 8, 8, 8, ...]. I also trained with autograd.detect_anomaly() and nn.utils.clip_grad_norm_(.., error_if_nonfinite=True) enabled, so I would have noticed if a gradient were NaN or infinite; everything was fine. For microsoft/layoutlmv3-large, I did observe infinite gradients with fp16 enabled.
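For reference, the checks I ran look roughly like this (the linear model below is just a stand-in for my fine-tuning setup):

import torch
from torch import autograd, nn

model = nn.Linear(10, 2)  # placeholder for LayoutLMv3ForSequenceClassification
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()
inputs, targets = torch.randn(4, 10), torch.randint(0, 2, (4,))

with autograd.detect_anomaly():  # raises if backward produces NaN
    loss = loss_fn(model(inputs), targets)
    loss.backward()

# Raises instead of silently clipping if any gradient is NaN/inf.
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0, error_if_nonfinite=True)
optimizer.step()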

Thanks, I will try different learning rates, but the batch size depends on the machine, so there is not much room for me to tune.

Good luck with your experiments! BTW, the gradient accumulation mechanism can help expand the effective batch size, e.g., along the lines of the sketch below.
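A minimal sketch of what I mean (the model and data here are placeholders; substitute your LayoutLMv3 setup):

import torch
from torch import nn

model = nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()
loader = [(torch.randn(2, 10), torch.randint(0, 2, (2,))) for _ in range(64)]

accum_steps = 32  # effective batch size = micro-batch size * accum_steps
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = loss_fn(model(inputs), targets) / accum_steps  # average across micro-batches
    loss.backward()                                       # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()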

Thanks @HYPJUDY for the pointer. You are right. I can expand the batch size with gradient accumulation.

Hi @HYPJUDY , just letting you know that expanding the batch size does help. LayoutLMv3ForSequenceClassification was able to train on the slightly more complicated dataset. At the moment, its performance is still not better than LayoutLMForSequenceClassification's.

Strange that batch size should be such an important factor; any insight as to why?

I have the same issue: the loss always goes to NaN... I tested it with the HF class, and I am getting the feeling that there is either a massive bug in there or something misconfigured.

Without gradient accumulation, I could not get it to work. With batch size 2 and gradient accumulation (32 steps), it finally started to converge. The behaviour of batch-norm layers might offer some explanation as to why: https://discuss.pytorch.org/t/does-number-of-gradient-accumulation-steps-affect-models-performance/85859/2

I found that a warmup ratio of 0.1 and as large a batch size as I could fit in memory did the trick (no gradient accumulation).
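In Hugging Face TrainingArguments terms, the two workarounds in this thread map to something like the following (the output path and remaining values are illustrative, not a confirmed recipe):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="layoutlmv3-seq-cls",   # hypothetical path
    per_device_train_batch_size=2,     # small micro-batch that fits in memory
    gradient_accumulation_steps=32,    # effective batch size 64, as above
    warmup_ratio=0.1,                  # warm up over the first 10% of steps
    learning_rate=2e-5,                # illustrative; tune per dataset
    fp16=False,                        # fp16 produced infinite grads for -large above
)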