microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Home Page:https://aka.ms/GeneralAI

LayoutLMv3 LayoutLMv3ForSequenceClassification not working

cwei-bgl opened this issue · comments

Hi,

Thanks for releasing and sharing LayoutLMv3.

I am trying out LayoutLMv3. LayoutLMv3ForTokenClassification works as expected, but LayoutLMv3ForSequenceClassification does not train. I noticed a difference between the v2 and v3 implementations of the sequence classification head, shown below.

v2: three pieces of information are fed into the classifier layer.

sequence_output = torch.cat(
    [cls_final_output, pooled_initial_image_embeddings, pooled_final_image_embeddings],
    dim=1,
)

v3: only the first CLS token is fed into the classifier layer. The image CLS token (the 513th token) is not taken advantage of either.

sequence_output = outputs[0][:, 0, :]
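For context, here is a minimal, self-contained sketch contrasting the two heads (the shapes and layer names are illustrative; the actual implementations wrap the pooled features in richer classification heads):

import torch
import torch.nn as nn

batch, seq_len, hidden_size, num_labels = 2, 512, 768, 16
sequence_output = torch.randn(batch, seq_len, hidden_size)  # stand-in for outputs[0]
pooled_initial_image_embeddings = torch.randn(batch, hidden_size)
pooled_final_image_embeddings = torch.randn(batch, hidden_size)

# v2-style head: text CLS concatenated with the two pooled image features,
# so the classifier input is 3 * hidden_size wide.
classifier_v2 = nn.Linear(hidden_size * 3, num_labels)
logits_v2 = classifier_v2(torch.cat(
    [sequence_output[:, 0, :], pooled_initial_image_embeddings, pooled_final_image_embeddings],
    dim=1,
))

# v3-style head: only the first (text) CLS token, following RoBERTa.
classifier_v3 = nn.Linear(hidden_size, num_labels)
logits_v3 = classifier_v3(sequence_output[:, 0, :])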

According to the paper, it seems that LayoutLMv3ForSequenceClassification was used and evaluated on the RVL-CDIP dataset for classification. Could you please confirm whether LayoutLMv3ForSequenceClassification was tested, or whether something is missing in the implementation?

Thanks, Cheng

Yes, we use the first CLS token for sequence classification, following RoBERTa.
We have also tried LayoutLMv2's method of classification with the image CLS token, but we didn't observe an improvement, so we just used the simplest method.
Referring to ViT, we kept the image CLS token, but it is not used in LayoutLMv3.
In our experiments, LayoutLMv3ForSequenceClassification is used for the document image classification task on RVL-CDIP.

Thanks @HYPJUDY for the prompt reply and confirmation. That is great. I will check whether there is any step I did wrong.

Hi @HYPJUDY , I have just done some experiments. With a simple dataset, LayoutLMv3ForSequenceClassification was able to train, but the training performance and the final result were not as good as LayoutLMForSequenceClassification's, and it was much slower to train. With a slightly more complicated dataset, LayoutLMv3ForSequenceClassification was unable to train at all, whereas LayoutLMForSequenceClassification trained well. Do you have any suggestions?

By the way, is there a chance the pre-training objective source code can be shared? I am trying to implement pre-training (so that the pre-trained model is better suited to my domain). I am currently reading the referenced papers and trying to write the objectives myself. If the source code could be shared, that would be wonderful.
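What I have so far is roughly the following sketch of the masked language modeling part (simple random masking at the paper's 30% ratio rather than span masking, and no handling of special tokens; the function and helper names are my own):

import torch
import torch.nn.functional as F

def mlm_loss(input_ids, mask_token_id, forward_fn, mask_ratio=0.3):
    # Mask a random subset of tokens and train the model to recover them.
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mask_ratio
    labels[~masked] = -100                 # ignore unmasked positions in the loss
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id
    logits = forward_fn(corrupted)         # (batch, seq_len, vocab_size)
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )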

Thanks, Cheng

Hi @cwei-bgl.
Did you use images when using LayoutLMForSequenceClassification? Without images, LayoutLM may be faster than LayoutLMv3.
When you say "unable to train", do you mean that the losses diverge or is there something else going on? You may need to tune parameters (e.g., learning rate, batch size) for different datasets.
Regarding the pre-training code, please see #733.

Thanks for the prompt reply, @HYPJUDY. You are right. When I ran LayoutLMForSequenceClassification, I fed it without images. For LayoutLMv3ForSequenceClassification, I tried both with and without images. LayoutLM is faster, as it does not have visual embeddings. I compared their performance epoch by epoch.

The loss did not diverge, but it did not improve either. The predicted labels were constant, e.g., always [8, 8, 8, 8, 8, ...]. I also trained with autograd.detect_anomaly() and nn.utils.clip_grad_norm_(.., error_if_nonfinite=True) enabled, so I would have noticed if a gradient were NaN or infinite; everything was fine. For microsoft/layoutlmv3-large, I did observe infinite gradients with fp16 enabled.
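For reference, the checks I ran look roughly like this (the linear model below is just a stand-in for my fine-tuning setup):

import torch
from torch import autograd, nn

model = nn.Linear(10, 2)  # placeholder for LayoutLMv3ForSequenceClassification
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()
inputs, targets = torch.randn(4, 10), torch.randint(0, 2, (4,))

with autograd.detect_anomaly():  # raises if backward produces NaN
    loss = loss_fn(model(inputs), targets)
    loss.backward()

# Raises instead of silently clipping if any gradient is NaN/inf.
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0, error_if_nonfinite=True)
optimizer.step()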

Thanks, I will try different learning rates, but the batch size depends on the machine, so there is not much room for me to tune.

Good luck with your experiments! BTW, the gradient accumulation mechanism can help expand the effective batch size, e.g., along the lines of the sketch below.
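A minimal sketch of what I mean (the model and data here are placeholders; substitute your LayoutLMv3 setup):

import torch
from torch import nn

model = nn.Linear(10, 2)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()
loader = [(torch.randn(2, 10), torch.randint(0, 2, (2,))) for _ in range(64)]

accum_steps = 32  # effective batch size = micro-batch size * accum_steps
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = loss_fn(model(inputs), targets) / accum_steps  # average across micro-batches
    loss.backward()                                       # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()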

Thanks @HYPJUDY for the pointer. You are right. I can expand the batch size with gradient accumulation.

Hi @HYPJUDY , just letting you know that expanding the batch size does help. LayoutLMv3ForSequenceClassification was able to train on the slightly more complicated dataset. At the moment, its performance is still not better than LayoutLMForSequenceClassification's.

Strange that batch size should be such an important factor; any insight as to why?

I have the same issue: the loss always goes to NaN... I tested it with the HF class, and I am getting the feeling that there is either a massive bug in there or something misconfigured.

Without gradient accumulation, I could not get it to work. With batch size 2 and gradient accumulation (32 steps), it finally started to converge. The behaviour of batch-norm layers might offer some explanation as to why: https://discuss.pytorch.org/t/does-number-of-gradient-accumulation-steps-affect-models-performance/85859/2

I found that a warmup ratio of 0.1 and as large a batch size as I could fit in memory did the trick (no gradient accumulation).
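In Hugging Face TrainingArguments terms, the two workarounds in this thread map to something like the following (the output path and remaining values are illustrative, not a confirmed recipe):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="layoutlmv3-seq-cls",   # hypothetical path
    per_device_train_batch_size=2,     # small micro-batch that fits in memory
    gradient_accumulation_steps=32,    # effective batch size 64, as above
    warmup_ratio=0.1,                  # warm up over the first 10% of steps
    learning_rate=2e-5,                # illustrative; tune per dataset
    fp16=False,                        # fp16 produced infinite grads for -large above
)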