microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Home Page: https://aka.ms/GeneralAI


Is LMv2 using resnext101 or just res101?

Senwang98 opened this issue

Describe
Model I am using (LayoutLMv2):

Hi, I'm doing some work based on LayoutLMv2, but the F1 score I get on FUNSD is lower than the one reported in the paper.
So I want to know which visual backbone the open-source code is using now.
It seems that the provided detectron2 config uses ResNet-101 rather than ResNeXt-101, which would explain the lower F1 score.
I am not sure whether my reading is right.

BTW, I'd like to know: if I train on the CORD dataset with BERT-base for more epochs, can the BERT-base F1 score end up higher than BERT-Large's?
(When I reproduced CORD on LayoutLMv2, I trained for 10 epochs rather than 5, and found my BERT-base test F1 score = 96.3.)

Dod-o commented:

Hi @Senwang98, we use ResNeXt-101 as the visual backbone. Loading the weights directly from the pre-trained model is fine; just make sure the parameters are loaded correctly.
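
(For anyone double-checking this: below is a minimal sketch of how to confirm the backbone and verify weight loading, assuming the HuggingFace `transformers` port of LayoutLMv2. The `detectron2_config_args` attribute and key names come from that implementation, not from this thread; adapt if you are on the original unilm code.)

```python
# Minimal sketch: confirm the visual backbone of a LayoutLMv2 checkpoint.
# Assumes the HuggingFace `transformers` port; requires torch + detectron2
# installed to instantiate the model (config inspection alone does not).
from transformers import LayoutLMv2Config, LayoutLMv2Model

config = LayoutLMv2Config.from_pretrained("microsoft/layoutlmv2-base-uncased")

# NUM_GROUPS > 1 (e.g. 32) together with WIDTH_PER_GROUP (e.g. 8) indicates
# ResNeXt-101 32x8d; NUM_GROUPS == 1 would be a plain ResNet-101.
for key in ("MODEL.RESNETS.DEPTH",
            "MODEL.RESNETS.NUM_GROUPS",
            "MODEL.RESNETS.WIDTH_PER_GROUP"):
    print(key, config.detectron2_config_args.get(key))

# output_loading_info=True reports missing/unexpected keys, which is an easy
# way to verify the backbone parameters were actually loaded.
model, info = LayoutLMv2Model.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", output_loading_info=True
)
print("missing:", info["missing_keys"])
print("unexpected:", info["unexpected_keys"])
```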

For the second question, I think the first step is to check the evaluation and data-processing code. If there is nothing wrong there, try to confirm whether this phenomenon can be reproduced with other models (such as BERT).
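
One concrete way to audit the evaluation code is to feed it a tiny hand-made batch whose correct score you know in advance. Here is a minimal, self-contained sketch in the style of run_funsd.py's post-processing; the label names and logits are invented, and the -100 masking convention is the standard HuggingFace one:

```python
# Minimal sketch: audit run_funsd.py-style evaluation on a tiny hand-made
# batch. -100 marks positions (special tokens / subword continuations)
# that must be excluded before scoring.
import numpy as np
from seqeval.metrics import classification_report, f1_score

label_list = ["O", "B-HEADER", "I-HEADER", "B-ANSWER", "I-ANSWER"]

# Fake logits for a 1-sentence batch of 6 token positions.
predictions = np.array([[[0.9, 0, 0, 0, 0],    # -> O
                         [0, 0.9, 0, 0, 0],    # -> B-HEADER
                         [0, 0, 0.9, 0, 0],    # -> I-HEADER
                         [0.9, 0, 0, 0, 0],    # -> O
                         [0, 0, 0, 0.9, 0],    # -> B-ANSWER
                         [0.9, 0, 0, 0, 0]]])  # ignored (-100 below)
labels = np.array([[0, 1, 2, 0, 3, -100]])

preds = np.argmax(predictions, axis=2)

# The filtering step worth double-checking: drop every -100 position.
true_predictions = [
    [label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
    for pred_row, label_row in zip(preds, labels)
]
true_labels = [
    [label_list[l] for p, l in zip(pred_row, label_row) if l != -100]
    for pred_row, label_row in zip(preds, labels)
]

print(classification_report(true_labels, true_predictions))
print("entity-level F1:", f1_score(true_labels, true_predictions))  # expect 1.0
```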

@Dod-o
Ok, thanks for your reply!
For Q1, I think you are right.
For Q2, I am new to NLP, so I will check my code.
BTW, do you have any plans to release training code for CORD, like the FUNSD code?

@Dod-o
I re-trained LayoutLMv2 (layoutlmv2-base-uncased) on CORD: 5 epochs reached 95.4 and 10 epochs got 96.03.

***** eval metrics *****
  epoch                   =       10.0
  eval_accuracy           =     0.9776
  eval_f1                 =     0.9741
  eval_loss               =     0.1345
  eval_precision          =     0.9756
  eval_recall             =     0.9725
  eval_runtime            = 0:00:02.73
  eval_samples_per_second =     36.594
  eval_steps_per_second   =      4.757
***** Running Prediction *****
  Num examples = 100
  Batch size = 8
***** test metrics *****
  test_accuracy           =     0.9728
  test_f1                 =     0.9603
  test_loss               =      0.162
  test_precision          =     0.9592
  test_recall             =     0.9614
  test_runtime            = 0:00:02.70
  test_samples_per_second =     36.974
  test_steps_per_second   =      4.807
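
(For context on the two runs above: assuming a run_funsd.py-style HuggingFace Trainer setup, the 5-epoch vs. 10-epoch difference comes down to a single training argument. The sketch below is illustrative, not the exact command used:)

```python
# Minimal sketch: the only change between the 5-epoch and 10-epoch runs
# above would be num_train_epochs (assuming an HF Trainer setup like
# run_funsd.py; the other values here are illustrative).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="layoutlmv2-cord",
    num_train_epochs=10,           # 5 -> 95.4 F1, 10 -> 96.03 F1 per the log
    per_device_eval_batch_size=8,  # matches "Batch size = 8" in the log
    evaluation_strategy="epoch",
)
```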
Dod-o commented:

Hi @Senwang98, the CORD code is similar to the FUNSD code, so referring to the FUNSD code is fine.

Also, I'd like to know how you calculated the F1. Did you use token-level F1 rather than entity-level?

@Dod-o
I copied the metric code from run_funsd.py.
So I think I use token-level F1.

Dod-o commented:

@Senwang98 I think we use entity-level metrics; maybe you can double-check this.
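
For reference, here is a small self-contained illustration of how the two metrics diverge on the same predictions; the BIO labels are invented for the example:

```python
# Minimal sketch: entity-level vs token-level F1 on the same predictions.
# Labels are invented; seqeval scores whole entity spans, so a partially
# matched span counts as fully wrong at the entity level.
from seqeval.metrics import f1_score as entity_f1

y_true = [["B-MENU", "I-MENU", "O", "B-TOTAL"]]
y_pred = [["B-MENU", "O",      "O", "B-TOTAL"]]

# Entity level: true entities are MENU(0-1) and TOTAL(3); predicted are
# MENU(0) and TOTAL(3). Only TOTAL matches exactly -> P = R = F1 = 0.5.
print("entity-level F1:", entity_f1(y_true, y_pred))

# Token level: 3 of 4 tags agree, so a per-token score looks much better.
correct = sum(t == p for t, p in zip(y_true[0], y_pred[0]))
print("token-level accuracy:", correct / len(y_true[0]))  # 0.75
```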