facebookresearch / flores

Facebook Low Resource (FLoRes) MT Benchmark

Size mismatch between pretrained model and finetuned model

BaohaoLiao opened this issue

Hi,

When I check the sizes of my finetuned model and the pretrained model you provide, I notice they are different.

There are two major dicts in the checkpoint, i.e. "model" and "last_optimizer_state". If both were in fp32, "last_optimizer_state" should be roughly twice as big as "model", since the Adam optimizer keeps first and second moments for every parameter.
For the pretrained models you offer, the sizes are:

  • For pretrained MM100_175M, the size of "model": 336M, the size of "last_optimizer_state": 1.4G
  • For pretrained MM100_615M, the size of "model": 1.2G, the size of "last_optimizer_state": 4.7G

This makes sense, because the pretrained "model" is in fp16 while "last_optimizer_state" is in fp32, so "last_optimizer_state" should be roughly four times as big as "model".
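
For reference, this is the back-of-the-envelope arithmetic behind those expectations (a rough sketch; the parameter counts are read off the model names and may not be exact):

# fp16 weights take 2 bytes per parameter, fp32 takes 4, and Adam keeps
# two fp32 moment tensors (exp_avg and exp_avg_sq) per parameter.
def expected_sizes_gb(num_params):
    model_fp16 = num_params * 2 / 1e9        # "model" stored in fp16
    adam_fp32 = num_params * 2 * 4 / 1e9     # two fp32 moments
    return model_fp16, adam_fp32

for name, n in [("MM100_175M", 175e6), ("MM100_615M", 615e6)]:
    m, o = expected_sizes_gb(n)
    print(f"{name}: model ~{m:.2f} GB, optimizer state ~{o:.2f} GB")
# MM100_175M: model ~0.35 GB, optimizer state ~1.40 GB  -> matches 336M / 1.4G
# MM100_615M: model ~1.23 GB, optimizer state ~4.92 GB  -> matches 1.2G / 4.7G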

However, when I finetune the pretrained model, I run into some problems.

  1. The "model" is saved in fp32 instead of fp16, even though I train with --fp16. My training config is as:
DATA=/path/to/data
TOOL=/path/to/fairseq/train.py
PRETRAINED_MODEL=/path/to/flores101_mm100_615M/model.pt
lang_pairs=/path/to/language_pairs.txt

python $TOOL \
    $DATA \
    --dataset-impl mmap \
    --arch transformer_wmt_en_de_big \
    --dropout 0.1 --attention-dropout 0.1 \
    --encoder-embed-dim 1024 --decoder-embed-dim 1024 \
    --encoder-attention-heads 16 --decoder-attention-heads 16 \
    --encoder-ffn-embed-dim 4096 --decoder-ffn-embed-dim 4096 \
    --encoder-normalize-before --decoder-normalize-before \
    --encoder-layers 12 --decoder-layers 12 \
    --share-all-embeddings \
    --restore-file $PRETRAINED_MODEL \
    --task translation_multi_simple_epoch \
    --encoder-langtok "src" --decoder-langtok \
    --lang-pairs $lang_pairs \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --optimizer adam --adam-eps 1e-08 --adam-betas '(0.9, 0.98)' \
    --fp16 --fp16-init-scale 128  --fp16-scale-tolerance 0.0  --memory-efficient-fp16 \
    --lr-scheduler inverse_sqrt --lr 8e-04 --warmup-init-lr 1e-07 --warmup-updates 2500 \
    --max-tokens 2048  \
    --save-interval 1  
  2. The sizes of "model" and "last_optimizer_state" are weird:
  • For finetuned MM100_175M, the size of "model" is 1.7G, the size of "last_optimizer_state" is 1.4G.
  • For finetuned MM100_615M, the size of "model" is 4.3G, the size of "last_optimizer_state" is 4.7G.

The sizes of "model" and "last_optimizer_state" are comparable, which is strange to me. Besides, even if I manually cast the "model" weights to half precision, I only get half the size of the "model", which is still different from your pretrained "model" (see the inspection sketch below). For your convenience, you can check my 615M model at https://dynabench.org/models/250
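
In case it helps to reproduce the measurement, this is roughly how I check the dtypes and sizes (a minimal sketch; the checkpoint path is a placeholder):

import torch

ckpt = torch.load("/path/to/checkpoint_last.pt", map_location="cpu")

def state_size_gb(state):
    # total bytes of all tensors in a flat state dict
    return sum(t.numel() * t.element_size()
               for t in state.values() if torch.is_tensor(t)) / 1e9

model_state = ckpt["model"]
print("model dtypes:", {t.dtype for t in model_state.values() if torch.is_tensor(t)})
print("model size:   %.2f GB" % state_size_gb(model_state))

# casting the weights to half should roughly halve the size if they were fp32
half_state = {k: (t.half() if torch.is_tensor(t) and t.is_floating_point() else t)
              for k, t in model_state.items()}
print("after .half(): %.2f GB" % state_size_gb(half_state))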

Do you have any ideas for this?

This is the same as facebookresearch/dynalab#99

and is also tracked on facebookresearch/fairseq#3743

I think there is something wrong with how you're doing the finetuning, or some misunderstanding of how the model sizes are computed.
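
If it is only about getting a checkpoint whose size is comparable to the released ones (fp16 weights, no optimizer state), something like the following should work (a rough sketch, not an official fairseq utility; the paths are placeholders):

import torch

ckpt = torch.load("/path/to/checkpoint_last.pt", map_location="cpu")
ckpt.pop("last_optimizer_state", None)   # not needed for inference
ckpt["model"] = {
    k: (t.half() if torch.is_tensor(t) and t.is_floating_point() else t)
    for k, t in ckpt["model"].items()
}
torch.save(ckpt, "/path/to/checkpoint_eval.pt")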