Size mismatch between pretrained model and finetuned model
BaohaoLiao opened this issue · comments
Hi,
when I compare the size of my finetuned model with the pretrained model you provide, I notice they differ.
There are two major dicts in the checkpoint, i.e. "model" and "last_optimizer_state". If both are in fp32, "last_optimizer_state" should be roughly twice as big as "model", since Adam keeps first and second moments for every parameter.
For the pretrained model you offered, the sizes are:
- For pretrained MM100_175M, the size of "model": 336M, the size of "last_optimizer_state": 1.4G
- For pretrained MM100_615M, the size of "model": 1.2G, the size of "last_optimizer_state": 4.7G
This makes sense, because the pretrained "model" is in fp16 while "last_optimizer_state" is in fp32, so "last_optimizer_state" should be roughly four times as large as "model".
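A quick sanity check of the expected sizes, assuming 2 bytes per fp16 parameter, 4 bytes per fp32 parameter, and Adam storing two fp32 moment buffers per parameter (the parameter counts below are the nominal 175M/615M model sizes):

```python
def expected_sizes(n_params, model_fp16=True):
    """Rough on-disk sizes in bytes for the "model" weights and the Adam state."""
    model_bytes = n_params * (2 if model_fp16 else 4)  # fp16 = 2 B, fp32 = 4 B
    # Adam keeps two fp32 buffers per parameter: exp_avg and exp_avg_sq.
    optim_bytes = n_params * 4 * 2
    return model_bytes, optim_bytes

# fp16 weights, fp32 optimizer state -> roughly the 4x ratio seen above:
m, o = expected_sizes(615_000_000, model_fp16=True)
print(m / 1e9, o / 1e9)  # ~1.23 GB model vs ~4.92 GB optimizer state

# fp32 weights -> only a 2x ratio:
m32, _ = expected_sizes(615_000_000, model_fp16=False)
print(o / m32)  # 2.0
```

These estimates line up with the reported pretrained sizes (336M / 1.4G for 175M, 1.2G / 4.7G for 615M), which supports the fp16-weights, fp32-optimizer interpretation.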
However, when I finetune the pretrained model, I run into some problems.
- The "model" is saved in fp32 instead of fp16, even though I train with --fp16. My training config is:
DATA=/path/to/data
TOOL=/path/to/fairseq/train.py
PRETRAINED_MODEL=/path/to/flores101_mm100_615M/model.pt
lang_pairs=/path/to/language_pairs.txt
python $TOOL \
$DATA \
--dataset-impl mmap \
--arch transformer_wmt_en_de_big \
--dropout 0.1 --attention-dropout 0.1 \
--encoder-embed-dim 1024 --decoder-embed-dim 1024 \
--encoder-attention-heads 16 --decoder-attention-heads 16 \
--encoder-ffn-embed-dim 4096 --decoder-ffn-embed-dim 4096 \
--encoder-normalize-before --decoder-normalize-before \
--encoder-layers 12 --decoder-layers 12 \
--share-all-embeddings \
--restore-file $PRETRAINED_MODEL \
--task translation_multi_simple_epoch \
--encoder-langtok "src" --decoder-langtok \
--lang-pairs $lang_pairs \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--optimizer adam --adam-eps 1e-08 --adam-betas '(0.9, 0.98)' \
--fp16 --fp16-init-scale 128 --fp16-scale-tolerance 0.0 --memory-efficient-fp16 \
--lr-scheduler inverse_sqrt --lr 8e-04 --warmup-init-lr 1e-07 --warmup-updates 2500 \
--max-tokens 2048 \
--save-interval 1
- The sizes of "model" and "last_optimizer_state" are strange.
- For finetuned MM100_175M, the size of "model" is 1.7G, the size of "last_optimizer_state" is 1.4G.
- For finetuned MM100_615M, the size of "model" is 4.3G, the size of "last_optimizer_state" is 4.7G.
The sizes of "model" and "last_optimizer_state" are comparable, which is strange to me. Besides, even after I manually cast the "model" weights to half precision, its size only halves and still does not match your pretrained "model". For your convenience, you can check my 615M model at https://dynabench.org/models/250
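For reference, the manual cast to half precision was done roughly like this (a sketch; the helper name is mine, and in practice the dict comes from `torch.load(...)["model"]` of a fairseq checkpoint, stood in for here by a tiny example dict):

```python
import torch

def state_dict_to_half(state_dict):
    """Cast every floating-point tensor in a state dict to fp16, leaving
    non-float entries (e.g. integer counters) untouched."""
    return {
        k: v.half() if torch.is_tensor(v) and v.is_floating_point() else v
        for k, v in state_dict.items()
    }

# Tiny stand-in for ckpt["model"]; a real run would torch.load the checkpoint,
# replace ckpt["model"] with the halved dict, and torch.save it again.
sd = {"weight": torch.zeros(3, 4), "step": torch.tensor(5)}
half = state_dict_to_half(sd)
print(half["weight"].dtype)  # torch.float16
```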
Do you have any ideas for this?
This is the same as facebookresearch/dynalab#99
and is also tracked in facebookresearch/fairseq#3743.
I think there is something wrong with how you're doing finetuning, or some misunderstanding about how the model sizes are computed.