Poor Image Reconstruction After Fine-tuning on the MVTec Dataset
cestbonsuliu opened this issue
I am trying to perform anomaly localization after fine-tuning on the MVTec dataset. Here are some of my settings:
The file structure of the MVTec dataset is as follows:
.
├── train
│ ├── bottle
│ ├── cable
│ ├── capsule
│ ├── carpet
│ ├── grid
│ ├── hazelnut
│ ├── leather
│ ├── metal_nut
│ ├── pill
│ ├── screw
│ ├── tile
│ ├── toothbrush
│ ├── transistor
│ ├── wood
│ └── zipper
└── val
├── bottle
├── cable
├── capsule
├── carpet
├── grid
├── hazelnut
├── leather
├── metal_nut
├── pill
├── screw
├── tile
├── toothbrush
├── transistor
├── wood
└── zipper
The "train" folder contains normal samples, while the "val" folder contains abnormal samples.
My fine-tuning command is:
OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=3 main_finetune.py --accum_iter 4 --batch_size 32 --model vit_base_patch16 --finetune checkpoint/mae_pretrain_vit_base.pth --epochs 100 --blr 5e-4 --layer_decay 0.65 --weight_decay 0.05 --drop_path 0.1 --mixup 0.8 --cutmix 1.0 --reprob 0.25 --dist_eval --data_path /home/hehui/AnomalyDetection/Dataset/MvTec --nb_classes 15
Then I tested the reconstruction of an image with the following code:
import torch
import models_mae  # from the MAE repo

pre_checkpoint_url = '/home/hehui/AnomalyDetection/Project/mae/output_dir/checkpoint-99.pth'
pre_checkpoints = torch.load(pre_checkpoint_url, map_location=torch.device('cpu'))
mae_base = models_mae.mae_vit_base_patch16()
# strict=False silently skips any model keys that are absent from the checkpoint
mae_base.load_state_dict(pre_checkpoints['model'], strict=False)
loss, pred, mask = mae_base(input)  # input: a normalized (1, 3, 224, 224) tensor
print(loss)
pred = mae_base.unpatchify(pred)
display_image(pred)
The obtained results are as follows:
Can anyone help me with this issue?
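(For reference, the `unpatchify` call above just folds the per-patch predictions back into an image grid. A minimal standalone sketch of that reshape, assuming 224×224 inputs and 16-pixel patches as in `vit_base_patch16` — an illustration, not the repo's exact code:)

```python
import torch

def unpatchify(x, patch=16):
    """Fold (N, L, patch*patch*3) patch predictions into (N, 3, H, W) images."""
    n, l, _ = x.shape
    h = w = int(l ** 0.5)  # assumes a square patch grid
    x = x.reshape(n, h, w, patch, patch, 3)
    x = torch.einsum("nhwpqc->nchpwq", x)  # move channels first, interleave patches
    return x.reshape(n, 3, h * patch, w * patch)

# A 224x224 image with 16-pixel patches -> 14x14 = 196 patches
imgs = unpatchify(torch.randn(1, 196, 16 * 16 * 3))
print(imgs.shape)  # torch.Size([1, 3, 224, 224])
```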
You need to read the paper to understand what each stage does.
- Pre-training trains the encoder and decoder together to reconstruct masked patches.
- Fine-tuning adapts only the encoder to the downstream task's data distribution.
So after fine-tuning, the MAE as a whole loses its reconstruction ability: the encoder's embeddings shift to fit the downstream data, regardless of what the decoder expects.
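Note also that `main_finetune.py` fine-tunes a plain classification ViT, so the saved checkpoint most likely contains no decoder weights at all; with `strict=False`, `load_state_dict` then silently leaves the entire decoder at its random initialization. The mechanism can be seen with a toy module (a sketch, not the repo's code):

```python
import torch.nn as nn

class ToyMAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)
        self.decoder = nn.Linear(8, 8)

model = ToyMAE()
# Simulate a fine-tuned checkpoint that holds encoder weights only
ckpt = {k: v for k, v in model.state_dict().items() if k.startswith("encoder")}
result = model.load_state_dict(ckpt, strict=False)
print(result.missing_keys)  # ['decoder.weight', 'decoder.bias'] stay random
```

Checking `result.missing_keys` after loading your checkpoint would confirm whether the decoder was actually restored.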
If you pre-train on your own dataset, the MAE will learn to reconstruct your data, but the same caveat applies: fine-tuning will break the reconstruction.
Don't fine-tune if you need reconstruction.
Also be careful that your dataset is big enough for pre-training.
thanks