SunnyHaze / IML-ViT

Official repository of the paper "IML-ViT: Benchmarking Image Manipulation Localization by Vision Transformer"


About whether authentic images are involved in training and evaluation?

85zhanghao opened this issue · comments

Regarding the training dataset and test dataset, I would like to know whether authentic images are involved in the training and testing of the model.

Hi Zhanghao!

Thank you for your interest in our work! We would like to address your question by referencing the relevant content from the article.

In the Ablation section of the paper, we mentioned: "During ablation studies, we trained the model only with manipulated images in CASIAv2 and evaluated its pixel-level F1 score on CASIAv1, COVERAGE, Columbia, and NIST16 datasets."

In contrast, in the subsequent section comparing with the State-of-the-Art experiments, we explained: "Since MVSS-Net has already conducted a detailed evaluation on a fair cross-dataset protocol, we directly quote their results here and train our models with the same protocol, i.e. training on both authentic and manipulated images of CASIAv2 dataset and testing on public datasets."

To clarify, Table 3 in our ablation experiments only used manipulated images. This choice was primarily made to save training time and reduce cost. On the other hand, Table 4, which represents the comprehensive comparison with the SoTA models, follows the implementation of MVSS-Net: our models were trained with both manipulated and authentic images from the CASIAv2 dataset. This approach aims to reduce false positives and improve the performance metrics, which theoretically aligns better with practical application scenarios and requirements.

In any case, during testing we only use manipulated images to calculate the pixel-level F1 score, since a pixel-level F1 score for a sample without any true-positive pixels (i.e., an authentic image) is meaningless.

We hope this addresses your question. Feel free to reach out if you have any further inquiries. Best regards,

Xiaochen Ma

By the way, we will release the official training code this week, so please stay tuned.

Hi Xiaochen Ma! Thanks for the explanation, but I still have some questions about the experiments that I would like to ask you.
(1) In the ablation section, where no authentic images were involved in training, is the best model the one at the 200th epoch?
(2) In the comparison with the state-of-the-art experiments, where both authentic and manipulated images from the CASIAv2 dataset are used for training, is the best model chosen as "the best model on the Defacto-12k dataset (validation set)" or as "the model at the 200th epoch"?
(3) According to the F1 metrics, it seems that pixel-level localization ability is improved by adding authentic images to training. In Table 4 (the "Cross-datasets evaluation of SoTA models" experiment), is the pixel-level F1 score evaluated on both authentic and manipulated images, or only on manipulated images?
(4) In Table 5, the F1 on CASIAv1 reaches 73.4%. What is the difference between the training setup of Table 5 and those of the ablation experiment (Table 3) and Table 4?

Hi, Zhanghao!
Thanks for your thorough reading of our work. Here are the answers.

  1. No. The full training process is 200 epochs, but all reported values come from the best checkpoint observed during the whole process. It can be considered a kind of early stopping, as mentioned at the end of our Implementation section.

  2. For the metrics on Defacto-12k, we select the best model over the 200-epoch process to report. In any case, this only follows the convention of MVSS-Net in reporting performance on this dataset. Honestly, an F1 score lower than 0.2 is hardly better than predicting the whole image as positive ("total white"); if you compute the F1 score of an all-white mask on CASIAv1, the value is around 0.16. Beyond what is reported in the paper, we infer from visualizations that Defacto may not be suitable for cross-dataset validation against CASIA: Defacto mostly contains very tiny tampered regions, while CASIA often involves larger foreground objects, so there is a huge distribution gap between them.

  3. As I mentioned in yesterday's comment, the F1 score is meaningless for a sample without a single true-positive pixel (the mask of an authentic image is entirely black, i.e., it contains no 'positive' pixels). Thus we compute the pixel-level F1 score only on manipulated images for all tests. You can check the formula F1 = 2TP / (2TP + FP + FN): with no positive pixels in the ground truth, TP is always zero and the score degenerates (see the sketch after this list).

  4. Honestly, after carefully checking our logs and checkpoints, we found that the value 73.4 comes from a different setting. Table 5 was filled in by my partner, and its model was trained on CASIAv2 with A40 GPUs (48 GB memory), which allows a batch size of 4, whereas Table 4 was trained on a 3090 with batch size 1. Since the paper still requires one more revision, we will update this properly so that the implementation section and the metrics in the paper are aligned; more comprehensive settings and metrics will also continue to be updated in the GitHub repository.
    In any case, if you are comparing against the model with 16 GB memory consumption, 0.658 is the proper value.
    Although a 48 GB GPU can achieve better results, its higher cost also needs to be weighed objectively.
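
As a concrete illustration of point 3, here is a minimal sketch (not the repository's evaluation code) of the pixel-level F1 computation, showing that a sample whose ground truth has no positive pixels cannot yield a meaningful score:

import numpy as np

def pixel_f1(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """Pixel-level F1 = 2*TP / (2*TP + FP + FN) for binary 0/1 masks."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    denom = 2 * tp + fp + fn
    if denom == 0:
        return float("nan")  # authentic image with an empty prediction: F1 is undefined
    return 2 * tp / denom

pred = np.zeros((256, 256), dtype=np.uint8)
pred[64:120, 64:120] = 1                              # some predicted tampered region

manipulated_gt = np.zeros((256, 256), dtype=np.uint8)
manipulated_gt[64:128, 64:128] = 1                    # ground truth with a tampered region
authentic_gt = np.zeros((256, 256), dtype=np.uint8)   # authentic image: no positive pixels

print(pixel_f1(pred, manipulated_gt))  # a meaningful score (≈0.87 here)
print(pixel_f1(pred, authentic_gt))    # no TP is possible, so the score is 0 regardless of quality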

Thank you once again for your thorough and detailed reading, as well as for bringing attention to specific issues. This will also help us improve areas in the paper that may have been overlooked.

If you have further questions feel free to reach out. 🤗

Sincere thanks for your reply. I am looking forward to the release of the training code!

Hi, I'm sorry to bother you again. Regarding loading the ViT pre-trained model, I'm not quite sure which model is loaded; could you please provide the download address for the pre-trained weights?

Hi!
We have mentioned this in the Training section of our Readme.md: we load the Masked Autoencoder (MAE) pre-trained weights.
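
For reference, a minimal sketch of how such weights are typically loaded. This is not the repository's exact code: the checkpoint filename is a placeholder, and a timm ViT-Base is used only as a stand-in for the actual IML-ViT model class.

import timm
import torch

# Placeholder path to a locally downloaded MAE ViT-Base pre-trained checkpoint.
ckpt = torch.load("mae_pretrain_vit_base.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # MAE checkpoints usually store weights under "model"

# Stand-in backbone; the real training script builds IML-ViT's own ViT encoder instead.
model = timm.create_model("vit_base_patch16_224", pretrained=False)

# strict=False: the classification head (and any extra IML-ViT modules) are not in the
# MAE checkpoint, and the MAE decoder weights are not needed here.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", len(missing), "unexpected keys:", len(unexpected))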

Thank you very much for releasing the training code. May I ask a few questions about the training process? (1) Images are padded to 1024*1024 before being fed into the model for training; when computing the loss, does the padded region also participate in the loss calculation? (2) Roughly how long does it take to train 200 epochs on an NVIDIA RTX 3090 GPU? With only a single GPU it seems like it would take a very long time.

Hi, thanks for your interest!
In our design, the padding does participate in the loss calculation. As far as I remember, if it is excluded, the image rescaling can cause regions near the bottom-right corner to be trained poorly.
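
To make this concrete, here is a minimal sketch (not the repository's exact training code, and using a plain BCE loss as a stand-in) of padding an image/mask pair to 1024*1024 and computing the loss over the whole padded canvas:

import torch
import torch.nn.functional as F

def pad_to_1024(x: torch.Tensor) -> torch.Tensor:
    """Zero-pad a CxHxW tensor to Cx1024x1024 at the bottom-right corner."""
    _, h, w = x.shape
    return F.pad(x, (0, 1024 - w, 0, 1024 - h), value=0)

img = torch.rand(3, 768, 512)                      # an arbitrarily sized input image
mask = (torch.rand(1, 768, 512) > 0.9).float()     # its binary ground-truth mask

img_padded = pad_to_1024(img).unsqueeze(0)         # 1 x 3 x 1024 x 1024
mask_padded = pad_to_1024(mask).unsqueeze(0)       # 1 x 1 x 1024 x 1024

logits = torch.randn_like(mask_padded)             # stand-in for the model's output
# The loss averages over every pixel of the 1024x1024 canvas, padded region included.
loss = F.binary_cross_entropy_with_logits(logits, mask_padded)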

As a current reference point: with four 3090s, batch size 2, and training only on manipulated images, one epoch takes about 4 minutes 40 seconds.

Hope this helps!


Sorry to bother you. In the training code, does it mean that CASIAv2 is used as the training set and CASIAv1 as the validation set? For the results in Table 4 of the paper, which of LN, BN, and IN was used, and what batch size? Also, will the evaluation code be released? For some reason my own evaluation code gives absurd results; the F1 on CASIAv1 came out as 0.02 o(╥﹏╥)o. I also want to ask whether testing is supposed to be this slow; one test run takes me nearly two hours.

Sorry to bother you. In the training code, does it mean that CASIAv2 is used as the training set and CASIAv1 as the validation set? Also, will the evaluation code be released?

Yes. For evaluation, you can directly use the evaluate code from our training loop to compute the F1 value. It is this function; just reorganize it following the same interface:

def test_one_epoch(model: torch.nn.Module,

For the results in Table 4 of the paper, which of LN, BN, and IN was used, and what batch size? Also, will the evaluation code be released?

BatchNorm is generally the first choice; if you run into convergence problems, then consider InstanceNorm.

I also want to ask whether testing is supposed to be this slow; one test run takes me nearly two hours.

I am not sure whether you ran the test on the CPU. As long as a GPU is used, it should be fast. You can also refer to the inference speed in the Colab demo we provide: with a GPU attached, inference on 4 or 5 images is almost instantaneous, while CPU-only is much slower. For reference, using the test_one_epoch function mentioned above on a single 3090 with batch size 1, inference on 100 images takes 14 seconds.

Hope this helps!


Thank you very much! It turned out my earlier problem was that the model was not loaded correctly; the F1 results look much more normal now. Indeed, my F1 and AUC were computed on the CPU. F1 now follows the algorithm in the paper, but for AUC I only know how to compute it on the CPU and don't know how to do it on the GPU. How did you write this part of the code? Computing it on the CPU is really slow.

An important step in our evaluation is to first remove the zero-padding and compute the metric only on the valid image region, rather than directly on the full 1024x1024 matrix. I suspect this might also be affecting your results.

For AUC we also just call the methods in sklearn. Part of the function looks roughly like this; you can use it as a reference:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

def cal_precise_AUC_with_shape(predict, target, shape):
    # Crop the prediction and ground truth back to the valid (un-padded) region,
    # whose original height/width is stored in `shape`.
    predict2 = predict[0][0][:shape[0][0], :shape[0][1]]
    target2 = target[0][0][:shape[0][0], :shape[0][1]]
    # Flatten to 1-D and move to CPU for sklearn.
    predict3 = predict2.reshape(-1).cpu()
    target3 = target2.reshape(-1).cpu()
    # -----visualize ROC curve-----
    fpr, tpr, thresholds = roc_curve(target3, predict3, pos_label=1)
    plt.plot(fpr, tpr)
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.savefig("./appro2.png")
    # ------------------------------
    AUC = roc_auc_score(target3, predict3)
    return AUC
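
A hypothetical call, assuming the same batched layout as in training (prediction and target of shape [1, 1, 1024, 1024], with shape holding the original height and width before padding):

import torch

predict = torch.rand(1, 1, 1024, 1024)                 # model output after sigmoid
target = (torch.rand(1, 1, 1024, 1024) > 0.9).float()  # padded ground-truth mask
shape = torch.tensor([[768, 512]])                     # original HxW before padding
print(cal_precise_AUC_with_shape(predict, target, shape))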

To be honest, AUC is definitely slower than F1: the algorithm inherently sweeps the whole matrix with different thresholds, so its computational complexity is higher than that of F1.

Hope this helps!


Sorry to bother you again, but I have run into another problem. When testing on the NIST16 dataset, probably because the images in this dataset have very high resolution (all larger than 1024), computing the ROC raises an error: ValueError: Only one class present in y_true. ROC AUC score is not defined in that case. It says the ground-truth mask contains only one class; I suspect that when the image is reduced to 1024*1024, the cropping leaves only a single class.

Then I commented out the AUC code and computed only F1, but F1 also raised an error. I then remembered that the F1 code uses slicing to extract the valid region from the 1024*1024 tensor, but the actual image shape is larger than 1024, which causes an out-of-bounds index. In this situation, does the paper compute directly at the 1024*1024 size, or resize back to the original image size? I first removed the slicing and computed F1 directly at 1024, but then got the warning "UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no true nor predicted samples. Use zero_division parameter to control this behavior. _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))", which says the denominator is 0; that also seems wrong. Could it be that cropping to 1024 during data processing turns some masks into a single class? The F1 computed this way is extremely low. How should testing be handled for high-resolution images? Neither my F1 nor my AUC can be computed correctly.

I also want to ask about edge generation: the default edge_width is 7. Is there any specific meaning behind this value of 7?

For the first question: all images larger than 1024 need to be preprocessed first. For NIST16, we first resize each image so that its long side equals 1024, and then run the test.
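
A minimal sketch of this preprocessing step, assuming PIL (this is not necessarily the exact script we used): resize so that the long side equals 1024 while keeping the aspect ratio; masks would use nearest-neighbor interpolation instead.

from PIL import Image

def resize_long_side(img: Image.Image, long_side: int = 1024,
                     resample=Image.BICUBIC) -> Image.Image:
    """Downscale so that max(H, W) == long_side; smaller images are left untouched."""
    w, h = img.size
    if max(w, h) <= long_side:
        return img
    scale = long_side / max(w, h)
    return img.resize((round(w * scale), round(h * scale)), resample)

# Hypothetical file names, for illustration only:
# img = resize_long_side(Image.open("NIST16_example.jpg"))
# msk = resize_long_side(Image.open("NIST16_example_mask.png"), resample=Image.NEAREST)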

I also want to ask about edge generation: the default edge_width is 7. Is there any specific meaning behind this value of 7?

This number was fixed early in the project. When eyeballing the datasets, we found that because the masks are manually annotated, they are often offset from the actual tampering traces, so the edge band definitely needs to be somewhat wide. However, for the current "online" edge-generation algorithm, the value cannot be too large either, or the augmentation step becomes slow on the CPU. We eventually settled on 7.
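
As an illustration (a sketch in the spirit of the on-the-fly edge generation, not necessarily the exact implementation), a boundary band whose width is controlled by edge_width can be obtained with morphological dilation and erosion:

import cv2
import numpy as np

def make_edge_mask(mask: np.ndarray, edge_width: int = 7) -> np.ndarray:
    """mask: HxW binary (0/1) array; returns a 0/1 band around the region boundaries."""
    mask_u8 = (mask > 0).astype(np.uint8)
    kernel = np.ones((edge_width, edge_width), np.uint8)
    dilated = cv2.dilate(mask_u8, kernel)
    eroded = cv2.erode(mask_u8, kernel)
    return np.clip(dilated - eroded, 0, 1)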

I have a vague memory of this; I believe I did make some modifications, but I can't remember exactly what was changed. Please contact me by email and I will send you the version I used so you can see the differences.


Currently, we have uploaded all modified datasets to this repository for easy replication of IML-ViT and for completing subsequent work. Feel free to star or share with other researchers. Thank you!

https://github.com/SunnyHaze/IML-Dataset-Corrections