tgxs002 / HPSv2

Human Preference Score v2: A Solid Benchmark for Evaluating Human Preferences of Text-to-Image Synthesis


Do you test on the HPSv1 dataset using the HPSv2 checkpoint?

LinB203 opened this issue · comments

commented

Hi, I use the HPSv2 checkpoint to test on the HPSv1 dataset and get 59.51% accuracy, but if I use the HPSv1 checkpoint, I get 65.44% accuracy. Why is it worse? Is it a domain adaptation issue?
Btw, the aesthetic predictor gets 55.57% accuracy on HPSv1. Is that normal?
num_images is a tensor of 2s, e.g. [2, 2, 2, 2, ...].

Using the HPSv2 checkpoint to test the HPSv1 dataset:

    # `bar` iterates over the evaluation dataloader; `score` and `total` are
    # defined before this loop.
    for batch in bar:
        # Each batch holds the flattened images of several prompt groups, the group
        # sizes (num_images), the captions, and the human ranking of each group.
        images, num_images, labels, caption, rank = batch
        images = images.cuda()
        num_images = num_images.cuda()
        # labels = labels.cuda()
        caption = caption.cuda()
        rank = rank.cuda()

        with torch.no_grad():
            image_features = model.encode_image(images)
            text_features = model.encode_text(caption)

            # L2-normalize so the dot product below is a cosine similarity.
            image_features = image_features / image_features.norm(dim=-1, keepdim=True)
            text_features = text_features / text_features.norm(dim=-1, keepdim=True)

            logits_per_image = image_features @ text_features.T
            # For each group, keep only the column that corresponds to its own prompt
            # (one caption per group).
            paired_logits_list = [logit[:, i] for i, logit in enumerate(logits_per_image.split(num_images.tolist()))]
        # Rank the images of each group by descending score and compare against the
        # human ranking.
        predicted = [torch.argsort(-k) for k in paired_logits_list]
        hps_ranking = [[predicted[i].tolist().index(j) for j in range(n)] for i, n in enumerate(num_images)]
        rank = [i for i in rank.split(num_images.tolist())]
        score += sum([inversion_score(hps_ranking[i], rank[i]) for i in range(len(hps_ranking))])
    ranking_acc = score / total
    print(ranking_acc)

Using the HPSv1 checkpoint to test the HPSv1 dataset:

    for batch in bar:
        images, num_images, labels, caption, rank = batch
        images = images.cuda()
        num_images = num_images.cuda()
        # labels = labels.cuda()
        caption = caption.cuda()
        rank = rank.cuda()

        with torch.no_grad():
            with torch.cuda.amp.autocast():
                # The HPS v1 checkpoint is queried through the model's forward pass,
                # which returns the image/text features and the logit scale in a dict.
                outputs = model(images, caption)
                image_features = outputs["image_features"]
                text_features = outputs["text_features"]
                logit_scale = outputs["logit_scale"]
                logits_per_image = logit_scale * image_features @ text_features.T
                # Same per-group pairing as in the snippet above.
                paired_logits_list = [logit[:, i] for i, logit in enumerate(logits_per_image.split(num_images.tolist()))]

        predicted = [torch.argsort(-k) for k in paired_logits_list]
        hps_ranking = [[predicted[i].tolist().index(j) for j in range(n)] for i, n in enumerate(num_images)]
        rank = [i for i in rank.split(num_images.tolist())]
        score += sum([inversion_score(hps_ranking[i], rank[i]) for i in range(len(hps_ranking))])
    ranking_acc = score / total * 100
    print(ranking_acc)
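
(`inversion_score` is not shown above. As a point of reference, here is a minimal stand-in with the same interface, under the assumption that it measures pairwise ranking agreement, i.e. the fraction of image pairs ordered the same way as in the human ranking; the actual implementation may differ.)

    # Minimal stand-in for inversion_score (assumption: the fraction of image
    # pairs whose predicted order agrees with the human order; the actual
    # implementation may differ).
    from itertools import combinations

    def inversion_score(pred_ranking, gt_ranking):
        pred = [int(p) for p in pred_ranking]
        gt = [int(g) for g in gt_ranking]
        pairs = list(combinations(range(len(pred)), 2))
        agree = sum((pred[i] < pred[j]) == (gt[i] < gt[j]) for i, j in pairs)
        return agree / len(pairs)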

I guess there is something wrong with the numbers, emm... The best performance I got on HPD v1 was around 43. num_images should be the number of images with the same prompt in a group; it is typically 3 or 4 for HPD v1.
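
A toy illustration of that grouping (made-up numbers, not the actual HPD v1 dataloader):

    # Toy example: three prompts with 4, 3 and 4 candidate images respectively.
    # The flattened per-image scores are split back into per-prompt groups.
    import torch

    num_images = torch.tensor([4, 3, 4])          # images per prompt in this batch
    scores = torch.randn(int(num_images.sum()))   # one score per image, flattened
    groups = scores.split(num_images.tolist())    # tuple of tensors with sizes 4, 3, 4
    print([len(g) for g in groups])               # [4, 3, 4]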

commented

I guess there is something wrong with the numbers, emm... The best performance I got on HPD v1 was around 43. num_images should be the number of images with the same prompt in a group; it is typically 3 or 4 for HPD v1.

Sorry for my late reply. HPD v1 only specifies 1 preferred image out of the 3 or 4 images, so should I evaluate it like the ImageReward dataset? The ImageReward dataset also has ties; the rank may be like [1, 2, 2, 2] for a list of 4 images.

In v1, top-1 accuracy is reported, which is different from v2. You can choose different ways to evaluate depending on the baseline you are comparing with.
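
For the top-1 protocol, a minimal sketch (an illustrative helper, not code from this repo) that counts a prediction as correct when the highest-scored image is the single preferred one:

    # Sketch of HPD v1 style top-1 accuracy (illustrative helper, not repo code):
    # correct when the model's highest-scored image is the one the annotator chose.
    import torch

    def top1_correct(scores, preferred_idx):
        # scores: 1-D tensor of model scores for the images of one prompt
        # preferred_idx: index of the single human-preferred image
        return int(torch.argmax(scores).item() == preferred_idx)

    # Example: 4 images for one prompt; the annotator preferred image 2.
    scores = torch.tensor([0.21, 0.18, 0.25, 0.19])
    print(top1_correct(scores, 2))  # 1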

commented

In v1, top-1 accuracy is reported, which is different from v2. You can choose different ways to evaluate depending on the baseline you are comparing with.

Oh! I got it. I reproduced the number 43.2 on HPD v1 using the HPS v1 checkpoint, which is close to the 43.5 in your paper.

commented

In v1, top-1 accuracy is reported, which is different from v2. You can choose different ways to evaluate depending on the baseline you are comparing with.

The following results were all evaluated on the HPD v1 test split:
aesthetic 31.5, which is close to the paper's 31.4
CLIP 33.2, which is close to the paper's 32.9
HPS v1 43.2, which is close to the paper's 43.5
HPS v2 36.6, paper N/A
ImageReward 36.0, paper N/A

Do you think this is normal?

That might be correct. There might be a gap between the v1 and v2 data, because in v1 the data was not collected by directly asking users for their preference.

commented

That might be correct. There might be a gap between the v1 and v2 data, because in v1 the data was not collected by directly asking users for their preference.

That's reasonable. I further evaluated more methods and got the following numbers.
The following results were all evaluated on the HPD v2 test split:

aesthetic 76.8, paper 72.6
CLIP 62.5, paper N/A
HPS v1 77.6, paper 73.1
ImageReward 74.0, paper 70.6
HPS v2 83.3, paper 83.3

I reproduced the HPS v2 number perfectly, but there is a gap for the other methods. Did I miss anything?

commented

@tgxs002 Could you help me to reproduce your results?

Sorry for the late reply, we are investigating this issue.

@LinB203 We have checked our records, and there was indeed a bug in an earlier version of our code, which was used to evaluate the baselines. We will provide a detailed explanation of the error in this thread in the coming days and update the preprint ASAP. Thank you for pointing out the error!

@LinB203 The difference is due to an outdated evaluation protocol. When evaluating aesthetic, HPS v1, and ImageReward, we first compute the accuracy against the label given by each annotator (10 annotators for each instance in the test set). However, for HPS v2 we were using another codebase (this one), where the accuracy is computed differently: the labels from the annotators are first aggregated into an average label, and the accuracy is computed against that. The annotation file with the raw label from each annotator is now updated in the repo. Thank you again for pointing out the misalignment!
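
To make the difference concrete, here is a rough sketch with made-up data (the real test set ranks several images per prompt; this simplifies to a binary choice just to contrast the two protocols):

    # Simplified illustration of the two protocols (made-up data, binary choice
    # instead of a full ranking): protocol A scores the model against each of the
    # 10 raw annotator labels, protocol B aggregates them into one label first.
    import torch

    raw_labels = torch.tensor([0, 0, 1, 0, 0, 1, 0, 0, 0, 1])  # 10 annotators pick image 0 or 1
    model_choice = 0                                            # model's top-scored image

    # Protocol A (used for aesthetic / HPS v1 / ImageReward): average the accuracy
    # over the individual annotators.
    per_annotator_acc = (raw_labels == model_choice).float().mean()

    # Protocol B (this codebase, used for HPS v2): aggregate the annotators into an
    # average label first, then compare once.
    aggregated_label = int(raw_labels.float().mean().round())
    aggregated_acc = float(model_choice == aggregated_label)

    print(round(per_annotator_acc.item(), 2), aggregated_acc)  # 0.7 1.0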

commented

@LinB203 The difference is due to an outdated evaluation protocol. When evaluating aesthetic, HPS v1, and ImageReward, we first compute the accuracy against the label given by each annotator (10 annotators for each instance in the test set). However, for HPS v2 we were using another codebase (this one), where the accuracy is computed differently: the labels from the annotators are first aggregated into an average label, and the accuracy is computed against that. The annotation file with the raw label from each annotator is now updated in the repo. Thank you again for pointing out the misalignment!

Yes, I am just using the code of this repo to reproduce the results. Anyway, HPS v2 is the best one, haha...