Evaluation Code

Hi, thanks for your great work; it has inspired me a lot.
You compare your work with other text-to-room generation methods using 2D metrics and a user study. Could you release the evaluation code for the 2D metrics, i.e. the CLIP Score and Inception Score of the rendered images?

@lukasHoel Hi, I ran your code with custom text prompts and evaluated renderings of the generated room mesh with the CLIP score, but I only get scores in the range of 24~25. How did you obtain the numbers in Table 1 of your paper?
Specifically, I am using 'openai/clip-vit-base-patch16' to calculate the CLIP score.

Hi, sorry for the late response. We also used openai/clip-vit-base-patch16 and calculated the CLIP score against the same text prompt that was used to generate the scene. We report scores averaged over a set of images. Specifically, we only use images that show the scene from novel viewpoints, and calculate the CLIP score on all of these images:

I also attach a small script that we used to calculate the CLIP score on a folder of images:

import argparse
import os
import json
from tqdm import tqdm
import torch
import numpy as np
from PIL import Image
from torchmetrics.multimodal import CLIPScore

def pil_to_torch(img, device, normalize=True):
    img = torch.tensor(np.array(img), device=device).permute(2, 0, 1)
    if normalize:
        img = img / 255.0
    return img

def main(args):
    clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16").cuda()
    images = [Image.open(os.path.join(args.image_folder, f)) for f in os.listdir(args.image_folder) if "png" in f or "jpg" in f]

    n_images = len(images)
    scores = torch.zeros(n_images, device=clip_score.device)

    pbar = tqdm(images, desc="Calc CLIP Score")
    for i, img in enumerate(pbar):
        img_torch = pil_to_torch(img, clip_score.device, normalize=False)
        score = clip_score(img_torch, args.prompt)
        scores[i] = score.detach()

    out_dict = {
        "scores": [s.cpu().numpy().item() for s in scores],
        "mean": scores.mean().cpu().numpy().item(),
        "std": scores.std().cpu().numpy().item(),
    }

    with open(os.path.join(args.out_path, "clip_score.json"), "w") as f:
        json.dump(out_dict, f)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    parser.add_argument('--image_folder', required=True)
    parser.add_argument('--prompt', required=True)
    parser.add_argument('--out_path', required=False, default="")

    args = parser.parse_args()
    main(args)
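
To make the scale of these numbers easier to read: as documented for torchmetrics' CLIPScore, the score of an image and a caption is max(100 * cos(E_I, E_C), 0), where E_I and E_C are the CLIP image and text embeddings. The following is only a minimal sketch of that formula. The function name clip_score_from_embeddings and the random tensors are hypothetical stand-ins for the real CLIP embeddings, so the sketch runs without downloading the model.

```python
import torch
import torch.nn.functional as F


def clip_score_from_embeddings(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Per-image score max(100 * cos(image, text), 0).

    image_emb: (N, D) image embeddings, text_emb: (1, D) prompt embedding.
    Both are assumed to come from the same CLIP model.
    """
    cos = F.cosine_similarity(image_emb, text_emb, dim=-1)
    return torch.clamp(100.0 * cos, min=0.0)


# Hypothetical stand-ins for real embeddings (4 rendered views, 1 prompt).
torch.manual_seed(0)
image_emb = torch.randn(4, 512)
text_emb = torch.randn(1, 512)

scores = clip_score_from_embeddings(image_emb, text_emb)
# The script above reports the mean and std over all per-image scores.
print(scores.mean().item(), scores.std().item())
```

The score is therefore bounded between 0 and 100, and it is averaged over the per-image scores in the same way as in the script above.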


The same procedure (using the same novel-view images) applies to the Inception Score. I attach a similar script here:

import argparse
import os
import json
import torch
import numpy as np
from PIL import Image
from torchmetrics.image.inception import InceptionScore

def pil_to_torch(img, device, normalize=True):
    img = torch.tensor(np.array(img), device=device).permute(2, 0, 1)
    if normalize:
        img = img / 255.0
    return img

def main(args):
    inception_score = InceptionScore().cuda()
    images = [Image.open(os.path.join(args.image_folder, f)) for f in os.listdir(args.image_folder) if "png" in f or "jpg" in f]
    images = torch.stack([pil_to_torch(i, inception_score.device, normalize=False) for i in images], dim=0)

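    # Feed the uint8 image batch to the metric before computing the score.
    inception_score.update(images)
    out = inception_score.compute()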

    out_dict = {
        "mean": out[0].cpu().numpy().item(),
        "std": out[1].cpu().numpy().item(),
    }

    with open(os.path.join(args.out_path, "inception_score.json"), "w") as f:
        json.dump(out_dict, f)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    parser.add_argument('--image_folder', required=True)
    parser.add_argument('--out_path', required=False, default="")

    args = parser.parse_args()
    main(args)
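
For readers wondering why the script stores a mean and a std: the Inception Score is exp(E_x[KL(p(y|x) || p(y))]), usually computed on several splits of the images and then summarized as mean and standard deviation across splits. The following is a rough sketch of that computation on class probabilities only. The function name inception_score_from_probs, the number of splits, and the random logits are assumptions for illustration; this is not the torchmetrics implementation, which also runs the Inception network.

```python
import torch


def inception_score_from_probs(probs: torch.Tensor, splits: int = 2):
    """Inception Score from softmax outputs of shape (N, num_classes).

    Returns (mean, std) of the per-split scores.
    """
    eps = 1e-12  # avoids log(0) for classes with zero probability
    split_scores = []
    for chunk in probs.chunk(splits, dim=0):
        marginal = chunk.mean(dim=0, keepdim=True)  # p(y) within this split
        kl = (chunk * (chunk.clamp_min(eps).log() - marginal.clamp_min(eps).log())).sum(dim=1)
        split_scores.append(kl.mean().exp())  # exp of the average KL divergence
    split_scores = torch.stack(split_scores)
    return split_scores.mean(), split_scores.std()


# Hypothetical classifier outputs for 8 images and 5 classes.
torch.manual_seed(0)
probs = torch.softmax(torch.randn(8, 5), dim=1)
mean, std = inception_score_from_probs(probs)
print(mean.item(), std.item())
```

Under this definition the score is 1 when every image gets the same class distribution, and it reaches the number of classes when predictions are confident and evenly spread over all classes.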
