
[NeurIPS 2023 Datasets and Benchmarks] "FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation", Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, Lu Hou


FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation

¹Peking University  ²Huawei Noah's Ark Lab

News 🚀

[2023-12] Released the evaluation code in FETV-EVAL.

[2023-11] Updated with more detailed information about the FETV data and evaluation results.

Overview

FETV Benchmark

FETV consists of a diverse set of text prompts categorized along three orthogonal aspects: major content, attribute control, and prompt complexity. This enables fine-grained evaluation of text-to-video (T2V) generation models.

Data Instances

All FETV data are available in the file fetv_data.json. Each line is a data instance, formatted as follows:

{
  "video_id": "1006807024", 
  "prompt": "A mountain stream", 
  "major content": {
       "spatial": ["scenery & natural objects"], 
       "temporal": ["fluid motions"]
     }, 
  "attribute control": {
      "spatial": null, 
      "temporal": null
    }, 
  "prompt complexity": ["simple"], 
  "source": "WebVid", 
  "video_url": "https://ak.picdn.net/shutterstock/videos/1006807024/preview/stock-footage-a-mountain-stream.mp4",
  "unusual type": null
  }

[Figures: category taxonomies — Temporal Major Contents, Temporal Attributes to Control, Spatial Major Contents, Spatial Attributes to Control]

Data Fields

  • "video_id": The video identifier in the original dataset where the prompt comes from.
  • "prompt": The text prompt for text-to-video generation.
  • "major content": The major content described in the prompt.
  • "attribute control": The attribute that the prompt aims to control.
  • "prompt complexity": The complexity of the prompt.
  • "source": The original dataset where the prompt comes from, which can be "WebVid", "MSRVTT" or "ours".
  • "video_url": The url link of the reference video.
  • "unusual type": The type of unusual combination the prompt involves. Only available for data instances with "source": "ours".
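Since fetv_data.json stores one JSON object per line, it can be loaded with a few lines of Python. The helper names below (`load_fetv`, `filter_by_temporal_content`) are illustrative, not part of the released code:

```python
import json

def load_fetv(path="fetv_data.json"):
    """Load FETV instances from a JSON-lines file (one instance per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def filter_by_temporal_content(instances, category):
    """Keep prompts whose temporal major content includes `category`.

    The "temporal" field may be null, so fall back to an empty list.
    """
    return [
        ins for ins in instances
        if category in (ins["major content"]["temporal"] or [])
    ]
```

For example, `filter_by_temporal_content(load_fetv(), "fluid motions")` would select instances like the "A mountain stream" prompt shown above.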

Dataset Statistics

FETV contains 619 text prompts. The data distributions over the different categories are as follows (the numbers do not sum to 619 because a data instance can belong to multiple categories).
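Because each instance can carry several category labels, per-category counts are tallied by iterating over the label lists. A minimal sketch (the helper name `category_distribution` is illustrative, not part of the released code):

```python
from collections import Counter

def category_distribution(instances, aspect="major content", axis="spatial"):
    """Count how many prompts fall into each category along one axis.

    A prompt can carry several labels (or None), so the totals across
    categories can exceed the number of prompts.
    """
    counts = Counter()
    for ins in instances:
        labels = ins[aspect][axis] or []  # None means no label on this axis
        counts.update(labels)
    return counts
```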

Manual Evaluation of Text-to-video Generation Models

We evaluate four T2V models, namely CogVideo, Text2Video-zero, ModelScopeT2V, and ZeroScope. The generated and ground-truth videos are manually evaluated from four perspectives: static quality, temporal quality, overall alignment, and fine-grained alignment. Examples of generated videos and manual ratings can be found here.

Results of static and temporal video quality

Results of video-text alignment

Diagnosis of Automatic Text-to-video Generation Metrics

We develop automatic metrics for video quality and video-text alignment based on the UMT model, which correlate more strongly with human judgments than existing metrics.
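As a rough illustration of this kind of metric diagnosis, the sketch below computes Kendall's tau (without tie correction) between a metric's scores and human ratings; a higher tau means the metric ranks videos more like humans do. This is only a simplified example, not the FETV-EVAL implementation:

```python
def kendall_tau(metric_scores, human_scores):
    """Simplified Kendall's tau between two score lists (ties ignored).

    Counts concordant vs. discordant pairs: a pair is concordant when
    both score lists order the two items the same way.
    """
    n = len(metric_scores)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (metric_scores[i] - metric_scores[j]) * (human_scores[i] - human_scores[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A metric that ranks videos identically to humans scores 1.0; one that ranks them in exactly reverse order scores -1.0.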

Video-text alignment evaluation correlation with human

Video-text alignment ranking correlation with human

PS: The above video-text correlation results differ slightly from the previous version because we fixed some bugs in the calculation of BLIPScore and CLIPScore. The advantage of UMTScore is more pronounced in the updated results.

Video-text alignment ranking example

Video quality ranking correlation with human

Todo

  • Upload evaluation code.

License

This dataset is released under the CC BY 4.0 license.

Citation

@article{liu2023fetv,
  title   = {FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation},
  author  = {Yuanxin Liu and Lei Li and Shuhuai Ren and Rundong Gao and Shicheng Li and Sishuo Chen and Xu Sun and Lu Hou},
  year    = {2023},
  journal = {arXiv preprint arXiv:2311.01813}
}
