Mingxiang Liao1*,
Hannan Lu2*,
Xinyu Zhang3,4*,
Fang Wan1,
Tianyu Wang1,
Yuzhong Zhao1,
Wangmeng Zuo2,
Qixiang Ye1,
Jingdong Wang4
1University of Chinese Academy of Sciences,
2Harbin Institute of Technology,
3The University of Adelaide,
4Baidu Inc.
NeurIPS 2024
*Indicates Equal Contribution
Comprehensive and constructive evaluation protocols play an important role in the development of sophisticated text-to-video (T2V) generation models. Existing evaluation protocols primarily focus on temporal consistency and content continuity, while largely ignoring the dynamics of video content. Dynamics are an essential dimension for measuring both the visual vividness of generated videos and their faithfulness to text prompts. In this study, we propose an effective evaluation protocol, termed DEVIL, which centers on the dynamics dimension to evaluate T2V models. To this end, we establish a new benchmark comprising text prompts that fully reflect multiple dynamics grades, and define a set of dynamics scores at various temporal granularities to comprehensively evaluate the dynamics of each generated video. Based on the new benchmark and the dynamics scores, we assess T2V models via three metrics: dynamics range, dynamics controllability, and dynamics-based quality. Experiments show that DEVIL achieves a Pearson correlation exceeding 90% with human ratings, demonstrating its potential to advance T2V generation models.
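For context on the reported agreement with human judgment, the sketch below shows one standard way such a correlation can be computed with SciPy. The scores, ratings, and function name are placeholders for illustration, not data or code from the paper.

# Illustrative only: Pearson correlation between an automatic metric and
# human ratings, the kind of agreement reported for DEVIL (>90%).
# All numeric values below are made up.
from scipy.stats import pearsonr

def correlation_with_humans(metric_scores, human_ratings):
    """Return the Pearson correlation coefficient and its p-value."""
    r, p_value = pearsonr(metric_scores, human_ratings)
    return r, p_value

# Hypothetical per-model scores and average human ratings:
metric_scores = [0.42, 0.57, 0.63, 0.71, 0.80]
human_ratings = [2.9, 3.4, 3.6, 4.1, 4.5]
print(correlation_with_humans(metric_scores, human_ratings))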
By varying the temporal granularity, video dynamics can be categorized into three levels: (i) Inter-frame dynamics, which describes the variations between successive frames; the dynamics score at this level reflects rapid and prominent content changes. (ii) Inter-segment dynamics, which refers to the changes between video segments, each containing K frames; defined at an intermediate level, this score captures medium-speed transitions and motion patterns. (iii) Video-level dynamics, which encompasses the overall content diversity and the frequency of changes throughout the video.
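As a concrete illustration of these three temporal granularities, here is a minimal Python sketch using simple proxies: mean optical-flow magnitude for inter-frame dynamics, differences between segment-average frames for inter-segment dynamics, and per-pixel temporal variation for video-level dynamics. The proxies, function names, and the segment length K are assumptions for illustration only; they are not DEVIL's exact dynamics estimators.

# Minimal sketch of the three temporal granularities described above.
# These are illustrative proxies, NOT the dynamics scores defined by DEVIL.
import cv2
import numpy as np

def load_gray_frames(path):
    """Read a video and return its frames as 8-bit grayscale arrays."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    cap.release()
    return frames

def inter_frame_dynamics(frames):
    """Mean optical-flow magnitude between successive frames (fast changes)."""
    mags = []
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mags.append(np.linalg.norm(flow, axis=-1).mean())
    return float(np.mean(mags)) if mags else 0.0

def inter_segment_dynamics(frames, K=16):
    """Mean absolute difference between the average frames of consecutive
    K-frame segments (a coarse, medium-speed signal)."""
    seg_means = [np.mean(frames[i:i + K], axis=0)
                 for i in range(0, len(frames) - K + 1, K)]
    diffs = [np.abs(b - a).mean() for a, b in zip(seg_means[:-1], seg_means[1:])]
    return float(np.mean(diffs)) if diffs else 0.0

def video_level_dynamics(frames):
    """Per-pixel temporal standard deviation, averaged over the frame,
    as a rough proxy for overall content diversity."""
    stack = np.stack(frames, axis=0).astype(np.float32)
    return float(stack.std(axis=0).mean())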
Grade 1: The video content is mostly static.
Grade 2: The video features a single type of dynamics, such as lighting change or object motion.
Grade 3: The video includes various types of movements and morphological changes, as well as diverse patterns of change.
Grade 4: The video either contains intense movements and morphological changes, or features a wide range of change patterns.
Grade 5: As the camera swiftly navigates through different scenes, the rapid changes in viewpoint and surrounding environment contribute to an extremely high level of dynamics.
Grade 1: The video content is mostly static.
Grade 2: The video features a single type of dynamics, such as lighting change or camera motion.
Grade 3: The video contains various types of changes, such as camera motion and object motion, or multi-instance dynamics.
Grade 4: The video features a wide range of change patterns, or intense lighting change.
Grade 5: The video exhibits a high degree of dynamics due to its rich content, rapid motion, and complex motion patterns.
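If one wanted to map an automatically computed, normalized dynamics score onto the five qualitative grades listed above, a simple thresholding rule could look like the sketch below. The thresholds, names, and normalization are hypothetical illustrations, not part of DEVIL's annotation protocol, which defines these grades as human guidelines.

# Hypothetical illustration only: binning a normalized dynamics score in
# [0, 1] into the five qualitative grades described above.
GRADE_DESCRIPTIONS = [
    "mostly static",
    "single type of dynamics",
    "various types of changes",
    "wide range of change patterns",
    "extremely high dynamics",
]

def dynamics_grade(score, thresholds=(0.1, 0.3, 0.6, 0.85)):
    """Return a grade index in {1..5} and its description for a score in [0, 1].
    The threshold values here are assumptions, not values from the paper."""
    grade = 1
    for t in thresholds:
        if score > t:
            grade += 1
    return grade, GRADE_DESCRIPTIONS[grade - 1]

# Example: dynamics_grade(0.72) -> (4, "wide range of change patterns")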
@article{liao2024evaluation,
  title={Evaluation of text-to-video generation models: A dynamics perspective},
  author={Liao, Mingxiang and Lu, Hannan and Zhang, Xinyu and Wan, Fang and Wang, Tianyu and Zhao, Yuzhong and Zuo, Wangmeng and Ye, Qixiang and Wang, Jingdong},
  journal={arXiv preprint arXiv:2407.01094},
  year={2024}
}