A storyboard is a roadmap for video creation: a shot-by-shot sequence of images that visualizes the key plots of a text synopsis.
Creating video storyboards, however, remains challenging: it not only requires associating high-level text with images,
but also demands long-term reasoning to make transitions smooth across shots.
We propose a new task called Text synopsis to Video Storyboard (TeViS),
which aims to retrieve an ordered sequence of images to visualize the text synopsis.
Previous works on text-to-image retrieval can only produce static images without considering the dynamics of shots in the video.
Text-to-video works can retrieve or generate videos,
yet most of them focus on short clips of only a few seconds, as shown in Fig. 1 (a).
The frames in such videos are highly redundant and cannot satisfy a video storyboard's requirement for coherent keyframes.
Previous works on visual storytelling visualize text with a sequence of images,
but they focus more on text-image relevance and neglect the long-term reasoning needed to make transitions smooth across keyframes (see Fig. 1 (b)).
Moreover, the query texts in existing works are visually concrete and descriptive,
which makes these models less generalizable to more abstract, high-level text synopses such as the one in Fig. 1 (c).
In the TeViS task, we aim to retrieve an ordered sequence of
images from a large-scale movie database as a video
storyboard to visualize an input text synopsis. For this purpose,
we collect the MovieNet-TeViS benchmark based on
the public MovieNet dataset.
Our MovieNet-TeViS dataset consists of 10,000 pairs, each pairing an English synopsis sentence with a video storyboard,
i.e., a sequence of keyframes. There are 45,584 keyframes in total.
The number of keyframes in a storyboard ranges from 3 to 11, and about 60\% of storyboards consist of 3 or 4 keyframes.
The average number of words in a synopsis sentence is about 24.
Tab. 1 compares our dataset with related movie datasets.
It shows that the duration of a movie clip corresponding to a description in LSMDC and MAD is only 4 seconds,
and the average number of words per description is only 9-12, which is much lower than ours.
It is impossible to extract a meaningful storyboard from such short clips.
Our MovieNet-TeViS and CMD provide a synopsis or summary for 64-second and 132-second video segments, respectively.
Thus we can expect the text in CMD to be higher-level than ours, which is confirmed by its lower avgConcreteness
(please see our paper for more details).
Compared to CMD, our MovieNet-TeViS uses more words to describe video segments of half the duration,
indicating that our text synopses provide more details than those in CMD.
As a starting point for this challenging new task,
our dataset is the most appropriate in terms of video-clip duration and the semantic level of the text.
We design two evaluation settings for the TeViS task:
1. Ordering the Shuffled Keyframes Subtask:
Given a text synopsis and its shuffled ground-truth images, how well can a model order them?
To measure a model's long-term reasoning capability for ordering,
we let it order the ground-truth images and apply Kendall's τ to evaluate
how well the predicted order matches the ground-truth order (see the metric sketch after this list).
2. Retrieve-and-Order Keyframes Subtask:
Given a text synopsis, how well can a model select the relevant images from a large set of candidates
and then order them? This setting is more practical in real situations.
For this evaluation, the model is given a text synopsis and a set of 500 candidate images,
which contain the human-annotated ground-truth images
plus negative images randomly sampled from the rest of the corpus.
We consider both retrieval and ordering performance and use the product of Recall@K
and Kendall's τ as the final metric of this subtask,
where Kendall's τ is computed only over the ground-truth images returned in the top K.
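To make the two metrics concrete, below is a minimal sketch of how they can be computed. The function names, the handling of fewer than two retrieved ground-truth frames, and the use of scipy.stats.kendalltau are our own illustrative assumptions, not the official evaluation code.

# Minimal sketch of the TeViS evaluation metrics (illustrative, not the official code).
from scipy.stats import kendalltau

def ordering_tau(pred_order, gt_order):
    """Kendall's tau between a predicted keyframe order and the ground truth.
    Both arguments are lists of the same keyframe ids, e.g. gt_order = [0, 1, 2, 3]."""
    pred_rank = [pred_order.index(k) for k in gt_order]  # rank of each GT keyframe in the prediction
    gt_rank = list(range(len(gt_order)))
    tau, _ = kendalltau(gt_rank, pred_rank)
    return tau

def retrieve_and_order_score(ranked_candidates, gt_order, k):
    """Recall@K multiplied by Kendall's tau, with tau computed only on the
    ground-truth keyframes that appear among the top-K retrieved candidates."""
    top_k = ranked_candidates[:k]
    hits = [c for c in top_k if c in gt_order]      # ground-truth frames found, in retrieval order
    recall_at_k = len(hits) / len(gt_order)
    if len(hits) < 2:                               # tau is undefined for fewer than two items (assumption)
        return recall_at_k if len(hits) == 1 else 0.0
    gt_rank = [gt_order.index(c) for c in hits]     # their positions in the ground-truth storyboard
    pred_rank = list(range(len(hits)))              # their positions in the retrieved list
    tau, _ = kendalltau(gt_rank, pred_rank)
    return recall_at_k * tau

For example, ordering_tau([0, 2, 1, 3], [0, 1, 2, 3]) yields τ ≈ 0.67, since five of the six keyframe pairs keep their ground-truth relative order.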
To provide a starting point for tackling the task, we propose a text-to-image retrieval module based on a pre-trained image-text model and an encoder-decoder module for ordering images. We further propose a coherence-aware pre-training method that leverages large-scale movies to improve coherence across frames for the ordering module. The framework of our VQ-Trans model for the TeViS task is shown in the framework figure.
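As a rough illustration of how such a retrieve-then-order pipeline fits together, below is a hedged sketch built on the public CLIP package. The tiny Transformer ordering head, the checkpoint name, and all hyper-parameters are illustrative assumptions; they stand in for, rather than reproduce, the actual VQ-Trans ordering module and its coherence-aware pre-training.

# Hedged sketch of a CLIP-based retrieve-then-order pipeline for TeViS.
# The ordering head below is a simplified stand-in for the VQ-Trans module,
# not the released implementation; hyper-parameters are illustrative.
import torch
import clip  # https://github.com/openai/CLIP

class OrderingHead(torch.nn.Module):
    """Toy Transformer that scores each retrieved frame's position,
    conditioned on the synopsis embedding."""
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        encoder_layer = torch.nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = torch.nn.TransformerEncoder(encoder_layer, layers)
        self.position_scorer = torch.nn.Linear(dim, 1)

    def forward(self, frame_emb, text_emb):
        # Prepend the synopsis embedding so frames can attend to the text.
        tokens = torch.cat([text_emb.unsqueeze(1), frame_emb], dim=1)
        hidden = self.encoder(tokens)[:, 1:]              # drop the text token
        return self.position_scorer(hidden).squeeze(-1)   # one score per frame

model, preprocess = clip.load("ViT-B/32", device="cpu")   # 512-d embeddings
ordering_head = OrderingHead()                            # would be trained on MovieNet-TeViS

def retrieve_and_order(synopsis, candidate_images, k=4):
    """candidate_images: a list of PIL images; returns indices of the
    top-k retrieved frames in their predicted storyboard order."""
    with torch.no_grad():
        text_emb = model.encode_text(clip.tokenize([synopsis])).float()
        img_emb = model.encode_image(
            torch.stack([preprocess(im) for im in candidate_images])).float()
        # 1) Retrieval: cosine similarity between the synopsis and every candidate.
        sims = torch.nn.functional.cosine_similarity(text_emb, img_emb)
        top_idx = sims.topk(k).indices
        # 2) Ordering: sort the retrieved frames by the predicted position score.
        scores = ordering_head(img_emb[top_idx].unsqueeze(0), text_emb)
        order = scores.squeeze(0).argsort()
    return [top_idx[i].item() for i in order]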
In addition to the proposed VQ-Trans model, we design three strong baselines based on CLIP for ordering, as shown in Fig. C.2:
[Figure: Qualitative examples of different models for the ordering task on our MovieNet-TeViS dataset.]
The Kendall's τ scores of our approach on the Ordering Task and the Retrieve-and-Order Task are reported in Tab. 2 and Tab. 5; please refer to our paper for more details.
Nevertheless, there is still a large gap to human performance, suggesting room for promising future work.
@InProceedings{Gu_2023_TeViS,
author = {Gu, Xu and Sun, Yuchong and Ni, Feiyue and Chen, Shizhe and Wang, Xihua and Song, Ruihua and Li, Boyuan and Cao, Xiang},
title = {TeViS: Translating Text Synopses to Video Storyboards},
year = {2023},
isbn = {9798400701085},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3581783.3612417},
doi = {10.1145/3581783.3612417},
booktitle = {Proceedings of the 31st ACM International Conference on Multimedia},
pages = {4968–4979},
numpages = {12},
keywords = {movie, datasets, storyboard, synopsis},
location = {Ottawa ON, Canada},
series = {MM '23}
}