Abstract
Video-to-video translation aims to generate video frames of a target domain from an input video.
Despite its usefulness, existing video-to-video translation methods require enormous computation, necessitating model compression for wide use.
While compression methods exist that improve computational efficiency in various image/video tasks, a generally applicable compression method for video-to-video translation has not been studied much.
In response, this paper presents Shortcut-V2V, a general-purpose compression framework for video-to-video translation.
Shortcut-V2V avoids full inference for every neighboring video frame by approximating the intermediate features of a current frame from those of the preceding frame.
Moreover, our framework includes a newly proposed block called AdaBD that adaptively blends and deforms the features of neighboring frames, enabling more accurate prediction of the intermediate features.
We conduct quantitative and qualitative evaluations using well-known video-to-video translation models on various tasks to demonstrate the general applicability of our framework.
The results show that Shortcut-V2V achieves performance comparable to the original video-to-video translation models while reducing computational cost by 3.2-5.7x and memory by 7.8-44x at test time.
Pipeline
In this paper, we propose Shortcut-V2V, a general compression framework to improve the test-time efficiency in video-to-video translation.
As illustrated in Figure 1(a), given input video frames {I_t}_{t=0}^{N_T-1}, we first use the full teacher model T to synthesize the output for the first frame.
Then, for the subsequent frames, our newly proposed Shortcut block efficiently approximates f_t, the features from the l_d-th decoding layer of the teacher model.
This is achieved by leveraging the l_e-th encoding layer features a_t, along with the reference features a_ref and f_ref from the previous frame.
Here, l_d and l_e correspond to layer indices of the teacher model.
Lastly, the predicted features f̂_t are injected into the following layers of the teacher model to synthesize the final output Ô_t.
To avoid error accumulation, we conduct full teacher inference and update the reference features every α frames, where α is the maximum interval.
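The per-frame logic above can be summarized in the following minimal sketch. The helpers encode_to, middle_layers, and decode_from are hypothetical names for splitting the teacher at the l_e-th encoding and l_d-th decoding layers, and shortcut_block stands for the trained Shortcut block; the actual implementation may organize these components differently.

```python
import torch

@torch.no_grad()
def translate_video(frames, teacher, shortcut_block, l_e, l_d, alpha):
    """Translate input frames I_t into outputs O_t (sketch, hypothetical teacher API)."""
    outputs = []
    a_ref = f_ref = None  # reference features from the most recent full teacher pass
    for t, frame in enumerate(frames):
        a_t = teacher.encode_to(frame, l_e)  # l_e-th encoding-layer features a_t
        if t % alpha == 0:
            # Full teacher inference up to the l_d-th decoding layer; this also
            # refreshes the reference features to avoid error accumulation.
            f_t = teacher.middle_layers(a_t, l_e, l_d)
            a_ref, f_ref = a_t, f_t
        else:
            # Shortcut block approximates f_t from a_t and the reference features.
            f_t = shortcut_block(a_t, a_ref, f_ref)
        # Remaining teacher layers synthesize the final output from (exact or
        # approximated) l_d-th decoding-layer features.
        outputs.append(teacher.decode_from(f_t, l_d))
    return outputs
```

Note that full inference (the first branch) also covers the first frame t = 0, matching the pipeline description above.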
Figure 1. Overview of the proposed Shortcut-V2V. (a) shows the overall architecture of Shortcut-V2V, and (b) shows the detailed architecture of the Shortcut block.