Chunyang Li1*, Yuanbo Yang1*, Jiahao Shao1*, Hongyu Zhou1, Katja Schwarz2, Yiyi Liao1†

1Zhejiang University    2SpAItial

* Equal contribution    † Corresponding author

TL;DR: ReRoPE is a plug-and-play framework for controllable video generation that achieves precise camera control by injecting relative pose information into the underutilized low-frequency bands of standard Rotary Positional Embeddings (RoPE).

Toy Case Experiment

Our spectral analysis reveals that RoPE's low-frequency components are redundant, exhibiting minimal phase shift across temporal indices (left). This insight is confirmed by empirical masking experiments showing that generation quality is preserved even when these low-frequency bands are replaced with identity mappings, whereas masking high frequencies leads to model collapse (right).
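The redundancy is easy to see numerically. In standard RoPE, channel pair i rotates by an angle proportional to base^(-2i/d) per position, so the lowest-frequency pairs barely move across the temporal axis. The sketch below (our own illustration, not the authors' code) computes these angles directly:

```python
import numpy as np

# Standard RoPE: channel pair i rotates at frequency base^(-2i/dim).
# For the low-frequency pairs (large i), the per-step angle is so small
# that the phase is nearly constant across temporal indices, i.e. the
# rotation is close to an identity mapping.
def rope_angles(num_pos, dim, base=10000.0):
    # angles[t, i] = t * base^(-2i/dim): phase applied at position t, pair i
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # shape (dim//2,)
    t = np.arange(num_pos)
    return np.outer(t, inv_freq)  # shape (num_pos, dim//2)

angles = rope_angles(num_pos=32, dim=128)
print(angles[-1, 0])   # highest-frequency pair: 31.0 rad, many full wraps
print(angles[-1, -1])  # lowest-frequency pair: well under 0.01 rad
```

Because the low-frequency phases stay near zero over a typical frame window, replacing them with an identity mapping changes almost nothing, which is what the masking experiment confirms.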

Method

ReRoPE enables relative camera control by repurposing the redundant low-frequency temporal bands of a pre-trained Video DiT to inject camera projection signals, while keeping the high-frequency and spatial bands intact to preserve the generative prior. This plug-and-play design supports both Video-to-Video and Image-to-Video tasks without disrupting the backbone's original capabilities.
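The band split can be sketched as follows. This is a minimal illustration under our own assumptions, not the released implementation: `pose_phases` stands in for whatever projection of the per-frame relative camera pose the method maps into rotation angles, and `low_band` is a hypothetical count of repurposed channel pairs.

```python
import numpy as np

# Hypothetical sketch of ReRoPE's band split (not the authors' code):
# keep the high-frequency temporal bands as standard RoPE, and overwrite
# the redundant low-frequency bands with phases derived from the relative
# camera pose.
def rerope_angles(num_frames, dim, pose_phases, low_band=16, base=10000.0):
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    t = np.arange(num_frames)
    angles = np.outer(t, inv_freq)       # standard temporal RoPE angles
    # Replace the `low_band` lowest-frequency pairs with pose-driven phases.
    angles[:, -low_band:] = pose_phases  # shape (num_frames, low_band)
    return angles

# Identity camera path -> zero phase, which recovers the identity-mapping
# setting from the toy masking experiment.
phases = np.zeros((8, 16))
angles = rerope_angles(num_frames=8, dim=128, pose_phases=phases)
```

Only the temporal low-frequency pairs are touched; the spatial axes of the 3D RoPE would be left unchanged, which is what keeps the pre-trained prior intact.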

Method Pipeline

More Results

Video-to-Video (V2V)

Input Video | Synthesized Video

Image-to-Video (I2V)

Input Image | Synthesized Video

[Gallery: source images paired with videos synthesized along camera trajectories 02, 03, 05, 06, 07, 09, and 10]