Chunyang Li1*, Yuanbo Yang1*, Jiahao Shao1*, Hongyu Zhou1, Katja Schwarz2, Yiyi Liao1†

1Zhejiang University    2SpAItial

* Equal contribution    † Corresponding author

TL;DR: ReRoPE is a plug-and-play framework for controllable video generation that achieves precise camera control by injecting relative pose information into the underutilized low-frequency bands of standard Rotary Positional Embeddings (RoPE).

Toy Case Experiment

Our spectral analysis reveals that RoPE's low-frequency components are redundant, exhibiting minimal phase shift across temporal indices (left). This insight is confirmed by empirical masking experiments showing that generation quality is preserved even when these low-frequency bands are replaced with identity mappings, whereas masking high frequencies leads to model collapse (right).
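The redundancy is easy to see numerically. In standard RoPE, channel pair i rotates by an angle proportional to base^(-2i/d) per position, so the lowest-frequency pairs barely move across the temporal axis. The sketch below (our own illustration, not the authors' code) computes these angles directly:

```python
import numpy as np

# Standard RoPE: channel pair i rotates at frequency base^(-2i/dim).
# For the low-frequency pairs (large i), the per-step angle is so small
# that the phase is nearly constant across temporal indices, i.e. the
# rotation is close to an identity mapping.
def rope_angles(num_pos, dim, base=10000.0):
    # angles[t, i] = t * base^(-2i/dim): phase applied at position t, pair i
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # shape (dim//2,)
    t = np.arange(num_pos)
    return np.outer(t, inv_freq)  # shape (num_pos, dim//2)

angles = rope_angles(num_pos=32, dim=128)
print(angles[-1, 0])   # highest-frequency pair: 31.0 rad, many full wraps
print(angles[-1, -1])  # lowest-frequency pair: well under 0.01 rad
```

Because the low-frequency phases stay near zero over a typical frame window, replacing them with an identity mapping changes almost nothing, which is what the masking experiment confirms.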

Method

ReRoPE enables relative camera control by repurposing the redundant low-frequency temporal bands of a pre-trained Video DiT to inject camera projection signals, while keeping the high-frequency and spatial bands intact to preserve the generative prior. This plug-and-play design supports both Video-to-Video and Image-to-Video tasks without disrupting the backbone's original capabilities.
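The band split can be sketched as follows. This is a minimal illustration under our own assumptions, not the released implementation: `pose_phases` stands in for whatever projection of the per-frame relative camera pose the method maps into rotation angles, and `low_band` is a hypothetical count of repurposed channel pairs.

```python
import numpy as np

# Hypothetical sketch of ReRoPE's band split (not the authors' code):
# keep the high-frequency temporal bands as standard RoPE, and overwrite
# the redundant low-frequency bands with phases derived from the relative
# camera pose.
def rerope_angles(num_frames, dim, pose_phases, low_band=16, base=10000.0):
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    t = np.arange(num_frames)
    angles = np.outer(t, inv_freq)       # standard temporal RoPE angles
    # Replace the `low_band` lowest-frequency pairs with pose-driven phases.
    angles[:, -low_band:] = pose_phases  # shape (num_frames, low_band)
    return angles

# Identity camera path -> zero phase, which recovers the identity-mapping
# setting from the toy masking experiment.
phases = np.zeros((8, 16))
angles = rerope_angles(num_frames=8, dim=128, pose_phases=phases)
```

Only the temporal low-frequency pairs are touched; the spatial axes of the 3D RoPE would be left unchanged, which is what keeps the pre-trained prior intact.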

Method Pipeline

More Results

Video-to-Video (V2V)

Input Video | Synthesized Video

Image-to-Video (I2V)

Input Image | Synthesized Video

[Gallery: source images paired with videos synthesized along camera trajectories 02, 03, 05, 06, 07, 09, and 10]