Toy Case Experiment
Our spectral analysis reveals that RoPE's low-frequency components are redundant, exhibiting minimal phase shift across temporal indices (left). This insight is confirmed by empirical masking experiments showing that generation quality is preserved even when these low-frequency bands are replaced with identity mappings, whereas masking high frequencies leads to model collapse (right).
Method
ReRoPE enables relative camera control by repurposing the redundant low-frequency temporal bands of a pre-trained Video DiT to inject camera projection signals, while keeping high-frequency and spatial bands intact to preserve generative priors. This plug-and-play design effectively supports both Video-to-Video and Image-to-Video tasks without disrupting the backbone's original capabilities.
More Results
Video-to-Video (V2V)
Image-to-Video (I2V)
Source Image
Camera 07
Camera 10
Source Image
Camera 05
Camera 09
Source Image
Camera 05
Camera 09
Source Image
Camera 06
Camera 07
Source Image
Camera 02
Camera 03