Restereo: Unifying diffusion stereo video generation and restoration

CVPR Workshop on Synthetic Data for Computer Vision (SynData4CV), 2026.

Xingchang Huang¹,², Ashish Kumar Singh⁴, Florian Dubost⁴, Cristina Nader Vasconcelos⁵, Sakar Khattar⁴, Liang Shi⁴, Christian Theobalt¹,², Cengiz Öztireli³,⁴, Gurprit Singh¹,²

¹MPI Informatics, ²VIA Center, ³University of Cambridge, ⁴Google, ⁵Google DeepMind

arXiv · Code and Data (coming soon)


Teaser: Our method, Restereo, is fine-tuned as a single unified model that jointly performs robust stereo video generation and restoration under various levels of degradation (noise, downsampling, and no degradation). The input videos above are from Pixabay, DAVIS, and the Waymo Open Dataset, respectively. The quality of all stereo videos in the paper can also be examined by cross-eye viewing of the left and right images, by flipping the left and right views in an image viewer, or through the following videos, which are compatible with a VR headset.

Abstract

Stereo video generation has gained increasing interest, yet most recent video diffusion models remain limited to monocular outputs. Existing stereo approaches also struggle with degraded inputs and focus on stereo generation only. In this paper, we propose a unified diffusion framework that jointly performs stereo video generation and restoration from degraded video. The key idea is to inject video degradations during training and condition the model on warped masks, allowing it to learn robust two-view consistency directly from data, even though explicit view-consistency losses are not feasible in latent diffusion models. This design allows effective fine-tuning on small synthetic datasets and supports a single model that handles different degradation types, including downsampling and noise addition. Experiments show that our method outperforms existing baselines across varying levels of degradation, while matching their performance on non-degraded inputs and achieving a superior quality–time tradeoff.
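To make the idea of injecting degradations during training more concrete, here is a minimal NumPy sketch. It is not the authors' pipeline: the function name `degrade_clip`, the box-filter downsampling, the Gaussian noise model, and the default `scale` and `noise_sigma` values are all assumptions chosen for illustration. It only shows how one clip could randomly receive one of the three conditions named above (no degradation, downsampling, noise) before being fed to a model.

```python
import numpy as np

# The three conditions mentioned in the abstract; the sampling scheme is an assumption.
DEGRADATIONS = ("none", "downsample", "noise")


def degrade_clip(clip, rng, kind=None, scale=4, noise_sigma=0.1):
    """Apply one degradation to a video clip (hypothetical helper).

    clip: float array of shape (T, H, W, C) with values in [0, 1].
    rng: a numpy.random.Generator.
    kind: one of DEGRADATIONS, or None to pick one uniformly at random.
    Returns (degraded_clip, kind); the output has the same shape as the input.
    """
    if kind is None:
        kind = str(rng.choice(DEGRADATIONS))

    if kind == "none":
        return clip.copy(), kind

    if kind == "downsample":
        t, h, w, c = clip.shape
        if h % scale or w % scale:
            raise ValueError("height and width must be divisible by scale")
        # Average-pool each scale x scale block (assumed box filter).
        low = clip.reshape(t, h // scale, scale, w // scale, scale, c).mean(axis=(2, 4))
        # Nearest-neighbour upsample back to the original resolution.
        return low.repeat(scale, axis=1).repeat(scale, axis=2), kind

    if kind == "noise":
        # Additive Gaussian noise (assumed noise model), clipped to the valid range.
        noisy = clip + rng.normal(0.0, noise_sigma, size=clip.shape)
        return np.clip(noisy, 0.0, 1.0), kind

    raise ValueError(f"unknown degradation kind: {kind}")


rng = np.random.default_rng(0)
clip = rng.random((4, 16, 16, 3))          # stand-in for a short video clip
degraded, kind = degrade_clip(clip, rng)   # random condition per training sample
```

In a training loop, a helper of this kind would be applied to the input video only, while the clean stereo pair stays the target, so that the model sees degraded and non-degraded inputs with the same supervision.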

Framework

[Figure: framework overview]

Results

Here we show the videos for all results presented in the main paper. Feel free to download the videos and check their quality with a VR headset, by cross-eye viewing, or by flipping the left and right views to see the stereo effect.
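For readers who want to try cross-eye viewing on their own frames, the following is a small sketch of how a side-by-side image can be composed. The function name `side_by_side` is hypothetical, and the sketch makes no assumption about how the videos on this page are laid out. The one fact it relies on is that cross-eye viewing needs the right-eye image on the left half of the frame, while parallel viewing keeps the left-eye image on the left.

```python
import numpy as np


def side_by_side(left, right, cross_eye=True):
    """Compose a side-by-side stereo frame or clip (hypothetical helper).

    left, right: arrays of shape (H, W, C) or (T, H, W, C) with identical shapes.
    cross_eye: if True, place the right view on the left half (cross-eye layout);
               if False, keep the left view on the left half (parallel layout).
    """
    if left.shape != right.shape:
        raise ValueError("left and right views must have the same shape")
    first, second = (right, left) if cross_eye else (left, right)
    # Axis -2 is the width axis for both (H, W, C) and (T, H, W, C) inputs.
    return np.concatenate([first, second], axis=-2)


left = np.zeros((2, 3, 3))    # toy left view
right = np.ones((2, 3, 3))    # toy right view
frame = side_by_side(left, right)  # right view ends up on the left half
```

Switching `cross_eye` to `False` swaps the two halves, which is the same operation as flipping the left and right views of an existing side-by-side image.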

↓ Comparison on downsampling-degraded videos from our test dataset and the DAVIS test dataset

Input

Real-ESRGAN + StereoCrafter

FMA-Net + StereoCrafter

DLoRAL + StereoCrafter

Ours

Input

Real-ESRGAN + StereoCrafter

FMA-Net + StereoCrafter

DiffIR2VR-Zero + StereoCrafter

Ours

↓ Comparison on noise-degraded videos from our test dataset and the DAVIS test dataset

Input

Real-ESRGAN + StereoCrafter

DiffIR2VR-Zero + StereoCrafter

VRT + StereoCrafter

Ours

Input

Real-ESRGAN + StereoCrafter

DiffIR2VR-Zero + StereoCrafter

VRT + StereoCrafter

Ours

↓ Comparable results between GEN3C and Ours on long, non-degraded self-driving videos

GEN3C (note that for GEN3C the right-view is input and left-view is generated, opposite to Ours)

Ours

GEN3C (note that for GEN3C the right-view is input and left-view is generated, opposite to Ours)

Ours

GEN3C (note that for GEN3C the right-view is input and left-view is generated, opposite to Ours)

Ours

GEN3C (note that for GEN3C the right-view is input and left-view is generated, opposite to Ours)

Ours

↓ Our generated left-view and right-view videos from the teaser and other figures

Our generated output (input from our created test dataset)

Our generated output (input from DAVIS test dataset)

Our generated output (input from Waymo test dataset)

Our generated output (input from our created test dataset)

Our generated output (input from our created test dataset)