MV2MV: Multi-View Image Translation via View-Consistent Diffusion Models
Figure 1: We present MV2MV, a unified multi-view image to multi-view image translation framework, enabling various multi-view image translation tasks such as super-resolution (top row), text-driven editing (2nd and 3rd rows), and more. Our method achieves high-quality results with fine details while maintaining view consistency.
Abstract
Image translation, which aims to transfer images from one domain to another, has a wide range of applications in computer graphics and computer vision. Thanks to the excellent generation capability of diffusion models, recent single-view image translation methods achieve realistic results. However, directly applying diffusion models to multi-view image translation remains challenging due to two major obstacles: the need for paired training data and limited view consistency. To overcome these obstacles, we present MV2MV, the first unified multi-view image to multi-view image translation framework based on diffusion models. Firstly, we propose a novel self-supervised training strategy that exploits the success of off-the-shelf single-view image translators and the 3D Gaussian Splatting (3DGS) technique to generate pseudo ground truths as supervisory signals, leading to enhanced consistency and fine details. Additionally, we propose a latent multi-view consistency block, which utilizes a latent-3DGS as the underlying 3D representation to facilitate information exchange across multi-view images and inject a 3D prior into the diffusion model to enforce consistency. Finally, our approach simultaneously optimizes the diffusion model and 3DGS to achieve a better trade-off between consistency and realism. Extensive experiments across various translation tasks demonstrate that MV2MV outperforms task-specific specialists both quantitatively and qualitatively.
Download
Paper (~2 MB)
Code (coming soon)
Slides (coming soon)
Method
We propose MV2MV, a unified multi-view image to multi-view image translation framework based on diffusion models, which handles various multi-view image translation tasks such as super-resolution, denoising, deblurring, and text-driven editing (see Fig. 1). Firstly, we introduce a novel self-supervised training strategy, called Consistent and Adversarial Supervision (CAS). Specifically, we first process the multi-view images individually using off-the-shelf single-view image translators to obtain a set of high-quality outputs, and then feed them into 3D Gaussian Splatting (3DGS) to average out the inconsistencies and yield consistent outputs. These two sets of outputs are regarded as pseudo ground truths serving as supervisory signals, and a consistency loss and an adversarial loss are introduced to effectively combine the advantages of the two pseudo ground truths, ensuring both consistency and realism. Secondly, we propose a plug-in latent multi-view consistency block, named LAConsistNet, to construct our view-consistent diffusion model (VCDM). Specifically, the LAConsistNet block utilizes a latent-3DGS as the underlying 3D representation to enable information exchange among multi-view images, thereby guaranteeing multi-view consistency. Finally, we introduce a joint optimization strategy that simultaneously trains the VCDM and 3DGS to ensure the consistency of the details derived from the adversarial loss, resulting in a better trade-off between consistency and realism.
Figure 2: Overview of MV2MV. Given multi-view images (top left), we utilize the CAS strategy (right) to train the proposed VCDM (left), which exploits the success of off-the-shelf single-view image translators and 3DGS to generate pseudo ground truths as supervisory signals. LAConsistNet utilizes a latent-3DGS as the underlying 3D representation to enable information exchange across multi-view images, ensuring 3D consistency. The joint optimization strategy simultaneously optimizes VCDM and 3DGS to achieve a better trade-off between consistency and realism.
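To make the CAS strategy and the joint optimization more concrete, here is a minimal, self-contained PyTorch sketch of one training step. Everything in it is a toy stand-in under our own assumptions: the ToyVCDM, ToyRenderer, and ToyDiscriminator classes, the L1 consistency loss, the non-saturating adversarial loss, and the weights w_cons/w_adv are illustrative placeholders rather than the released implementation (LAConsistNet and the actual latent-3DGS renderer are abstracted away).

```python
# Illustrative sketch of a CAS training step: consistency target from 3DGS
# renderings, adversarial target from per-view translator outputs, and a joint
# update of the diffusion model and the 3DGS representation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRenderer(nn.Module):
    """Toy stand-in for a differentiable 3DGS renderer: learnable per-view images."""
    def __init__(self, n, c, h, w):
        super().__init__()
        self.views = nn.Parameter(torch.randn(n, c, h, w) * 0.01)

    def forward(self):
        return self.views

class ToyVCDM(nn.Module):
    """Toy stand-in for the view-consistent diffusion model (one conv layer)."""
    def __init__(self, c=3):
        super().__init__()
        self.net = nn.Conv2d(c, c, 3, padding=1)

    def forward(self, x):
        return self.net(x)

class ToyDiscriminator(nn.Module):
    """Toy patch discriminator driving the adversarial (detail) supervision."""
    def __init__(self, c=3):
        super().__init__()
        self.net = nn.Conv2d(c, 1, 4, stride=2, padding=1)

    def forward(self, x):
        return self.net(x)

def cas_training_step(vcdm, renderer, disc, opt_g, opt_d,
                      src_views, translated_views, w_cons=1.0, w_adv=0.1):
    """One CAS step: 3DGS renderings give the consistency target, the single-view
    translator outputs give the adversarial target, and the generator optimizer
    updates VCDM and the 3DGS stand-in jointly."""
    pred = vcdm(src_views)
    consistent_gt = renderer()  # consistent but possibly blurry pseudo GT

    # Discriminator step: real = detailed per-view pseudo GT, fake = prediction.
    opt_d.zero_grad()
    loss_d = (F.softplus(-disc(translated_views)).mean()
              + F.softplus(disc(pred.detach())).mean())
    loss_d.backward()
    opt_d.step()

    # Generator step: consistency loss + adversarial loss, joint VCDM/3DGS update.
    opt_g.zero_grad()
    loss_cons = F.l1_loss(pred, consistent_gt)
    loss_adv = F.softplus(-disc(pred)).mean()
    (w_cons * loss_cons + w_adv * loss_adv).backward()
    opt_g.step()
    return loss_cons.item(), loss_adv.item()

if __name__ == "__main__":
    n, c, h, w = 4, 3, 64, 64
    src = torch.rand(n, c, h, w)         # degraded input views
    translated = torch.rand(n, c, h, w)  # precomputed single-view translator outputs
    vcdm, renderer, disc = ToyVCDM(c), ToyRenderer(n, c, h, w), ToyDiscriminator(c)
    opt_g = torch.optim.Adam(list(vcdm.parameters()) + list(renderer.parameters()), lr=1e-4)
    opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
    print(cas_training_step(vcdm, renderer, disc, opt_g, opt_d, src, translated))
```

The design point mirrored here is that the two pseudo ground truths play different roles: the consistency loss pulls the prediction toward the 3DGS renderings, the adversarial loss pushes it toward the detail level of the per-view translator outputs, and the generator optimizer covers both the diffusion model and the 3DGS parameters so the two are optimized jointly.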
Talk
The video will be uploaded after the conference.
Results
Figure 3: Qualitative results on super-resolution. Our method generates more realistic and sharper details.
Figure 4: Qualitative results on denoising. Our method effectively removes noise while restoring detailed texture.
Figure 5: Qualitative results on deblurring. Our method removes motion blur and generates detailed textures.
Figure 6: Qualitative results on text-driven editing. Our method generates results that are more consistent and of better quality than previous state-of-the-art methods.
Acknowledgments
We would like to thank the anonymous reviewers for their constructive suggestions and comments. This work is supported by the National Key R&D Program of China (2022YFB3303400), the National Natural Science Foundation of China (62025207), and Laoshan Laboratory (No. LSKJ202300305).
Bibtex
@article{Cai2024MV2MV,
  title   = {MV2MV: Multi-View Image Translation via View-Consistent Diffusion Models},
  author  = {Cai, Youcheng and Li, Runshi and Liu, Ligang},
  journal = {ACM Transactions on Graphics (SIGGRAPH Asia 2024)},
  volume  = {43},
  number  = {6},
  year    = {2024}
}