ReconFusion: 3D Reconstruction with Diffusion Priors

1Columbia University,   2Google Research,   3Google DeepMind     * equal contribution
CVPR 2024

Abstract

3D reconstruction methods such as Neural Radiance Fields (NeRFs) excel at rendering photorealistic novel views of complex scenes. However, recovering a high-quality NeRF typically requires tens to hundreds of input images, resulting in a time-consuming capture process. We present ReconFusion to reconstruct real-world scenes using only a few photos. Our approach leverages a diffusion prior for novel view synthesis, trained on synthetic and multiview datasets, which regularizes a NeRF-based 3D reconstruction pipeline at novel camera poses beyond those captured by the set of input images. Our method synthesizes realistic geometry and texture in underconstrained regions while preserving the appearance of observed regions. We perform an extensive evaluation across various real-world datasets, including forward-facing and 360-degree scenes, demonstrating significant performance improvements over previous few-view NeRF reconstruction approaches.


ReconFusion = 3D Reconstruction + Diffusion Prior

(a) We optimize a NeRF to minimize a reconstruction loss \(\mathcal{L}_\mathrm{recon}\) between renderings and a few input images, as well as a sample loss \(\mathcal{L}_\mathrm{sample}\) between a rendering from a random pose and an image predicted by a diffusion model for that pose. (b) To generate the sample image, we use a PixelNeRF-style model to fuse information from the input images and to render a predicted feature map corresponding to the sample view camera pose. (c) This feature map is concatenated with the noisy latent (computed from the current NeRF rendering at that pose) and is provided to a diffusion model, which additionally uses CLIP embeddings of the input images via cross-attention. The resulting decoded output sample is used to enforce an image-space loss on the corresponding NeRF rendering (\(\mathcal{L}_\mathrm{sample}\)).
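For concreteness, the sketch below approximates the objective described in (a); it is not the released implementation. Here `render_nerf`, `encode_noisy_latent`, and `denoise_with_prior` are hypothetical stand-ins for the NeRF renderer, the latent encoder with added noise, and the PixelNeRF- and CLIP-conditioned diffusion model, and a plain MSE stands in for the image-space losses used in the paper.

```python
# Hedged sketch of the ReconFusion-style objective: a reconstruction loss on the
# input views plus a sample loss tying a novel-view rendering to a diffusion
# sample for that pose. All function arguments are hypothetical stand-ins.
import torch
import torch.nn.functional as F

def reconfusion_objective(render_nerf, encode_noisy_latent, denoise_with_prior,
                          input_images, input_poses, novel_pose,
                          lambda_sample=1.0):
    # (a) L_recon: renderings at the input poses should match the captured photos.
    recon_loss = sum(F.mse_loss(render_nerf(p), img)
                     for img, p in zip(input_images, input_poses))

    # (b)/(c) L_sample: render the novel pose, form a noisy latent from that
    # rendering, and let the conditioned diffusion model predict a clean image.
    novel_render = render_nerf(novel_pose)
    noisy_latent = encode_noisy_latent(novel_render)
    target = denoise_with_prior(noisy_latent, input_images, input_poses,
                                novel_pose).detach()  # sample is a fixed target

    # Image-space loss pulling the NeRF rendering toward the diffusion sample.
    sample_loss = F.mse_loss(novel_render, target)
    return recon_loss + lambda_sample * sample_loss
```

Treating the decoded sample as a detached target mirrors the figure's description of using the sample to enforce an image-space loss on the corresponding NeRF rendering.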



ReconFusion enables high-quality 3D reconstruction from few views



ReconFusion generalizes to everyday scenes: the same diffusion model prior is used for all reconstruction results.


ReconFusion outperforms other few-view NeRF methods


Baseline methods vs. ReconFusion, shown as RGB and depth renderings, with each scene trained on 9 input views. Comparisons cover DTU (scan31, scan45), LLFF (fern, horns), RealEstate10K (sofa, living room), CO3D (bench, plant), and mip-NeRF 360 (bonsai, kitchen).


ReconFusion improves both few-view and many-view reconstruction



Our diffusion prior improves performance over the Zip-NeRF baseline in both the few-view and many-view capture regimes.

Results are shown for 3, 6, 9, 18, 27, 54, and 81 input views; the left column shows the nearest input view.




ReconFusion distills a consistent 3D model from inconsistent samples


LLFF (3 views), CO3D (6 views), and mip-NeRF 360 (9 views). Top row: 3D reconstruction; bottom row: diffusion model samples.

ReconFusion recovers consistent 3D reconstructions (top) from a diffusion model that produces image samples independently for each viewpoint (bottom). These samples are not multiview consistent with one another, yet they yield high-quality 3D reconstructions when used as a prior during optimization.
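The loop below is a minimal sketch of the distillation this caption implies, assuming an `objective(pose)` callable that returns the combined loss at a given pose (for example, the sketch after the method figure) and a hypothetical `sample_random_pose` helper; only the structure is meant to be illustrative.

```python
# Hedged sketch: repeatedly applying the sample loss at random novel poses lets
# gradient descent reconcile per-view diffusion samples, which are mutually
# inconsistent, into a single consistent NeRF. Names here are illustrative only.
import torch

def distill_nerf(nerf_params, objective, sample_random_pose,
                 num_iters=10_000, lr=1e-3):
    optimizer = torch.optim.Adam(nerf_params, lr=lr)
    for _ in range(num_iters):
        pose = sample_random_pose()   # fresh novel viewpoint each iteration
        loss = objective(pose)        # L_recon + L_sample evaluated at this pose
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()              # each step nudges the NeRF toward a sample;
                                      # over many poses this averages them out
    return nerf_params
```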



Citation

Acknowledgements

We would like to thank Arthur Brussee, Ricardo Martin-Brualla, Rick Szeliski, Peter Hedman, Jason Baldridge, and Angjoo Kanazawa for their valuable contributions in discussing the project and reviewing the manuscript, and Zhicheng Wang for setting up some of the data loaders necessary for our diffusion model training pipeline. We are grateful to Randy Persaud and Henna Nandwani for infrastructure support. We also extend our gratitude to Shlomi Fruchter, Kevin Murphy, Mohammad Babaeizadeh, Han Zhang, and Amir Hertz for training the base text-to-image latent diffusion model.

The website template was borrowed from Michaël Gharbi and Ref-NeRF.