ReconFusion: 3D Reconstruction with Diffusion Priors

1Columbia University,   2Google Research,   3Google DeepMind
* equal contribution
arXiv Data

Abstract

3D reconstruction methods such as Neural Radiance Fields (NeRFs) excel at rendering photorealistic novel views of complex scenes. However, recovering a high-quality NeRF typically requires tens to hundreds of input images, resulting in a time-consuming capture process. We present ReconFusion, a method that reconstructs real-world scenes from only a few photos. Our approach leverages a diffusion prior for novel view synthesis, trained on synthetic and multiview datasets, which regularizes a NeRF-based 3D reconstruction pipeline at novel camera poses beyond those captured by the set of input images. Our method synthesizes realistic geometry and texture in underconstrained regions while preserving the appearance of observed regions. We perform an extensive evaluation across various real-world datasets, including forward-facing and 360-degree scenes, demonstrating significant performance improvements over previous few-view NeRF reconstruction approaches.


ReconFusion = 3D Reconstruction + Diffusion Prior

(a) We optimize a NeRF to minimize a reconstruction loss \(\mathcal{L}_\mathrm{recon}\) between renderings and a few input images, as well as a sample loss \(\mathcal{L}_\mathrm{sample}\) between a rendering from a random pose and an image predicted by a diffusion model for that pose. (b) To generate the sample image, we use a PixelNeRF-style model to fuse information from the input images and to render a predicted feature map corresponding to the sample view camera pose. (c) This feature map is concatenated with the noisy latent (computed from the current NeRF rendering at that pose) and is provided to a diffusion model, which additionally uses CLIP embeddings of the input images via cross-attention. The resulting decoded output sample is used to enforce an image-space loss on the corresponding NeRF rendering (\(\mathcal{L}_\mathrm{sample}\)).
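The objective in (a) combines the two losses described above. The following is a minimal sketch of that combined objective; `combined_loss` and its mean-squared-error terms are illustrative stand-ins (the actual method operates on latent-diffusion samples and uses additional perceptual terms), not the authors' code:

```python
import numpy as np

def combined_loss(rendered_inputs, input_images,
                  rendered_novel, sampled_novel, lam=1.0):
    """Toy version of the ReconFusion objective (illustrative only):
    total = L_recon + lam * L_sample.

    L_recon compares NeRF renderings at input poses to the captured images;
    L_sample compares a rendering at a random novel pose to the image the
    diffusion model predicts for that pose. MSE stands in for the paper's
    actual image-space losses.
    """
    l_recon = np.mean((rendered_inputs - input_images) ** 2)
    l_sample = np.mean((rendered_novel - sampled_novel) ** 2)
    return l_recon + lam * l_sample
```

In this sketch, `lam` plays the role of the weight balancing the diffusion-sample regularization against fidelity to the observed views.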



ReconFusion enables high-quality 3D reconstruction from few views



ReconFusion generalizes to everyday scenes: the same diffusion model prior is used for all reconstruction results.


ReconFusion outperforms other few-view NeRF methods


Baseline method (left) vs. ReconFusion (right), shown as RGB and depth renderings. Each scene is trained on 9 views. Try selecting different methods and scenes!

Scenes: DTU (scan31, scan45), LLFF (fern, horns), Re10K (sofa, living room), CO3D (bench, plant), mip-NeRF 360 (bonsai, kitchen)


ReconFusion improves both few-view and many-view reconstruction



Our diffusion prior improves performance over baseline Zip-NeRF in both the few-view and many-view sampling regimes.

Move the slider to adjust the number of views (3, 6, 9, 18, 27, 54, or 81). The left column shows the nearest input view.




ReconFusion distills a consistent 3D model from inconsistent samples


LLFF (3 views) · CO3D (6 views) · mip-NeRF 360 (9 views); top row: 3D reconstruction, bottom row: samples.

ReconFusion recovers consistent 3D reconstructions (top) from a diffusion model that produces image samples independently for each viewpoint (bottom). These samples are not multiview consistent, but can produce high-quality 3D reconstructions when used as a prior in optimization.
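The principle at work here, that optimizing a single shared 3D model against many mutually inconsistent per-view targets averages out their disagreements, can be illustrated with a deliberately tiny toy: a scalar "scene" parameter fit by gradient descent to noisy, independent "samples". This is a conceptual sketch only; the real method optimizes a full NeRF against diffusion samples, not a scalar:

```python
import numpy as np

rng = np.random.default_rng(0)

# Inconsistent per-view "samples": independent noisy observations of one scene.
true_scene = 5.0
samples = true_scene + rng.normal(scale=1.0, size=64)

# A single shared scene parameter, optimized against all samples at once
# (gradient descent on the mean-squared error, mirroring L_sample).
x = 0.0
for _ in range(500):
    grad = 2.0 * np.mean(x - samples)  # d/dx of mean((x - samples)^2)
    x -= 0.1 * grad

# The shared parameter converges to the sample mean: one consistent estimate
# distilled from many mutually inconsistent targets.
```

The analogy: each diffusion sample on its own is a plausible but inconsistent view, yet because every sample pulls on the same underlying 3D representation, the optimization settles on a single reconstruction consistent across all viewpoints.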



Citation

Acknowledgements

We would like to thank Arthur Brussee, Ricardo Martin-Brualla, Rick Szeliski, Peter Hedman, Jason Baldridge, and Angjoo Kanazawa for their valuable contributions in discussing the project and reviewing the manuscript, and Zhicheng Wang for setting up some of the data loaders necessary for our diffusion model training pipeline. We are grateful to Randy Persaud and Henna Nandwani for infrastructure support. We also extend our gratitude to Shlomi Fruchter, Kevin Murphy, Mohammad Babaeizadeh, Han Zhang and Amir Hertz for training the base text-to-image latent diffusion model.

The website template was borrowed from Michaël Gharbi and Ref-NeRF.