ReconFusion: 3D Reconstruction with Diffusion Priors
Abstract
3D reconstruction methods such as Neural Radiance Fields (NeRFs) excel at rendering photorealistic novel views of complex scenes. However, recovering a high-quality NeRF typically requires tens to hundreds of input images, resulting in a time-consuming capture process. We present ReconFusion to reconstruct real-world scenes using only a few photos. Our approach leverages a diffusion prior for novel view synthesis, trained on synthetic and multiview datasets, which regularizes a NeRF-based 3D reconstruction pipeline at novel camera poses beyond those captured by the set of input images. Our method synthesizes realistic geometry and texture in underconstrained regions while preserving the appearance of observed regions. We perform an extensive evaluation across various real-world datasets, including forward-facing and 360-degree scenes, demonstrating significant performance improvements over previous few-view NeRF reconstruction approaches.
ReconFusion = 3D Reconstruction + Diffusion Prior
(a) We optimize a NeRF to minimize a reconstruction loss \(\mathcal{L}_\mathrm{recon}\) between renderings and a few input images, as well as a sample loss \(\mathcal{L}_\mathrm{sample}\) between a rendering from a random pose and an image predicted by a diffusion model for that pose. (b) To generate the sample image, we use a PixelNeRF-style model to fuse information from the input images and to render a predicted feature map corresponding to the sample view camera pose. (c) This feature map is concatenated with the noisy latent (computed from the current NeRF rendering at that pose) and is provided to a diffusion model, which additionally uses CLIP embeddings of the input images via cross-attention. The resulting decoded output sample is used to enforce an image-space loss on the corresponding NeRF rendering (\(\mathcal{L}_\mathrm{sample}\)).
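For concreteness, below is a minimal sketch of how the two losses fit together. All functions (`render_nerf`, `pixelnerf_features`, `clip_embed`, `diffusion_sample`) and the loss weight `lam` are hypothetical placeholders invented for illustration, not the actual ReconFusion implementation; the sketch only shows the structure of \(\mathcal{L}_\mathrm{recon} + \lambda\,\mathcal{L}_\mathrm{sample}\), assuming NumPy arrays stand in for renderings, latents, and embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 64, 64  # tiny render resolution, for illustration only


def render_nerf(nerf_params, pose):
    # Placeholder: render an RGB image from the NeRF at `pose`.
    return np.clip(nerf_params["color"] + 0.01 * rng.standard_normal((H, W, 3)), 0.0, 1.0)


def pixelnerf_features(input_images, input_poses, target_pose):
    # Placeholder: PixelNeRF-style feature map rendered at the target pose,
    # fusing information from the observed input images.
    return np.mean(np.stack(input_images), axis=0)


def clip_embed(images):
    # Placeholder: CLIP-style image embeddings of the input views.
    return np.stack([img.mean(axis=(0, 1)) for img in images])


def diffusion_sample(noisy_latent, feature_map, clip_embeddings):
    # Placeholder for the diffusion model. The real model is conditioned on the
    # feature map (concatenated with the noisy latent) and on CLIP embeddings via
    # cross-attention; this stub ignores the embeddings and just blends its inputs.
    return np.clip(0.5 * noisy_latent + 0.5 * feature_map, 0.0, 1.0)


def total_loss(nerf_params, input_images, input_poses, random_pose, lam=1.0):
    # L_recon: renderings at the input poses vs. the captured photos.
    recon = np.mean([
        np.mean((render_nerf(nerf_params, p) - img) ** 2)
        for img, p in zip(input_images, input_poses)
    ])

    # L_sample: rendering at a random novel pose vs. the image predicted by the
    # diffusion model for that pose, conditioned on the input images.
    rendering = render_nerf(nerf_params, random_pose)
    noisy_latent = rendering + 0.1 * rng.standard_normal(rendering.shape)
    features = pixelnerf_features(input_images, input_poses, random_pose)
    sample = diffusion_sample(noisy_latent, features, clip_embed(input_images))
    sample_loss = np.mean((rendering - sample) ** 2)

    return recon + lam * sample_loss


# Toy usage: three "input photos", identity poses, and one random novel pose.
images = [rng.random((H, W, 3)) for _ in range(3)]
poses = [np.eye(4) for _ in range(3)]
params = {"color": np.full((H, W, 3), 0.5)}
print(total_loss(params, images, poses, np.eye(4)))
```

In the real pipeline the noisy latent comes from a latent diffusion model and the sample loss is applied in image space on the decoded output; the sketch collapses all of that into plain arrays so the composition of the two losses is easy to see.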
ReconFusion enables high-quality 3D reconstruction from few views
ReconFusion generalizes to everyday scenes: the same diffusion model prior is used for all reconstruction results.
ReconFusion outperforms other few-view NeRF methods
Baseline method (left) vs. ReconFusion (right), trained on 9 views.
ReconFusion improves both few-view and many-view reconstruction
Our diffusion prior improves performance over baseline Zip-NeRF in both the few-view and many-view sampling regimes.
Comparisons are shown for 3, 6, 9, 18, 27, 54, and 81 input views.
ReconFusion distills a consistent 3D model from inconsistent samples
Video grid: 3D Reconstruction (top row) and Samples (bottom row) on LLFF (3 views), CO3D (6 views), and mip-NeRF 360 (9 views).
ReconFusion recovers consistent 3D reconstructions (top) from a diffusion model that produces image samples independently for each viewpoint (bottom). Although these samples are not multiview consistent, they still yield high-quality 3D reconstructions when used as a prior during optimization.
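As a loose analogy (not the actual method), the toy sketch below illustrates why independently drawn, mutually inconsistent per-view samples can still pull a single shared model toward a consistent solution when they are used only as optimization targets; every name and number here is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
scene = np.array([0.2, 0.6, 0.4])   # toy "scene" (a single color to recover)
model = np.zeros(3)                  # shared estimate optimized across views

for step in range(2000):
    # Each step draws an independent, noisy "sample" for a fresh random view;
    # the samples disagree with one another, standing in for multiview inconsistency.
    sample = scene + 0.3 * rng.standard_normal(3)
    model -= 0.01 * (model - sample)  # gradient step on ||model - sample||^2

print(np.round(model, 2))  # ends up close to `scene` despite the inconsistent samples
```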
Citation
Acknowledgements
We would like to thank Arthur Brussee, Ricardo Martin-Brualla, Rick Szeliski, Peter Hedman, Jason Baldridge, and Angjoo Kanazawa for their valuable contributions in discussing the project and reviewing the manuscript, and Zhicheng Wang for setting up some of the data loaders necessary for our diffusion model training pipeline. We are grateful to Randy Persaud and Henna Nandwani for infrastructure support. We also extend our gratitude to Shlomi Fruchter, Kevin Murphy, Mohammad Babaeizadeh, Han Zhang and Amir Hertz for training the base text-to-image latent diffusion model.
The website template was borrowed from Michaël Gharbi and Ref-NeRF.