SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation
We present a novel framework built to simplify 3D asset generation for amateur users.
To enable a variety of multimodal inputs, we employ task-specific encoders with dropout followed by a cross-attention mechanism.
Most interestingly, our model can combine all of these tasks into one Swiss-army-knife tool: the user can generate shapes from incomplete shapes, images, and textual descriptions at the same time, specifying the relative weight of each input, which facilitates interactive generation.
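The sketch below illustrates how such multimodal conditioning can be wired up: one encoder per modality, dropout applied to each condition, and cross-attention from the shape latent to the concatenated condition tokens. It is a minimal sketch in PyTorch; all module and argument names are hypothetical placeholders, not taken from the SDFusion code base.

```python
import torch
import torch.nn as nn

class MultiModalConditioner(nn.Module):
    """Illustrative sketch: one encoder per modality, dropout on each
    condition, and cross-attention from shape-latent tokens to the
    concatenated condition tokens. Names and feature sizes are
    hypothetical, not the authors' implementation."""

    def __init__(self, dim=256, n_heads=4, p_drop=0.2):
        super().__init__()
        # Task-specific encoders (placeholders for real image / text /
        # partial-shape encoders).
        self.image_enc = nn.Linear(512, dim)
        self.text_enc = nn.Linear(768, dim)
        self.shape_enc = nn.Linear(1024, dim)
        self.drop = nn.Dropout(p_drop)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, z_tokens, image_feat=None, text_feat=None, shape_feat=None):
        # z_tokens: (B, N, dim) tokens of the latent shape code.
        conds = []
        if image_feat is not None:
            conds.append(self.drop(self.image_enc(image_feat)))
        if text_feat is not None:
            conds.append(self.drop(self.text_enc(text_feat)))
        if shape_feat is not None:
            conds.append(self.drop(self.shape_enc(shape_feat)))
        if not conds:
            return z_tokens                          # unconditional path
        ctx = torch.cat(conds, dim=1)                # (B, M, dim) condition tokens
        out, _ = self.attn(z_tokens, ctx, ctx)       # cross-attention to conditions
        return z_tokens + out
```

Dropping conditions at training time (here via dropout, or by omitting them entirely) is what later allows unconditional and partially conditioned predictions to be mixed at sampling time.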
Authors
Yen-Chi Cheng, Hsin-Ying Lee, Sergey Tulyakov, Alexander Schwing, Liangyan Gui
We introduce SDFusion, a diffusion-based generative model that uses a signed distance function (SDF) as its underlying 3D representation.
To learn the probability distribution over shapes, we encode them into a compact latent space and leverage diffusion models, which have recently achieved great success on a variety of 2D generation tasks.
We foresee an ideal collaborative paradigm for generative methods where models trained on 3D data provide detailed and accurate geometry, while models trained on 2D data provide diverse appearances.
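As a concrete, simplified illustration of training a diffusion model on latent shape codes, the sketch below implements a single DDPM-style noise-prediction step. The `denoiser` network and the linear noise schedule are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, z0, cond, T=1000, device="cpu"):
    """One DDPM-style training step on latent shape codes z0 (a sketch).
    `denoiser` is any network that predicts the added noise given the
    noisy latent, the timestep, and the condition."""
    B = z0.shape[0]
    betas = torch.linspace(1e-4, 2e-2, T, device=device)      # linear schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    t = torch.randint(0, T, (B,), device=device)               # random timesteps
    noise = torch.randn_like(z0)                                # target noise
    a = alpha_bar[t].view(B, *([1] * (z0.dim() - 1)))
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * noise              # forward (noising) process

    pred = denoiser(z_t, t, cond)                               # predict the noise
    return F.mse_loss(pred, noise)                              # simple DDPM objective
```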
Result
We present a diffusion model trained on signed distance functions for 3D shape generation.
To cope with the computational demands of 3D representations, we first encode 3D shapes into an expressive low-dimensional latent space, on which we train the diffusion model.
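A minimal sketch of such a compression stage is shown below, assuming 64^3 truncated SDF grids and plain 3D convolutions; the channel sizes are illustrative and the actual learned compressor is more elaborate than this toy autoencoder.

```python
import torch
import torch.nn as nn

class SDFAutoencoder(nn.Module):
    """Sketch: compress a 64^3 SDF grid into an 8^3 latent volume with
    3D convolutions, then decode it back. Sizes are illustrative only."""

    def __init__(self, latent_ch=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 32, 4, stride=2, padding=1),                   # 64 -> 32
            nn.ReLU(),
            nn.Conv3d(32, 64, 4, stride=2, padding=1),                  # 32 -> 16
            nn.ReLU(),
            nn.Conv3d(64, latent_ch, 4, stride=2, padding=1),           # 16 -> 8
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_ch, 64, 4, stride=2, padding=1),  # 8 -> 16
            nn.ReLU(),
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1),         # 16 -> 32
            nn.ReLU(),
            nn.ConvTranspose3d(32, 1, 4, stride=2, padding=1),          # 32 -> 64
        )

    def forward(self, sdf):               # sdf: (B, 1, 64, 64, 64)
        z = self.encoder(sdf)             # (B, latent_ch, 8, 8, 8) latent code
        return self.decoder(z), z
```

The diffusion model then operates on the small latent volume `z` rather than on the full-resolution SDF grid, which is what makes training tractable.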
To enable flexible conditioning, we adopt task-specific encoders together with a cross-attention mechanism to handle conditions from multiple modalities, and leverage classifier-free guidance to control the weight given to each modality.
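One way to combine classifier-free guidance across modalities at sampling time is sketched below: each modality contributes its own guidance term, scaled by a user-chosen weight. The denoiser's keyword-argument interface is an assumption made for illustration.

```python
import torch

def guided_noise_prediction(denoiser, z_t, t, conds, weights):
    """Combine per-modality classifier-free guidance at sampling time
    (a sketch). `conds` maps modality name -> condition tensor, and
    `weights` maps the same names -> guidance scales; the denoiser is
    assumed to accept each modality as an optional keyword argument."""
    # Fully unconditional prediction (all conditions dropped).
    eps_uncond = denoiser(z_t, t)
    eps = eps_uncond
    for name, c in conds.items():
        # Prediction with only this modality present.
        eps_cond = denoiser(z_t, t, **{name: c})
        # Push the sample toward this condition by its user-chosen weight.
        eps = eps + weights[name] * (eps_cond - eps_uncond)
    return eps
```

Setting a modality's weight to zero removes its influence entirely, while raising it makes the generated shape follow that input more closely, which is what enables interactive, weighted multimodal control.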
We further demonstrate an application that takes advantage of a pretrained 2D text-to-image model to texture a generated 3D shape.