TextCraft: Zero-Shot Generation of High-Fidelity and Diverse Shapes from Text
Language is one of the primary means by which we describe the 3D world around
us. While rapid progress has been made in text-to-2D-image synthesis, similar
progress in text-to-3D-shape synthesis has been hindered by the lack of paired
(text, shape) data. Moreover, extant methods for text-to-shape generation have
limited shape diversity and fidelity. We introduce TextCraft, a method to
address these limitations by producing high-fidelity and diverse 3D shapes
without the need for (text, shape) pairs for training. TextCraft achieves this
by leveraging CLIP together with a multi-resolution approach: it first
generates a shape in a low-dimensional latent space and then upscales it to a
higher resolution, improving the fidelity of the generated shape. To improve
shape diversity, we use a discrete latent space, which is modelled using a
bidirectional transformer conditioned on the interchangeable image-text
embedding space induced by CLIP, as sketched below.
Moreover, we present a novel variant of classifier-free guidance (its standard
form is recalled below), which further improves the accuracy-diversity
trade-off. Finally, we perform extensive
experiments that demonstrate that TextCraft outperforms state-of-the-art
baselines.
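
To make the conditioning described above concrete, the following is a minimal
sketch of a bidirectional (non-causal) transformer that predicts masked
discrete shape tokens given a CLIP embedding. This is not the authors'
implementation: the module name, codebook size, sequence length, and the
choice to prepend the projected CLIP embedding as a conditioning token are all
illustrative assumptions.

import torch
import torch.nn as nn


class MaskedShapeTransformer(nn.Module):
    """Hypothetical masked-token model over a discrete shape latent."""

    def __init__(self, vocab_size=1024, seq_len=512, dim=512,
                 depth=8, heads=8, clip_dim=512):
        super().__init__()
        self.mask_id = vocab_size  # one extra id reserved for [MASK]
        self.tok_emb = nn.Embedding(vocab_size + 1, dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, seq_len, dim))
        # Project the CLIP text (or image) embedding into a conditioning token.
        self.cond_proj = nn.Linear(clip_dim, dim)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        # Encoder layers attend bidirectionally: no causal mask is applied.
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.to_logits = nn.Linear(dim, vocab_size)

    def forward(self, tokens, clip_emb):
        # tokens: (B, L) discrete latent ids, some replaced by the [MASK] id.
        # clip_emb: (B, clip_dim) CLIP embedding of the text or image.
        x = self.tok_emb(tokens) + self.pos_emb[:, :tokens.size(1)]
        cond = self.cond_proj(clip_emb).unsqueeze(1)  # (B, 1, dim)
        x = torch.cat([cond, x], dim=1)               # prepend condition token
        x = self.encoder(x)
        return self.to_logits(x[:, 1:])               # per-token codebook logits


# Toy usage: mask half the token grid and predict codebook logits.
model = MaskedShapeTransformer()
tokens = torch.randint(0, 1024, (2, 512))
mask = torch.rand(2, 512) < 0.5
tokens[mask] = model.mask_id
clip_emb = torch.randn(2, 512)        # stand-in for a real CLIP embedding
logits = model(tokens, clip_emb)      # shape (2, 512, 1024)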
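
For context on the guidance variant, the standard classifier-free guidance
rule, written here in the logit-space form commonly used with masked
generative transformers, combines conditional and unconditional predictions
with a guidance scale s >= 0; larger s trades diversity for adherence to the
condition. The paper's contribution is a variant of this rule; the formula
below recalls only the standard baseline it builds on.

\[
  \ell_{\text{guided}}(x \mid c) \;=\; (1 + s)\,\ell_{\text{cond}}(x \mid c)
  \;-\; s\,\ell_{\text{uncond}}(x)
\]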