The goal of this paper is to augment a pre-trained text-to-image diffusion
model with the ability of open-vocabulary object grounding, i.e.,
simultaneously generating images and segmentation masks fo
Recent progress in diffusion models has revolutionized the popular technology
of text-to-image generation. While existing approaches could produce
photorealistic high-resolution images with text condi
Recent breakthroughs in text-to-image synthesis have been driven by diffusion
models trained on billions of image-text pairs. Adapting this approach to 3D
synthesis would require large-scale datasets
In this paper, we make the first attempt to introduce an explicit 3D shape prior into CLIP-guided 3D optimization methods.
Specifically, we first generate a high-quality 3D shape from the input text in the text-to-shape stage as the 3D shape prior, use it to initialize a neural radiance field, and then optimize the field with the full prompt.
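A minimal sketch of this two-stage pipeline, under stated assumptions: the text_to_shape generator, the tiny NeRF stand-in, and the clip_similarity scorer below are hypothetical placeholders (none of these names come from the paper), kept only to show the flow of shape prior, field initialization, and full-prompt optimization.

```python
import torch

# Hypothetical stand-in for the text-to-shape stage: returns a coarse 3D prior
# (here a dummy occupancy grid) instead of a real generated shape.
def text_to_shape(prompt: str) -> torch.Tensor:
    return torch.rand(32, 32, 32)

class NeRF(torch.nn.Module):
    """Toy radiance field: a single learnable grid standing in for the real MLP."""
    def __init__(self):
        super().__init__()
        self.field = torch.nn.Parameter(torch.zeros(32, 32, 32))

    def init_from_shape(self, shape_prior: torch.Tensor) -> None:
        # Use the 3D shape prior as the initialization of the field.
        with torch.no_grad():
            self.field.copy_(shape_prior)

    def render(self) -> torch.Tensor:
        # Placeholder "rendering": collapse the field to an image-like tensor.
        return torch.sigmoid(self.field).mean(dim=0)

def clip_similarity(image: torch.Tensor, prompt: str) -> torch.Tensor:
    # Placeholder for a CLIP image-text similarity score.
    return image.mean()

prompt = "a wooden chair shaped like an avocado"
nerf = NeRF()
nerf.init_from_shape(text_to_shape(prompt))
optimizer = torch.optim.Adam(nerf.parameters(), lr=1e-2)

# CLIP-guided optimization of the initialized field with the full prompt.
for step in range(100):
    optimizer.zero_grad()
    loss = -clip_similarity(nerf.render(), prompt)
    loss.backward()
    optimizer.step()
```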
We present a new generation process that enables versatile and controllable image generation, using a pre-trained text-to-image diffusion model, without any further training or finetuning.
Our approach is based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints.
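A minimal sketch of how several diffusion generation processes can be bound to a shared set of parameters, assuming a placeholder denoise_step and an illustrative two-region layout (both are assumptions, not the paper's actual formulation): each region is denoised with its own prompt, and overlapping predictions are reconciled on a shared latent canvas.

```python
import torch

# Placeholder for one reverse-diffusion step of a pre-trained text-to-image model.
def denoise_step(latent: torch.Tensor, prompt: str, t: int) -> torch.Tensor:
    return latent - 0.01 * torch.randn_like(latent)  # dummy update

# Illustrative regions (row slice, column slice, prompt); they overlap in the middle.
regions = [(slice(0, 64), slice(0, 64), "a castle"),
           (slice(0, 64), slice(32, 96), "a dragon in the sky")]

canvas = torch.randn(1, 4, 64, 96)  # shared latent: the "shared set of parameters"

for t in reversed(range(50)):
    updates = torch.zeros_like(canvas)
    counts = torch.zeros_like(canvas)
    # Run one denoising step per region, each conditioned on its own prompt.
    for rows, cols, prompt in regions:
        crop = canvas[:, :, rows, cols]
        updates[:, :, rows, cols] += denoise_step(crop, prompt, t)
        counts[:, :, rows, cols] += 1
    # Bind the processes together: average overlapping predictions into the canvas.
    canvas = updates / counts.clamp(min=1)
```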
We present UniTune, a simple and novel method for general text-driven image editing.
UniTune takes an arbitrary image and a textual edit description as input, and carries out the edit while maintaining high semantic and visual fidelity to the input image.
Text-to-Image models have introduced a remarkable leap in the evolution of
machine learning, demonstrating high-quality synthesis of images from a given
text prompt. However, these powerful pretrained
We present Imagen, a text-to-image diffusion model with an unprecedented
degree of photorealism and a deep level of language understanding. Imagen
builds on the power of large transformer language mod
Text-conditioned image editing has recently attracted considerable interest.
However, most methods are currently either limited to specific editing types
(e.g., object overlay, style transfer), or app
While recent work on text-conditional 3D object generation has shown
promising results, the state-of-the-art methods typically require multiple
GPU-hours to produce a single sample. This is in stark c
We propose Swinv2-Imagen, a novel text-to-image diffusion model based on a hierarchical visual transformer and a scene graph that incorporates a semantic layout.
In the proposed model, feature vectors of entities and relationships are extracted and incorporated into the diffusion model, effectively improving the quality of the generated images.
We show that classifier-free guidance can be leveraged as a critic, enabling generators to distill knowledge from large-scale text-to-image diffusion models and efficiently shift to new domains indicated by text prompts, without access to ground-truth samples from the target domains.
The proposed method is the first attempt to incorporate large-scale pre-trained diffusion models and distillation sampling for text-driven image generator domain adaptation, and it achieves quality previously out of reach.
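A minimal sketch of this distillation-sampling pattern under stated assumptions: the toy generator, the predict_noise placeholder, and the simplified noising step below are illustrative stand-ins rather than the paper's models; the core shown is the classifier-free-guidance critic driving a score-distillation-style gradient toward the text-indicated domain without target-domain samples.

```python
import torch

# Toy generator standing in for the image generator being adapted.
generator = torch.nn.Sequential(torch.nn.Linear(64, 3 * 16 * 16))

# Placeholder for the frozen diffusion model's noise prediction eps(x_t, t, prompt).
def predict_noise(noisy, t, prompt):
    return torch.randn_like(noisy)

prompt = "a photo in the style of an oil painting"  # text prompt naming the target domain
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4)
guidance_scale = 7.5

for step in range(100):
    z = torch.randn(8, 64)
    images = generator(z).view(8, 3, 16, 16)

    # Forward diffusion at a random timestep (simplified; real schedules scale by alphas).
    t = torch.randint(1, 1000, (1,))
    noise = torch.randn_like(images)
    noisy = images + noise

    # Classifier-free guidance acts as the critic: conditional minus unconditional score.
    eps_cond = predict_noise(noisy, t, prompt)
    eps_uncond = predict_noise(noisy, t, None)
    eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # Distillation-sampling-style gradient: push generator outputs toward the
    # prompted domain without any ground-truth samples from that domain.
    grad = (eps - noise).detach()
    loss = (grad * images).sum() / images.shape[0]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```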
Text-to-image synthesis has been a revolutionary breakthrough in the evolution of generative artificial intelligence (generative AI), allowing us to synthesize diverse images that convey highly complex visual concepts.
However, a pivotal challenge in leveraging such models for real-world content creation tasks is providing users with control over the generated content.
Diffusion models (DMs) have become the new trend of generative models and
have demonstrated a powerful ability for conditional synthesis. Among those,
text-to-image diffusion models pre-trained on larg
Transferring large amount of high resolution images over limited bandwidth is
an important but very challenging task. Compressing images using extremely low
bitrates (<0.1 bpp) has been studied but it
Large-scale text-to-image generative models have shown their remarkable
ability to synthesize diverse and high-quality images. However, it is still
challenging to directly apply these models for editi
Distribution shifts are a major source of failure of deployed machine
learning models. However, evaluating a model's reliability under distribution
shifts can be challenging, especially since it may b
Diffusion models have shown remarkable capabilities in generating high
quality and creative images conditioned on text. An interesting application of
such models is structure preserving text guided im