UPainting: Unified Text-to-Image Diffusion Generation with Cross-modal Guidance