DDCap: A diffusion-based text decoding framework for image captioning
Exploring Discrete Diffusion Models for Image Captioning
The image captioning task is typically realized by an auto-regressive method that decodes the text tokens one by one.
However, image captions consist of categorical (discrete) tokens and are short, with varied lengths.
Therefore, naively applying the discrete diffusion model to text decoding does not work well, as shown in our experiments.
To address the performance gap, we propose several key techniques including best-first inference, concentrated attention mask, text length prediction, and image-free training.
With 4M vision-language pre-training images and the base-sized model, we reach a CIDEr score of 125.1 on COCO without additional caption pre-training, which is +5.0 higher than the auto-regressive baseline with the same architecture in the controlled setting.
It also performs +26.8 higher on a caption infilling task.
Authors
Zixin Zhu, Yixuan Wei, Jianfeng Wang, Zhe Gan, Zheng Zhang, Le Wang, Gang Hua, Lijuan Wang, Zicheng Liu, Han Hu
In this work, we conduct a systematic study on how we may adopt diffusion models for accurate image captioning.
Our first attempt at naively applying either continuous or discrete diffusion models to image captioning did not work well, producing results substantially inferior to state-of-the-art results from mainstream auto-regressive models pre-trained on millions or even billions of images.
This motivated us to conduct a careful gap analysis. Observing that text tokens are discrete in nature, we focused our exploration on making discrete diffusion models produce accurate image captions.
Our proposed model, DDCap, is specifically designed to address these gaps.
First, we add a network branch that predicts the total token length, so the model can flexibly accommodate captions of varying length.
Second, we design a concentrated attention mask so that the model adaptively attends to the more informative tokens.
Third, we propose a best-first inference strategy, where the top-K recovered tokens from each diffusion step remain unchanged in subsequent diffusion steps.
In other words, at each diffusion step we only add noise to the tokens that were not marked as fixed in previous steps.
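To make the best-first procedure concrete, below is a minimal PyTorch-style sketch. The denoiser interface model(image_feats, tokens), the MASK_ID constant, and the loop structure are illustrative assumptions rather than the authors' implementation; in practice the sequence length would come from the length-prediction branch rather than being passed in directly.

```python
import torch

MASK_ID = 0  # hypothetical id of the [MASK]/noise token in the vocabulary


@torch.no_grad()
def best_first_decode(model, image_feats, seq_len, num_steps, k):
    """Illustrative best-first inference: at every diffusion step, the k most
    confident newly recovered tokens are frozen, and only the positions that
    are not yet fixed are re-masked (noised) before the next step."""
    tokens = torch.full((1, seq_len), MASK_ID, dtype=torch.long)
    fixed = torch.zeros(1, seq_len, dtype=torch.bool)  # positions kept from earlier steps

    for _ in range(num_steps):
        logits = model(image_feats, tokens)        # (1, seq_len, vocab_size), assumed interface
        conf, pred = logits.softmax(-1).max(-1)    # per-position confidence and predicted token

        # Do not re-select positions that are already fixed.
        conf = conf.masked_fill(fixed, -1.0)

        # Freeze the top-k most confident newly recovered tokens.
        num_new = min(k, int((~fixed).sum()))
        newly_fixed = conf.topk(num_new, dim=-1).indices
        fixed.scatter_(1, newly_fixed, True)

        # Fixed positions take (or keep) their predicted token; all other
        # positions are re-masked before the next diffusion step.
        tokens = torch.where(tokens == MASK_ID, pred, tokens)
        tokens = torch.where(fixed, tokens, torch.full_like(tokens, MASK_ID))

    return tokens
```

In this sketch, once a position is frozen it is never re-noised, which matches the description above: each step only denoises the positions left open by earlier steps.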
Result
This paper introduces a novel discrete diffusion model, termed DDCap, for image captioning.
DDCap introduces several novel designs: length prediction, a concentrated attention mask, and best-first inference.
Ablation experiments under controlled settings demonstrate the effectiveness of each component.
Results on the COCO dataset show that DDCap is comparable to state-of-the-art auto-regressive methods.
In addition, we introduce a new caption infilling task to highlight the advantages of our approach.