DiffusionDet: A Diffusion Approach to Object Detection
DiffusionDet: Diffusion Model for Object Detection
We propose a new framework that formulates object detection as a denoising diffusion process from noisy boxes to object boxes.
During training, object boxes diffuse from ground-truth boxes to a random distribution, and the model learns to reverse this noising process.
At inference, the model progressively refines a set of randomly generated boxes into the output results.
Extensive evaluations on standard benchmarks, including MS-COCO and LVIS, show that the proposed framework achieves favorable performance compared to previous well-established detectors.
Our work brings two important findings in object detection.
First, random boxes, although drastically different from pre-defined anchors or learned queries, are also effective object candidates.
Second, object detection, one of the representative perception tasks, can be solved in a generative way.
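The noising direction used in training can be illustrated with a minimal sketch. It assumes the standard Gaussian diffusion form, in which a noisy box is a weighted mix of the ground-truth box and Gaussian noise, together with a cosine-shaped schedule and a normalized (cx, cy, w, h) box encoding. The names `alpha_bar` and `noise_boxes`, the schedule, and the box encoding are illustrative assumptions rather than details stated here.

```python
import math
import random


def alpha_bar(t, num_steps):
    """Fraction of signal kept at step t (assumed cosine schedule: 1 at t=0, ~0 at t=num_steps)."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2


def noise_boxes(gt_boxes, t, num_steps, rng):
    """Forward process: mix every ground-truth coordinate with Gaussian noise."""
    a = alpha_bar(t, num_steps)
    signal, sigma = math.sqrt(a), math.sqrt(1.0 - a)
    return [[signal * c + sigma * rng.gauss(0.0, 1.0) for c in box] for box in gt_boxes]


rng = random.Random(0)
gt = [[0.5, 0.5, 0.2, 0.3]]  # one box as normalized (cx, cy, w, h), an assumed encoding
slightly_noisy = noise_boxes(gt, t=10, num_steps=1000, rng=rng)
pure_noise = noise_boxes(gt, t=1000, num_steps=1000, rng=rng)
```

At `t=0` the boxes are returned unchanged, while at `t=num_steps` almost no trace of the ground truth remains, which matches the description of boxes diffusing from ground truth toward a random distribution.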
Object detection aims to predict a set of bounding boxes and associated category labels for targeted objects in one image.
As a fundamental visual recognition task, it has become the cornerstone of many related recognition scenarios, such as instance segmentation, pose estimation, action recognition, object tracking, and visual relationship detection.
Modern object detection approaches have been evolving with the development of object candidates, from empirical object priors to learnable object queries.
While query-based detectors achieve a simple and effective design, they still depend on a fixed set of learnable queries.
This raises the question of whether detection can work without even learnable queries as object candidates.
We answer this question by designing a novel framework that directly detects objects from a set of random boxes.
Starting from purely random boxes, which do not contain learnable parameters that need to be optimized in training, we expect to gradually refine the positions and sizes of these boxes until they perfectly cover the targeted objects.
This approach requires neither heuristic object priors nor learnable queries, further simplifying the object candidates and pushing the development of the detection pipeline forward.
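The refinement from random boxes can be sketched as a short sampling loop. This is a sketch under stated assumptions, not the framework's actual sampler: it uses a deterministic DDIM-style update and a cosine-shaped schedule, and the `predict_x0` callback is a hypothetical stand-in for the trained detector, which would estimate clean boxes from noisy boxes and image features.

```python
import math
import random


def alpha_bar(t, num_steps):
    """Fraction of signal kept at step t (assumed cosine schedule)."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2


def refine(boxes, predict_x0, num_steps, sample_steps):
    """Walk from pure noise (t=num_steps) to clean boxes (t=0) in sample_steps jumps."""
    times = [round(num_steps * (1 - i / sample_steps)) for i in range(sample_steps + 1)]
    for t, t_next in zip(times[:-1], times[1:]):
        a, a_next = alpha_bar(t, num_steps), alpha_bar(t_next, num_steps)
        refined = []
        for box in boxes:
            x0 = predict_x0(box, t)  # predicted clean box (hypothetical model call)
            # Noise implied by the current box and the prediction.
            eps = [(c - math.sqrt(a) * p) / math.sqrt(1.0 - a) for c, p in zip(box, x0)]
            # Re-noise the prediction to the lower noise level of the next step.
            refined.append([math.sqrt(a_next) * p + math.sqrt(1.0 - a_next) * e
                            for p, e in zip(x0, eps)])
        boxes = refined
    return boxes


rng = random.Random(0)
target = [0.5, 0.5, 0.2, 0.3]  # assumed normalized (cx, cy, w, h) of one object
random_boxes = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]


def oracle(box, t):
    """Hypothetical perfect predictor, used only to exercise the loop."""
    return target


refined = refine(random_boxes, oracle, num_steps=1000, sample_steps=4)
```

The random starting boxes carry no learnable parameters; all learned behavior sits behind `predict_x0`. With the oracle predictor above, every box ends exactly on the target, which only demonstrates the loop mechanics and says nothing about real detector accuracy.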
Result
In this work, we propose a novel detection paradigm, DiffusionDet, by viewing object detection as a denoising diffusion process from noisy boxes to object boxes.
Our noise-to-box pipeline has several appealing properties, including dynamic boxes and progressive refinement, which let us use the same network parameters to obtain the desired speed-accuracy trade-off without retraining the model.
Experiments on standard detection benchmarks show that DiffusionDet achieves favorable performance compared to well-established detectors.
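The speed-accuracy trade-off mentioned above can be illustrated with a toy example in which the number of boxes and the number of refinement steps are call-time arguments rather than trained quantities. The refinement step below is a hypothetical stand-in that simply moves each box halfway toward a known object; it is not a diffusion sampler and not the trained model, and the names `toy_refine_step`, `detect`, and `max_error` are assumptions for illustration.

```python
import random


def toy_refine_step(box, target, rate=0.5):
    """Hypothetical stand-in for one pass of a trained detector: move partway to the object."""
    return [c + rate * (g - c) for c, g in zip(box, target)]


def detect(target, num_boxes, num_steps, seed=0):
    """Start from random boxes and refine them; both counts are chosen at inference time."""
    rng = random.Random(seed)
    boxes = [[rng.random() for _ in range(4)] for _ in range(num_boxes)]
    for _ in range(num_steps):
        boxes = [toy_refine_step(b, target) for b in boxes]
    return boxes


def max_error(boxes, target):
    """Largest coordinate-wise distance between any box and the object."""
    return max(abs(c - g) for b in boxes for c, g in zip(b, target))


target = [0.5, 0.5, 0.2, 0.3]  # assumed normalized (cx, cy, w, h)
fast = detect(target, num_boxes=10, num_steps=1)        # cheaper, coarser
accurate = detect(target, num_boxes=100, num_steps=8)   # costlier, tighter
```

Both calls reuse the same "model" (`toy_refine_step`); only the inference-time settings differ, which is the property the noise-to-box pipeline is described as offering.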