Multi-modal Transformers Excel at Class-agnostic Object Detection
We advocate that existing class-agnostic object detection methods lack a top-down supervision signal governed by human-understandable semantics.
To bridge this gap, we explore recent multi-modal vision transformers (MViTs) that have been trained with aligned image-text pairs.
Our extensive experiments across various domains and novel objects demonstrate the state-of-the-art performance of MViTs in localizing generic objects in images.
Based on these findings, we develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention that can adaptively generate proposals given a specific language query.
We show the significance of MViT proposals in a diverse range of applications, including open-world object detection, salient and camouflaged object detection, and supervised and self-supervised detection tasks.
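As a rough illustration of the idea behind such an architecture, the sketch below shows one way a set of object queries could jointly attend to multi-scale image features and a text-query embedding to produce box proposals. This is a minimal sketch under assumed interfaces, not the paper's implementation: `LanguageConditionedProposer` and all of its names and shapes are hypothetical, and a standard `nn.MultiheadAttention` stands in for the deformable attention used in the actual model.

```python
# Minimal sketch of language-conditioned proposal generation.
# All module/class names are illustrative assumptions, and plain
# nn.MultiheadAttention substitutes for deformable attention.

import torch
import torch.nn as nn

class LanguageConditionedProposer(nn.Module):
    def __init__(self, dim=256, num_queries=100, num_scales=3):
        super().__init__()
        # Learned object queries that attend to the fused image-text features.
        self.object_queries = nn.Parameter(torch.randn(num_queries, dim))
        # Per-scale projections into a common embedding dimension.
        self.scale_proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_scales)])
        # Cross-attention from object queries to the fused feature sequence.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # Heads predicting a box (cx, cy, w, h in [0, 1]) and an objectness score.
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))
        self.score_head = nn.Linear(dim, 1)

    def forward(self, multi_scale_feats, text_emb):
        # multi_scale_feats: list of (B, N_i, dim) flattened feature maps.
        # text_emb: (B, T, dim) token embeddings of the language query.
        fused = [proj(f) for proj, f in zip(self.scale_proj, multi_scale_feats)]
        memory = torch.cat(fused + [text_emb], dim=1)       # (B, sum N_i + T, dim)
        batch = memory.size(0)
        queries = self.object_queries.unsqueeze(0).expand(batch, -1, -1)
        attended, _ = self.cross_attn(queries, memory, memory)
        boxes = self.box_head(attended).sigmoid()           # (B, num_queries, 4)
        scores = self.score_head(attended).squeeze(-1)      # (B, num_queries)
        return boxes, scores

# Toy usage: three feature scales and a 6-token query such as "all objects".
model = LanguageConditionedProposer()
feats = [torch.randn(2, n, 256) for n in (400, 100, 25)]
text = torch.randn(2, 6, 256)
boxes, scores = model(feats, text)
print(boxes.shape, scores.shape)  # torch.Size([2, 100, 4]) torch.Size([2, 100])
```

Because the language query is part of the attended sequence, changing the query (e.g., "all objects" versus a category-specific phrase) changes which regions the object queries focus on, which is the mechanism that lets a single model adapt its proposals to different prompts.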
Authors
Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer, Ming-Hsuan Yang