In dense image segmentation tasks (e.g., semantic and panoptic segmentation),
existing methods struggle to generalize to unseen image domains, beyond
predefined classes, and across variations in image resolution and quality.
Motivated by these observations, we
construct a large-scale entity segmentation dataset to explore fine-grained
entity segmentation, with a strong focus on open-world and high-quality dense
segmentation. The dataset contains images spanning diverse image domains and
resolutions, along with high-quality mask annotations for training and testing.
Given the high quality and high resolution of the dataset, we propose
CropFormer for high-quality segmentation, which improves mask prediction
using high-resolution image crops that provide finer image details than
the full image. CropFormer is the first query-based Transformer architecture
that can effectively ensemble mask predictions from multiple image crops, by
learning queries that can associate the same entities across the full image and
its crop. With CropFormer, we achieve a significant AP gain of $1.9$ on the
challenging fine-grained entity segmentation task. The dataset and code will be
released at this http URL
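
To make the idea of ensembling per-query mask predictions from the full image and a crop concrete, the following is a minimal sketch and not CropFormer's actual fusion rule. It assumes that query q refers to the same entity in both views, that the crop prediction has already been resampled onto the full-image pixel grid, and that scores are simply averaged inside the crop region. The names fuse_masks, binarize, and crop_box are hypothetical.

```python
def fuse_masks(full_scores, crop_scores, crop_box):
    """Average per-query mask scores inside the crop region (illustrative only).

    full_scores: list of Q masks, each an H x W nested list of floats in [0, 1].
    crop_scores: list of Q masks covering crop_box on the same pixel grid
                 (assumption: already resampled to full-image coordinates).
    crop_box:    (top, left, bottom, right), with bottom/right exclusive.
    """
    top, left, bottom, right = crop_box
    # Copy the full-image scores so the inputs are left unmodified.
    fused = [[row[:] for row in mask] for mask in full_scores]
    for q, crop_mask in enumerate(crop_scores):
        # Assumption: query q denotes the same entity in both views,
        # so its two predictions can be combined pixel by pixel.
        for y in range(top, bottom):
            for x in range(left, right):
                fused[q][y][x] = 0.5 * (
                    full_scores[q][y][x] + crop_mask[y - top][x - left]
                )
    return fused


def binarize(scores, threshold=0.5):
    """Turn per-pixel scores into binary masks."""
    return [[[1 if v > threshold else 0 for v in row] for row in mask]
            for mask in scores]


# One query on a 2 x 4 image; the crop covers the right half (columns 2-3).
full = [[[0.9, 0.9, 0.4, 0.4],
         [0.1, 0.1, 0.4, 0.4]]]
crop = [[[0.8, 0.8],
         [0.2, 0.2]]]
masks = binarize(fuse_masks(full, crop, (0, 2, 2, 4)))
```

In this toy case the full-image prediction is uncertain in the right half, and the crop prediction decides which of those pixels end up inside the entity mask. Outside the crop box the full-image scores are used unchanged.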
Authors
Lu Qi, Jason Kuen, Weidong Guo, Tiancheng Shen, Jiuxiang Gu, Wenbo Li, Jiaya Jia, Zhe Lin, Ming-Hsuan Yang