Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling