Multi-modal Vision Transformers for Object Detection - 42Papers