Mask-Image-Language trimodal encoder - 42Papers