Image-and-Language Understanding from Pixels Only - 42Papers