OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
The proposed model unifies a diverse set of tasks and modalities in a simple sequence-to-sequence learning framework based on the encoder-decoder architecture.
The proposed model performs pretraining and finetuning with task instructions and introduces no extra task-specific layers for finetuning.
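To make the task-instruction idea concrete, below is a minimal PyTorch sketch, not the paper's actual implementation: a single encoder-decoder with one shared output head serves every task, and tasks are distinguished only by the instruction prepended to the input sequence. All names here (`ToySeq2Seq`, `encode`, the toy hash-based tokenizer, the `<img_tokens>` placeholder) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToySeq2Seq(nn.Module):
    """One encoder-decoder shared across tasks; no task-specific layers or heads."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        # A single output head over a shared vocabulary, reused for every task.
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        src = self.embed(src_ids)
        tgt = self.embed(tgt_ids)
        hidden = self.transformer(src, tgt)
        return self.lm_head(hidden)  # logits over the shared vocabulary

def encode(text, length=16):
    """Toy stand-in for a real tokenizer: maps words to ids and pads."""
    ids = [hash(tok) % 1000 for tok in text.split()][:length]
    ids += [0] * (length - len(ids))
    return torch.tensor([ids])

model = ToySeq2Seq()
# Different tasks differ only in the instruction prefix of the input;
# the architecture and weights stay identical across tasks.
captioning_src = encode("what does the image describe? <img_tokens>")
vqa_src = encode("what color is the car? <img_tokens>")
tgt = encode("a red car parked on the street")
logits = model(captioning_src, tgt)
print(logits.shape)  # torch.Size([1, 16, 1000])
```

Because every task decodes tokens from one shared vocabulary, adding a new task in this scheme requires only a new instruction string rather than a new task-specific head, which is the property the abstract highlights.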
Experimental results show that the proposed model achieves new state-of-the-art results on a series of multimodal tasks, including image captioning and visual question answering (test-std acc. …).
Authors
Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, Hongxia Yang