UniT: A Unified Transformer Model for Multi-Task Learning
Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer
We propose UniT, a unified transformer model that simultaneously learns the most prominent tasks across different domains, ranging from object detection to natural language understanding and multimodal reasoning.
Based on the transformer encoder-decoder architecture, our model encodes each input modality with its own encoder and makes predictions for each task with a single shared decoder over the encoded input representations, followed by task-specific output heads.
The entire model is jointly trained end-to-end with losses from each task.
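As a concrete illustration of this layout, the following is a minimal PyTorch sketch of per-modality encoders feeding a shared decoder queried with task-specific embeddings, followed by task-specific heads, with a toy joint training step that sums per-task losses. All module names, sizes, task labels, and the placeholder losses are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch (assumptions, not the official UniT code): modality-specific
# encoders, one shared transformer decoder, task-specific queries and heads.
import torch
import torch.nn as nn


class UnifiedMultiTaskTransformer(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_dec_layers=6,
                 num_queries=32, task_output_dims=None):
        super().__init__()
        # One encoder per input modality (stand-ins for the image and text encoders).
        self.encoders = nn.ModuleDict({
            "vision": nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), 2),
            "text": nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), 2),
        })
        # A single decoder shared across all tasks.
        self.shared_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True),
            n_dec_layers)
        # Learned task-specific query embeddings and output heads (hypothetical tasks/dims).
        task_output_dims = task_output_dims or {"detection": 91, "vqa": 3129}
        self.task_queries = nn.ParameterDict({
            t: nn.Parameter(torch.randn(num_queries, d_model))
            for t in task_output_dims})
        self.task_heads = nn.ModuleDict({
            t: nn.Linear(d_model, dim) for t, dim in task_output_dims.items()})

    def forward(self, task, features_by_modality):
        # Encode each provided modality, then concatenate along the sequence dimension.
        encoded = [self.encoders[m](x) for m, x in features_by_modality.items()]
        memory = torch.cat(encoded, dim=1)
        batch = memory.size(0)
        queries = self.task_queries[task].unsqueeze(0).expand(batch, -1, -1)
        decoded = self.shared_decoder(queries, memory)
        return self.task_heads[task](decoded)


# Toy joint training step: run a batch for each task and sum the losses.
if __name__ == "__main__":
    model = UnifiedMultiTaskTransformer()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    vision = torch.randn(2, 100, 256)   # dummy pre-embedded image features
    text = torch.randn(2, 20, 256)      # dummy token embeddings
    logits_det = model("detection", {"vision": vision})
    logits_vqa = model("vqa", {"vision": vision, "text": text})
    loss = logits_det.pow(2).mean() + logits_vqa.pow(2).mean()  # placeholder losses
    loss.backward()
    optimizer.step()
```

In practice the per-task losses would be the usual detection and classification objectives rather than the placeholders above; the key point is that all tasks backpropagate through the same shared decoder parameters.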
In our experiments, we jointly learn 7 tasks over 8 datasets, achieving performance comparable to well-established prior work in each domain under the same supervision, with a compact set of shared model parameters.