MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning
Instruction tuning, a new learning paradigm that fine-tunes pre-trained language models on tasks specified through instructions, has shown promising zero-shot performance on various natural language processing tasks.
However, it has not yet been explored for vision and multimodal tasks.
In this work, we introduce MultiInstruct, the first multimodal instruction tuning benchmark dataset, consisting of 47 diverse multimodal tasks covering 11 broad categories.
Each task is designed with at least 5,000 instances (input-output pairs) from existing open-source datasets and 5 expert-written instructions.
Experimental results demonstrate strong zero-shot performance on unseen multimodal tasks and the benefit of transfer learning from text-only instruction datasets.
We also design a new evaluation metric, Sensitivity, to evaluate how sensitive the model is to the variety of instructions.
Our results indicate that the model becomes less sensitive to variations in the instructions after fine-tuning on a diverse set of tasks with multiple instructions per task.
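As a rough illustration of how such a metric could be computed, the sketch below treats sensitivity as the dispersion of a model's scores across the different instruction wordings of each task, averaged over tasks; the function name, the std/mean formulation, and the example numbers are assumptions for illustration rather than the paper's exact definition.

```python
import statistics

def sensitivity(scores_per_task):
    """Hypothetical sensitivity computation.

    scores_per_task maps a task name to the list of evaluation scores the
    model obtains when the same task is phrased with different instructions
    (e.g., the 5 expert-written instructions per task).
    A lower value means the model's performance varies less across
    instruction wordings, i.e., it is less sensitive to them.
    """
    per_task = []
    for task, scores in scores_per_task.items():
        mean = statistics.mean(scores)
        std = statistics.pstdev(scores)
        # Dispersion (coefficient of variation) of scores across instructions.
        per_task.append(std / mean if mean > 0 else 0.0)
    # Average the per-task dispersion over all evaluation tasks.
    return statistics.mean(per_task)

# Example: two unseen tasks, each evaluated with 5 instruction wordings.
example = {
    "visual_entailment": [52.1, 50.8, 51.5, 49.9, 52.4],
    "text_vqa":          [18.3, 17.9, 18.8, 16.5, 18.1],
}
print(f"sensitivity = {sensitivity(example):.4f}")
```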
Instruction tuning has achieved significant success in zero-shot learning on natural language processing tasks.
By fine-tuning a pre-trained language model on tasks described through instructions, instruction tuning allows the model to learn to understand and follow instructions to make predictions on unseen tasks.
In this work, we propose MultiInstruct, the first benchmark dataset for multimodal instruction tuning, with 47 diverse tasks from 11 broad categories that require visual understanding and multimodal reasoning.
We formulate all the tasks into a unified sequence-to-sequence format in which the input text, images, instructions, and bounding boxes are represented in the same token space.
For each task, we create at least 5,000 instances and 5 instructions that are manually written by two experts in natural language processing.
Experimental results demonstrate strong zero-shot performance on various unseen multimodal tasks with instruction tuning and the potential of further improving it by leveraging large-scale text-only instruction datasets.
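To make the unified sequence-to-sequence format above more concrete, the sketch below shows one plausible way an instance could be assembled: an expert-written instruction template is filled with the instance's fields, and bounding-box coordinates are quantized into discrete location tokens so that regions share the token space with the text and instructions. The template, token names (<bin_k>), and helper functions are illustrative assumptions, not the exact MultiInstruct implementation.

```python
def quantize_box(box, image_w, image_h, num_bins=1000):
    """Map pixel coordinates to discrete location tokens so that regions
    live in the same token space as ordinary text (assumed scheme)."""
    x0, y0, x1, y1 = box
    def to_token(value, extent):
        bin_id = min(int(value / extent * num_bins), num_bins - 1)
        return f"<bin_{bin_id}>"
    return " ".join([
        to_token(x0, image_w), to_token(y0, image_h),
        to_token(x1, image_w), to_token(y1, image_h),
    ])

def build_instance(instruction_template, image_id, region=None,
                   image_w=640, image_h=480, **fields):
    """Fill an expert-written instruction template with the task-specific
    fields of one instance, yielding a single text sequence as model input."""
    if region is not None:
        fields["region"] = quantize_box(region, image_w, image_h)
    return {"image": image_id,
            "input": instruction_template.format(**fields)}

# Example: a grounded-captioning-style instance with a hypothetical template.
template = ("Describe the object located at {region} in the image "
            "in one sentence.")
instance = build_instance(template, image_id="coco_000000123.jpg",
                          region=(34, 50, 210, 300))
print(instance["input"])
```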
Results
Multimodal instruction tuning aims to provide instruction guidance and improve zero-shot performance on multimodal tasks.
One key question is how to effectively leverage large-scale text-only instruction datasets to further enhance instruction following and zero-shot performance on the unseen evaluation tasks and metrics.
Here we show that simply fine-tuning the original pre-trained multimodal language model (OFA) on the text-only instruction dataset alone actually degrades the model's zero-shot performance on multimodal tasks.
Moreover, training the model on a mixture of the text-only instruction dataset and MultiInstruct does not lead to the same level of performance improvement as training only on MultiInstruct, which is possibly due to the imbalance of training tasks between the two datasets.
These results suggest that the much larger text-only instruction datasets hold potential benefit for multimodal instruction tuning, but that realizing it requires more than naively mixing the two datasets.
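The comparison above involves fine-tuning on a mixture of the multimodal and text-only instruction data; the sketch below shows one simple way such a mixture could be sampled, with a tunable ratio intended to counteract the task imbalance mentioned above. The dataset variables, sampling ratio, and toy instances are placeholders, not the configuration used in the paper.

```python
import random

def mixed_instruction_stream(multimodal_data, text_only_data,
                             multimodal_ratio=0.5, seed=0):
    """Yield training instances drawn from two instruction datasets.

    Sampling by a fixed ratio (rather than concatenating the raw datasets)
    is one simple way to keep the much larger text-only dataset from
    dominating the multimodal tasks during instruction tuning.
    """
    rng = random.Random(seed)
    while True:
        source = (multimodal_data if rng.random() < multimodal_ratio
                  else text_only_data)
        yield rng.choice(source)

# Example with toy instances standing in for the two datasets.
multimodal = [{"task": "vqa", "input": "...", "image": "img1.jpg"}]
text_only = [{"task": "qa", "input": "...", "image": None}]
stream = mixed_instruction_stream(multimodal, text_only)
batch = [next(stream) for _ in range(4)]
print([ex["task"] for ex in batch])
```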