A special-purpose learning system assumes knowledge of admissible tasks at
design time. Adapting such a system to unforeseen tasks requires architecture
manipulation such as adding an output head for each new task or dataset. In
this work, we propose a task-agnostic vision-language system that accepts an
image and a natural language task description and outputs bounding boxes,
confidences, and text. The system supports a wide range of vision tasks, including
classification, localization, question answering, and captioning. We
evaluate the system's ability to learn multiple skills simultaneously, to
perform tasks with novel skill-concept combinations, and to learn new skills
efficiently and without forgetting.