Perceiver IO: A General Architecture for Structured Inputs & Outputs
The recently proposed Perceiver model obtains good results on several domains
(images, audio, multimodal, point clouds) while scaling linearly in compute and
memory with the input size. While the Perceiver supports many kinds of inputs,
it can only produce very simple outputs such as class scores. Perceiver IO
overcomes this limitation without sacrificing the original's appealing
properties by learning to flexibly query the model's latent space to produce
outputs of arbitrary size and semantics. Perceiver IO still decouples model
depth from data size, and its compute and memory still scale linearly, but now
with respect to both input and output sizes. The full Perceiver IO model achieves
strong results on tasks with highly structured output spaces, such as natural
language and visual understanding, StarCraft II, and multi-task and multi-modal
domains. As highlights, Perceiver IO matches a Transformer-based BERT baseline
on the GLUE language benchmark without the need for input tokenization and
achieves state-of-the-art performance on Sintel optical flow estimation.
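The core mechanism can be sketched compactly: a small, fixed-size latent array cross-attends to the inputs (encode), is refined by self-attention (process), and is then read out by an output query array, with one query per desired output element (decode). Below is a minimal, illustrative JAX sketch of this encode/process/decode structure, not the authors' implementation: it omits learned projections, multi-head attention, MLP blocks, and normalization, and all names (`attend`, `perceiver_io`, the shapes chosen) are hypothetical. It shows why cost is linear in both input size M and output size O, while depth lives in the latent stack.

```python
import jax
import jax.numpy as jnp

def attend(queries, keys_values):
    """Single-head scaled dot-product cross-attention (no learned
    projections, for brevity). queries: [Q, D], keys_values: [KV, D] -> [Q, D]."""
    scores = queries @ keys_values.T / jnp.sqrt(queries.shape[-1])  # [Q, KV]
    return jax.nn.softmax(scores, axis=-1) @ keys_values           # [Q, D]

def perceiver_io(inputs, latents, output_queries, num_process_layers=4):
    """Encode -> process -> decode, following the Perceiver IO pattern.

    inputs:         [M, D] input array (M may be large).
    latents:        [N, D] latent array (N is small and fixed).
    output_queries: [O, D] one query per desired output element.

    Cost: O(M*N) to encode, O(N^2) per processing layer, O(O*N) to
    decode -- linear in both input size M and output size O.
    """
    z = attend(latents, inputs)        # encode: latents attend to inputs
    for _ in range(num_process_layers):
        z = attend(z, z)               # process: self-attention over latents
    return attend(output_queries, z)   # decode: queries attend to latents

# Toy shapes: 10,000 inputs, 256 latents, 2,048 outputs.
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
D = 64
inputs = jax.random.normal(k1, (10_000, D))
latents = jax.random.normal(k2, (256, D))
output_queries = jax.random.normal(k3, (2_048, D))
print(perceiver_io(inputs, latents, output_queries).shape)  # (2048, 64)
```

In this sketch, output size and semantics are controlled entirely by the query array: swapping in a different number of queries (or differently constructed ones, e.g. per-pixel or per-token features) changes what the model produces without touching the encoder or the latent stack, which is the property the abstract describes.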
Authors
Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, João Carreira