We propose Token Turing Machines (TTM), a sequential, autoregressive
Transformer model with memory for real-world sequential visual understanding.
Our model is inspired by the seminal Neural Turing Machine, and has an external
memory consisting of a set of tokens which summarise the previous history
(i.e., frames). This memory is efficiently addressed, read and written using a
Transformer as the processing unit/controller at each step. The model's memory
module ensures that a new observation will only be processed with the contents
of the memory (and not the entire history), meaning that it can efficiently
process long sequences with a bounded computational cost at each step. We show
that TTM outperforms alternatives, such as Transformer models designed for
long sequences and recurrent neural networks, on two real-world
sequential visual understanding tasks: online temporal activity detection from
videos and vision-based robot action policy learning.
Authors
Michael S. Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, Anurag Arnab
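
The read, process, write loop described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the memory is addressed, so everything below is an assumption made for illustration: the function names (summarize, process, ttm_step, run), the softmax-weighted pooling used for reading and writing, the random scoring vectors standing in for learned parameters, and the tanh placeholder standing in for the Transformer controller. The only point it aims to show is that the per-step cost depends on the memory size and the current input, not on the length of the history.

```python
import math
import random


def summarize(tokens, weights):
    """Pool n tokens into len(weights) tokens using softmax-weighted averages.

    `weights` is a hypothetical stand-in for a learned scoring function:
    one scoring vector per output token.
    """
    dim = len(tokens[0])
    pooled = []
    for w in weights:
        scores = [sum(wi * ti for wi, ti in zip(w, t)) for t in tokens]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        alphas = [e / total for e in exps]
        pooled.append(
            [sum(a * t[j] for a, t in zip(alphas, tokens)) for j in range(dim)]
        )
    return pooled


def process(tokens):
    """Placeholder for the Transformer controller (assumption: elementwise tanh)."""
    return [[math.tanh(x) for x in t] for t in tokens]


def ttm_step(memory, inputs, read_w, write_w):
    """One step: read from memory + current input, process, write a new memory."""
    read = summarize(memory + inputs, read_w)              # r tokens
    output = process(read)                                 # r tokens
    new_memory = summarize(memory + output + inputs, write_w)  # m tokens
    return new_memory, output


def run(num_steps=10, m=4, r=2, n=5, dim=3, seed=0):
    """Run the loop on random data and record the memory size after each step."""
    rng = random.Random(seed)

    def rand_tokens(count):
        return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(count)]

    memory = [[0.0] * dim for _ in range(m)]
    read_w, write_w = rand_tokens(r), rand_tokens(m)
    sizes, output = [], []
    for _ in range(num_steps):
        inputs = rand_tokens(n)  # tokens of the current frame only
        memory, output = ttm_step(memory, inputs, read_w, write_w)
        sizes.append(len(memory))
    return memory, output, sizes
```

In this sketch each step touches at most m + r + n tokens, so running it for more steps grows the total cost linearly while the cost of any single step stays fixed.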