We introduce Token Merging (ToMe), a simple method to increase the throughput of existing ViT models without needing to train.
ToMe gradually combines similar tokens in a transformer using a general and lightweight matching algorithm that is as fast as pruning while being more accurate.
Off-the-shelf, ToMe can 2x the throughput of state-of-the-art models on images and 2.2x the throughput of state-of-the-art models on video with only a 0.2-0.3% accuracy drop in each case.
Training with ToMe further minimizes the accuracy drop, leading to 2x the throughput of state-of-the-art models on audio for only a 0.4% mAP drop.
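For intuition, below is a minimal PyTorch-style sketch of one token-merging step, not the authors' implementation. It follows the idea described above: tokens are split into two alternating sets, each token in the first set is matched to its most similar token in the second, and the r best-matched pairs are averaged together. The function name `bipartite_merge` and the use of raw token features as the similarity signal are simplifications for illustration.

```python
import torch
import torch.nn.functional as F


def bipartite_merge(x: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most similar token pairs in x of shape [batch, tokens, channels]."""
    B, N, C = x.shape

    # Split tokens into two alternating sets A and B.
    a, b = x[:, ::2, :], x[:, 1::2, :]

    # Cosine similarity between every token in A and every token in B.
    scores = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(-1, -2)

    # For each token in A, find its most similar token in B, then rank those edges.
    node_max, node_idx = scores.max(dim=-1)
    edge_order = node_max.argsort(dim=-1, descending=True)
    merged_idx, kept_idx = edge_order[:, :r], edge_order[:, r:]

    # Tokens in A that are not among the r best matches survive unchanged.
    kept_a = a.gather(1, kept_idx.unsqueeze(-1).expand(-1, -1, C))

    # Average each merged A-token into its matched B-token.
    src = a.gather(1, merged_idx.unsqueeze(-1).expand(-1, -1, C))
    dst_idx = node_idx.gather(1, merged_idx).unsqueeze(-1).expand(-1, -1, C)
    b = b.scatter_reduce(1, dst_idx, src, reduce="mean", include_self=True)

    # N - r tokens remain.
    return torch.cat([kept_a, b], dim=1)


# Example: merging 16 token pairs in a batch of ViT-style patch tokens.
tokens = torch.randn(2, 196, 768)
print(bipartite_merge(tokens, r=16).shape)  # torch.Size([2, 180, 768])
```

A step like this would run inside each transformer block so that tokens are reduced gradually across the network; the sketch omits details such as matching on attention keys and weighting merges by how many original tokens each merged token represents.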
Authors
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman