Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences
We propose \textit{Diffuser}, a new state-of-the-art efficient transformer for long sequence modeling.
The key idea is to expand the receptive field of sparse attention using attention diffusion, which computes multi-hop token correlations based on all paths between the corresponding disconnected tokens, in addition to attention among neighboring tokens.
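As a concrete illustration of attention diffusion (our own sketch under simplifying assumptions, not the paper's implementation), the snippet below expands a row-normalized sparse attention matrix $A$ with a truncated power series $\sum_{k} \theta_k A^k$, so that any pair of tokens connected by a path of length at most $K$ receives a nonzero weight. The geometric weights $\theta_k = \alpha(1-\alpha)^k$ and the dense NumPy representation are assumptions made for readability.

```python
import numpy as np

def attention_diffusion(attn, num_hops=3, alpha=0.5):
    """Expand a sparse attention matrix with multi-hop diffusion.

    attn     : (n, n) row-normalized sparse attention weights (1-hop).
    num_hops : number of diffusion steps K.
    alpha    : decay factor; theta_k = alpha * (1 - alpha) ** k is one
               common geometric weighting (an assumption here, not
               necessarily the paper's exact choice).
    Returns an (n, n) matrix whose (i, j) entry aggregates all paths
    of length <= K between tokens i and j.
    """
    n = attn.shape[0]
    diffused = np.zeros_like(attn)
    hop = np.eye(n)                      # A^0: each token attends to itself
    for k in range(num_hops + 1):
        theta_k = alpha * (1.0 - alpha) ** k
        diffused += theta_k * hop        # accumulate k-hop correlations
        hop = hop @ attn                 # A^(k+1) = A^k @ A
    # re-normalize rows so the result is again a stochastic attention matrix
    return diffused / diffused.sum(axis=1, keepdims=True)
```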
Evaluation results show that Diffuser achieves improvements of 0.94% on average on text classification tasks and 2.30% on the Long Range Arena (LRA) benchmark, with memory savings compared to state-of-the-art baselines, demonstrating the superiority of Diffuser in both expressiveness and efficiency.
Transformers designed for sequential data have revolutionized the field of natural language processing (NLP) and have recently made a tremendous impact in graph learning and computer vision.
However, the self-attention used by regular transformers incurs quadratic time and memory complexity $\mathcal{O}(n^2)$ for an input sequence of length $n$, which prevents the application of transformers to longer sequences in practical settings with limited computational resources.
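As a rough back-of-the-envelope illustration (the head count and float precision below are assumed, not taken from any specific model), the attention score matrix alone holds $n \times n$ entries per head, so its memory footprint grows quadratically with sequence length:

```python
# Illustrative memory cost of full self-attention: the score matrix
# Q @ K^T is n x n per head, so memory grows as O(n^2).
def attention_score_memory_mb(seq_len, num_heads=12, bytes_per_float=4):
    return num_heads * seq_len * seq_len * bytes_per_float / 1e6

for n in (512, 4096, 16384):
    print(n, f"{attention_score_memory_mb(n):,.0f} MB")
# 512   ->     ~13 MB
# 4096  ->    ~805 MB
# 16384 -> ~12,885 MB
```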
Recently, many efficient transformers have emerged to reduce this computational cost.
However, the improved efficiency comes at the cost of expressiveness, owing to the following challenges:
A sparse-attention layer attends only to neighboring tokens, which slows information propagation through the attention graph.
Consequently, to model crucial long-range correlations, sparse transformers require more layers to expand the receptive field than full-attention models.
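A rough worked example of this effect, with a hypothetical window size and sequence length: if each sliding-window layer extends a token's receptive field by roughly $w$ positions, then covering a length-$n$ sequence takes on the order of $n / w$ layers, versus a single full-attention layer.

```python
import math

def layers_needed(seq_len, window):
    """Approximate number of sliding-window attention layers needed for one
    token's receptive field to cover the whole sequence, assuming each layer
    extends the field by roughly `window` positions (i.e. ceil(n / w))."""
    return math.ceil(seq_len / window)

print(layers_needed(16384, 512))   # ~32 sparse layers vs. 1 full-attention layer
```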
Some existing works address the slow propagation by introducing global attention for a few important tokens, which partially alleviates the long-range interaction problem.
However, such sparsity-based approaches can be lossy, or even misleading, in capturing important token correlations when the corresponding tokens are not directly connected.
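For reference, the sketch below builds such a local-plus-global sparse pattern (in the spirit of Longformer/BigBird-style masks; the exact pattern and parameters are illustrative assumptions, not Diffuser's design). Any token pair outside the window and the global set is disconnected within a single layer and can only interact indirectly through intermediate tokens.

```python
import numpy as np

def local_global_mask(seq_len, window=4, global_tokens=(0,)):
    """Boolean mask for a sparse attention pattern: each token attends to a
    local window plus a few designated global tokens, and global tokens
    attend everywhere. Pairs outside this pattern never interact directly
    within a single layer."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    half = window // 2
    for i in range(seq_len):
        lo, hi = max(0, i - half), min(seq_len, i + half + 1)
        mask[i, lo:hi] = True            # local sliding window
    for g in global_tokens:
        mask[g, :] = True                # global token attends to all
        mask[:, g] = True                # all tokens attend to the global token
    return mask
```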
In this work, we propose Diffuser, an efficient universal approximator for long sequence modeling that applies multi-hop attention diffusion.
We theoretically show that Diffuser is a more efficient universal approximator for sequence modeling and enjoys better expander properties from a graph spectral perspective.
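One informal way to make the expander view concrete (our own sketch, not the paper's analysis) is to measure the spectral gap $1 - \lambda_2$ of the random-walk matrix induced by an attention pattern: a larger gap corresponds to faster multi-hop mixing and hence a better expander.

```python
import numpy as np

def spectral_gap(mask):
    """Spectral gap 1 - lambda_2 of the random-walk matrix D^{-1} A built
    from a boolean attention pattern; a larger gap indicates faster
    multi-hop information mixing (a better expander)."""
    A = mask.astype(float)
    P = A / A.sum(axis=1, keepdims=True)          # row-stochastic transition matrix
    eigvals = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return 1.0 - eigvals[1]                        # gap between the top two eigenvalues

# Example: compare a pure sliding-window pattern with one that also has a
# single global token (hypothetical sizes, for illustration only).
n, half = 64, 2
local = np.zeros((n, n), dtype=bool)
for i in range(n):
    local[i, max(0, i - half):min(n, i + half + 1)] = True
with_global = local.copy()
with_global[0, :] = True
with_global[:, 0] = True
print(spectral_gap(local), spectral_gap(with_global))
```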
Experimentally, we show that Diffuser achieves superior performance on language modeling, image modeling, and other long sequence modeling tasks.