Multi-head attention plays a crucial role in the recent success of
Transformer models, yielding consistent performance improvements over
conventional attention in various applications. The popul
The Transformer has achieved great success in NLP, serving as the basis for
advanced models such as BERT and GPT. However, the Transformer and its existing
variants may not be optimal in capturing token dis
The Transformer is a type of self-attention-based neural network originally
applied to NLP tasks. Recently, pure transformer-based models have been proposed to
solve computer vision problems. These visual tra
We use neural ordinary differential equations to formulate a variant of the
Transformer that is depth-adaptive in the sense that an input-dependent number
of time steps is taken by the ordinary differ
We present a framework that abstracts Reinforcement Learning (RL) as a
sequence modeling problem. This allows us to draw upon the simplicity and
scalability of the Transformer architecture, and associ
Modern neural sequence generation models are built to either generate tokens
step-by-step from scratch or (iteratively) modify a sequence of tokens bounded
by a fixed length. In this work, we develop
Feature interactions across space and scales underpin modern visual recognition systems because they introduce beneficial visual contexts. Conventionally, spatial contexts are passively hidden in the
Vision-and-Language Pretraining (VLP) has improved performance on various
joint vision-and-language downstream tasks. Current approaches for VLP heavily
rely on image feature extraction processes, mos
Image generation has been successfully cast as an autoregressive sequence
generation or transformation problem. Recent work has shown that self-attention
is an effective way of modeling textual sequen
Transformer-based models have achieved state-of-the-art results in many natural language processing (NLP) tasks. The self-attention architecture allows us to combine information from all elements of a
Recent studies show that the Transformer has a strong capability for building
long-range dependencies, yet struggles to capture the high frequencies that
predominantly convey local information. To tackle
Large Transformer models yield impressive results on many tasks, but are
expensive to train, or even fine-tune, and so slow at decoding that their use
and study become out of reach. We address this p
Video transformers have recently emerged as an effective alternative to
convolutional networks for action classification. However, most prior video
transformers adopt either global space-time attentio
Modeling the parser state is key to good performance in transition-based
parsing. Recurrent Neural Networks considerably improved the performance of
transition-based systems by modeling the global st
We present a neat yet effective recursive operation on vision transformers
that can improve parameter utilization without involving additional parameters.
This is achieved by sharing weights across de
In this paper, we introduce a novel framework, called the multi-level multi-scale point transformer (MLMSPT), that works directly on irregular point clouds for representation learning.
Specifically, a point pyramid transformer models features at the diverse resolutions or scales we define, followed by a multi-level transformer module that aggregates contextual information from the different levels of each scale and enhances their interactions.
Transformer-based models show their effectiveness across multiple domains and
tasks. Self-attention makes it possible to combine information from all sequence
elements into context-aware representations. How
Convolutional Neural Networks define an exceptionally powerful class of
models, but are still limited by the lack of ability to be spatially invariant
to the input data in a computationally and parame