This technical report describes in detail the systems we submitted for subtask 1B of the
DCASE 2021 challenge, which concerns audiovisual scene classification.
They are essentially multi-source tr
We propose a suite of heterogeneous and flexible models, namely FlexiBERT, that have varied encoder layers with a diverse set of possible operations and different hidden dimensions throughout the network.
For better-posed surrogate
modeling in this expanded design space, we propose a new graph-similarity-based embedding scheme.
This paper presents a way of doing large-scale audio understanding without
traditional state-of-the-art neural architectures. Ever since the introduction
of deep learning for understanding audio signa
We introduce a simple and lightweight method to produce a class of human-readable, realistic adversarial examples for language models.
We perform exhaustive experiments with our algorithm on four transformer-based architectures, across a variety of downstream tasks, and under varying concentrations of said examples.
We theoretically predict the existence of an embedding rank bottleneck that limits the contribution of self-attention width to the transformer expressivity.
We empirically demonstrate the existence of this rank bottleneck and its implications for the depth-to-width interplay of transformer architectures, linking the architecture variability across domains to the often glossed-over use of different vocabulary sizes or embedding ranks in different domains.
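To make the claimed bottleneck concrete, the following minimal sketch (our own NumPy illustration, not code from the work above) shows that when the token representations entering a self-attention layer have rank r smaller than the model width, the layer's pre-residual output also has rank at most r, however wide the attention head is.

```python
# Minimal numerical illustration: low-rank inputs bound the rank of the
# pre-residual self-attention output, regardless of attention-head width.
import numpy as np

rng = np.random.default_rng(0)

n, d_model, r = 32, 64, 8          # sequence length, model width, embedding rank
d_head = 64                        # deliberately "wide" attention head

# Low-rank token representations: X = U @ V has rank <= r.
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d_model))

# Random projection matrices of a single attention head.
W_q = rng.standard_normal((d_model, d_head))
W_k = rng.standard_normal((d_model, d_head))
W_v = rng.standard_normal((d_model, d_model))

scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d_head)
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)           # row-wise softmax

out = A @ (X @ W_v)                          # pre-residual attention output

print("rank of X:  ", np.linalg.matrix_rank(X))    # r
print("rank of out:", np.linalg.matrix_rank(out))  # <= r, despite d_head = 64
```

Generically both printed ranks equal r (8 here), illustrating why widening the attention alone cannot lift the representation past the embedding rank.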
Short-term forecasting has become of utmost importance for improving the security
and reliability of wind energy production. This study focuses on multi-step
spatio-temporal wind speed forecasting for t
Very recently, a variety of vision transformer architectures for dense
prediction tasks have been proposed, and they show that the design of spatial
attention is critical to their success in these tasks
Recent progress in natural language processing has been driven by advances in
both model architecture and model pretraining. Transformer architectures have
facilitated building higher-capacity models
The Transformer architecture is superior to RNN-based models in computational
efficiency. Recently, GPT and BERT have demonstrated the efficacy of Transformer
models on various NLP tasks using pre-trained l
Modeling the parser state is key to good performance in transition-based
parsing. Recurrent Neural Networks considerably improved the performance of
transition-based systems by modeling the global st
We investigate the use of a pure transformer architecture (i.e., one with no convolutional backbone for feature extraction) for the problem of 2D body pose estimation.
We evaluate two pure transformer architectures on the COCO dataset.
Transformers have achieved remarkable performance in widespread fields,
including natural language processing, computer vision and graph mining.
However, in the knowledge graph representation, where t
A central mechanism in machine learning is to identify, store, and recognize patterns. How to learn, access, and retrieve such patterns is crucial in Hopfield networks and the more recent transformer architectures.
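As a concrete illustration of this shared retrieval mechanism, here is a minimal sketch (our own, assuming the standard softmax-based update used in modern Hopfield networks and in attention): a noisy query is mapped onto a softmax-weighted combination of stored patterns, which for a sufficiently large inverse temperature beta collapses onto the closest stored pattern.

```python
# Minimal sketch of softmax-based pattern retrieval: one update step maps a
# corrupted query to a weighted combination of the stored patterns.
import numpy as np

rng = np.random.default_rng(0)

d, num_patterns, beta = 64, 16, 8.0
X = rng.standard_normal((num_patterns, d))      # stored patterns (rows)
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalize for stability

# Query: a corrupted copy of stored pattern 3.
query = X[3] + 0.3 * rng.standard_normal(d)

# One retrieval step: xi_new = softmax(beta * X xi) @ X  (attention over patterns)
scores = beta * X @ query
weights = np.exp(scores - scores.max())
weights /= weights.sum()
retrieved = weights @ X

print("retrieved pattern index:", int(np.argmax(X @ retrieved)))  # -> 3
```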
Everyone wants to write beautiful and correct text, yet the lack of language
skills, experience, or hasty typing can result in errors. By employing recent
advances in transformer architectures, we
Transformer architectures for graphs emerged as an alternative to established techniques for machine learning with graphs, such as graph neural networks, often attributed to their ability to circumvent graph neural networks' shortcomings, such as over-smoothing and over-squashing.
Here, we derive a taxonomy of graph transformer architectures, bringing some order to this emerging field.
This document aims to be a self-contained, mathematically precise overview of transformer architectures and algorithms.
It covers what transformers are, how they are trained, what they are used for, their key architectural components, and a preview of the most prominent models.
While Transformer architectures have shown remarkable success, they are bound
to the computation of all pairwise interactions of input elements and thus
suffer from limited scalability. Recent work has
We propose a sparse attention scheme, dubbed k-NN attention, for boosting vision transformers.
Instead of involving all the image patches (tokens) in the attention matrix calculation, we only select the top-k most similar tokens from the keys for each query to compute the attention map.
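A minimal single-head sketch of this top-k selection (our own PyTorch simplification, ignoring multi-head projections, masking, and the paper's exact implementation details):

```python
# Simplified k-NN attention: each query attends only to its top-k keys.
import torch

def knn_attention(q, k, v, top_k):
    """q, k, v: (batch, tokens, dim); returns (batch, tokens, dim)."""
    scale = q.shape[-1] ** -0.5
    scores = q @ k.transpose(-2, -1) * scale         # (batch, tokens, tokens)

    # Keep only the top-k scores per query row; mask the rest to -inf.
    topk_vals, _ = scores.topk(top_k, dim=-1)
    threshold = topk_vals[..., -1, None]              # k-th largest per row
    scores = scores.masked_fill(scores < threshold, float("-inf"))

    attn = scores.softmax(dim=-1)                     # sparse attention map
    return attn @ v

# Example: 196 patch tokens of dimension 64, each attending to its 10 nearest keys.
q = torch.randn(2, 196, 64)
k = torch.randn(2, 196, 64)
v = torch.randn(2, 196, 64)
out = knn_attention(q, k, v, top_k=10)
print(out.shape)   # torch.Size([2, 196, 64])
```

Keeping only k entries per query row makes the resulting attention map sparse; in this simplified version, ties at the k-th score may retain a few extra entries.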
Motivated by the success of Transformers in natural language processing (NLP)
tasks, several attempts have been made (e.g., ViT and DeiT) to apply Transformers to
the vision domain. However, pure Transform