We propose FlexiBERT, a suite of heterogeneous and flexible models whose encoder layers draw from a diverse set of possible operations and may use different hidden dimensions throughout the network.
For better-posed surrogate
modeling in this expanded design space, we propose a new graph-similarity-based embedding scheme.
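To make the embedding idea concrete, the sketch below encodes candidate architectures as small labeled graphs, computes pairwise graph edit distances, and maps them into a low-dimensional space that a surrogate model can consume. This is a minimal illustration only: the operation names, hidden dimensions, and the choice of graph edit distance plus multidimensional scaling are assumptions made for the example, not the exact embedding scheme proposed here.

```python
# Minimal sketch: architectures as labeled graphs -> pairwise graph edit
# distances -> low-dimensional embeddings for a surrogate model.
import networkx as nx
import numpy as np
from sklearn.manifold import MDS

def arch_to_graph(layers):
    """Encode a sequence of (operation, hidden_dim) layers as a labeled path graph."""
    g = nx.DiGraph()
    for i, (op, dim) in enumerate(layers):
        g.add_node(i, label=f"{op}-{dim}")
        if i > 0:
            g.add_edge(i - 1, i)
    return g

# Three toy heterogeneous architectures with mixed operations and hidden sizes
# (illustrative placeholders, not the actual design space).
archs = [
    [("self-attn", 128), ("self-attn", 128), ("conv", 128)],
    [("self-attn", 256), ("conv", 256)],
    [("conv", 128), ("self-attn", 256), ("self-attn", 256), ("conv", 128)],
]
graphs = [arch_to_graph(a) for a in archs]

def node_match(a, b):
    return a["label"] == b["label"]

# Symmetric pairwise graph-edit-distance matrix with zero diagonal.
n = len(graphs)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = nx.graph_edit_distance(graphs[i], graphs[j], node_match=node_match)
        dist[i, j] = dist[j, i] = d

# Embed the distances into a 2-D space; these vectors feed the surrogate model.
emb = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(dist)
print(emb)
```

The resulting vectors can then be passed to any regression-based surrogate, e.g. a Gaussian process, to guide the search over the expanded design space.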
We theoretically predict the existence of an embedding rank bottleneck that limits the contribution of self-attention width to transformer expressivity.
We empirically demonstrate this rank bottleneck and its implications for the depth-to-width interplay of transformer architectures, linking architecture variability across domains to the often glossed-over use of different vocabulary sizes or embedding ranks in different domains.
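As a quick numerical check of this claim under assumed settings (random rank-deficient embeddings and deliberately wide query/key projections, chosen only for illustration), the snippet below shows that the rank of the attention score matrix is capped by the embedding rank rather than by the attention width.

```python
# Illustration of the rank bottleneck: if the input embeddings have rank r,
# the attention score matrix Q K^T has rank at most r, regardless of how wide
# the attention projections are.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, rank, d_attn = 32, 64, 8, 256

# Rank-deficient embeddings, e.g. from a small vocabulary or low embedding rank.
X = rng.normal(size=(seq_len, rank)) @ rng.normal(size=(rank, d_model))

# Very wide query/key projections.
W_q = rng.normal(size=(d_model, d_attn))
W_k = rng.normal(size=(d_model, d_attn))

scores = (X @ W_q) @ (X @ W_k).T           # (seq_len, seq_len) attention logits
print(np.linalg.matrix_rank(X))            # 8
print(np.linalg.matrix_rank(scores))       # still 8: bounded by the embedding rank
```

Increasing d_attn leaves both printed ranks unchanged, which is the bottleneck effect in miniature.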
Transformer architectures for graphs have emerged as an alternative to established techniques for machine learning with graphs, such as graph neural networks; their appeal is often attributed to their ability to circumvent graph neural networks' shortcomings, such as over-smoothing and over-squashing.
Here, we derive a taxonomy of graph transformer architectures, bringing some order to this emerging field.