A Bidirectional Gated State-Space Model for NLP Pretraining
Pretraining Without Attention
Recently developed routing layers based on state-space models (SSMs), combined with model architectures based on multiplicative gating, can substantially improve pretraining accuracy in natural language processing (NLP).
The proposed Bidirectional Gated SSM (BiGS) replicates benchmark pretraining results without attention and can be extended to long-form pretraining without approximation.
Authors
Junxiong Wang, Jing Nathan Yan, Albert Gu, Alexander M. Rush
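The abstract describes BiGS as combining bidirectional state-space routing with multiplicative gating in place of attention. The sketch below is a minimal, hypothetical PyTorch rendering of such a block, not the authors' released implementation: the toy diagonal-SSM scan stands in for the paper's S4-style state-space kernel, and the layer names, sizes, and exact gating arrangement are illustrative assumptions.

```python
# Minimal sketch (assumed structure, not the official BiGS code) of a
# bidirectional gated state-space block in PyTorch.
import torch
import torch.nn as nn


class DiagonalSSM(nn.Module):
    """Toy diagonal state-space layer: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""

    def __init__(self, dim: int):
        super().__init__()
        self.a = nn.Parameter(torch.zeros(dim))  # sigmoid(a) in (0, 1) keeps the recurrence stable
        self.b = nn.Parameter(torch.ones(dim))
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, length, dim)
        a = torch.sigmoid(self.a)
        h = torch.zeros_like(x[:, 0])
        ys = []
        for t in range(x.size(1)):  # sequential scan; real SSM layers use a convolutional kernel instead
            h = a * h + self.b * x[:, t]
            ys.append(self.c * h)
        return torch.stack(ys, dim=1)


class BiGSBlock(nn.Module):
    """Two SSM passes (left-to-right and right-to-left) combined by multiplicative gating."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj_v = nn.Linear(dim, dim)  # gating branch
        self.proj_f = nn.Linear(dim, dim)  # forward-SSM branch
        self.proj_b = nn.Linear(dim, dim)  # backward-SSM branch
        self.ssm_f = DiagonalSSM(dim)
        self.ssm_b = DiagonalSSM(dim)
        self.proj_u = nn.Linear(dim, dim)
        self.proj_out = nn.Linear(dim, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, length, dim)
        residual = x
        x = self.norm(x)
        v = self.act(self.proj_v(x))
        fwd = self.ssm_f(self.act(self.proj_f(x)))
        bwd = self.ssm_b(self.act(self.proj_b(x)).flip(1)).flip(1)  # second SSM runs right-to-left
        u = self.act(self.proj_u(fwd * bwd))       # multiplicative combination of both directions
        return self.proj_out(u * v) + residual     # gate, project, add residual connection


if __name__ == "__main__":
    block = BiGSBlock(dim=64)
    tokens = torch.randn(2, 16, 64)                # (batch, length, hidden)
    print(block(tokens).shape)                     # torch.Size([2, 16, 64])
```

Because each token's representation is produced by static sequence-mixing layers rather than pairwise attention, a block of this form scales linearly with sequence length, which is what allows the long-form pretraining mentioned in the abstract without approximation.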