Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity - 42Papers