Exploiting Locality in Matrix-matrix Multiplication

Accelerating Bandwidth-Bound Deep Learning Inference with Main-Memory Accelerators

A large fraction of datacenter processing cycles is spent on the matrix-matrix multiplication (GEMM) operations of the fully-connected MLP layers that dominate many inference tasks. We demonstrate the large potential of accelerating these small-batch GEMM operations with processing in main memory. We develop a novel GEMM execution flow and corresponding memory-side address-generation logic that exploit locality and enable long-running matrix-matrix multiplication kernels despite the complex address-mapping functions employed by the main-memory processor, which would otherwise destroy locality. Our evaluation of variants at the channel, device, and within-device levels, along with optimizations that balance parallelism benefits against data-distribution overheads, demonstrates lower minimum latency than a fast processor and greater throughput under strict query-latency constraints. End-to-end performance analysis of recent recommendation and language models shows that StepStone matrix-matrix multiplication outperforms a fast processor (by up to) and prior main-memory acceleration approaches (by up to compared to the best prior approach).
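To see why address mapping can destroy locality for memory-side GEMM, consider a hypothetical interleaving function (not the paper's actual mapping): memory controllers commonly XOR-fold address bits to select a channel, so a matrix row that is contiguous in physical memory is scattered round-robin across channels, and each memory-side processing unit sees only a strided fragment of the row. The sketch below, with an assumed 4-channel system and 64-byte cache lines, illustrates this scattering.

```python
# Hypothetical DRAM address-to-channel mapping (illustration only, not the
# mapping studied in the paper): XOR two fields of the physical address,
# a common interleaving trick that spreads traffic across channels.

def channel_of(addr: int, num_channels: int = 4) -> int:
    """Pick a channel by XOR-folding address bits above the 64B line offset."""
    lo = (addr >> 6) % num_channels    # bits just above the cache-line offset
    hi = (addr >> 14) % num_channels   # higher-order bits XORed in
    return lo ^ hi

# Eight consecutive 64-byte cache lines of one matrix row: each lands on a
# different channel in turn, so no single memory-side unit holds a full row.
lines = [0x1000 + 64 * i for i in range(8)]
print([channel_of(a) for a in lines])  # → [0, 1, 2, 3, 0, 1, 2, 3]
```

This is exactly the effect the memory-side address-generation logic must invert: to run a long GEMM kernel locally, each unit needs to enumerate the addresses that map to it, rather than walk the matrix in logical row order.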