Differentiable Dynamic Programming for Structured Prediction and Attention

Dynamic programming (DP) solves a variety of structured combinatorial
problems by iteratively breaking them down into smaller subproblems. In spite
of their versatility, DP algorithms are usually non-differentiable, which
hampers their use as a layer in neural networks trained by backpropagation. To
address this issue, we propose to smooth the max operator in the dynamic
programming recursion, using a strongly convex regularizer. This allows to
relax both the optimal value and solution of the original combinatorial
problem, and turns a broad class of DP algorithms into differentiable
operators. Theoretically, we provide a new probabilistic perspective on
backpropagating through these DP operators, and relate them to inference in
graphical models. We derive two particular instantiations of our framework, a
smoothed Viterbi algorithm for sequence prediction and a smoothed DTW algorithm
for time-series alignment. We showcase these instantiations on two structured
prediction tasks and on structured and sparse attention for neural machine
translation.