Abstract
Attention-based models such as Transformers involve pairwise interactions
between data points, modeled with a learnable attention matrix. Importantly,
this attention matrix is normalized with the SoftMax operator, which makes it
row-wise stochastic. In this paper, we propose instead to use Sinkhorn's
algorithm to make attention matrices doubly stochastic. We call the resulting
model a Sinkformer. We show that the row-wise stochastic attention matrices in
classical Transformers get close to doubly stochastic matrices as the number of
epochs increases, justifying the use of Sinkhorn normalization as an
informative prior. On the theoretical side, we show that, unlike the SoftMax
operation, this normalization makes it possible to understand the iterations of
self-attention modules as a discretized gradient flow for the Wasserstein
metric. We also show that, in the limit of infinitely many samples and when
rescaling both attention matrices and depth, Sinkformers operate a heat
diffusion. On the experimental side, we show that Sinkformers enhance model
accuracy in vision and natural language processing tasks. In particular, on 3D
shape classification, Sinkformers lead to a significant improvement.
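
To make the contrast between the two normalizations concrete, here is a minimal pure-Python sketch. It assumes a small square matrix of attention scores; the function names `softmax_rows` and `sinkhorn`, the fixed iteration count, and the toy scores are illustrative choices and are not taken from the paper's implementation. SoftMax normalizes rows only, while Sinkhorn's algorithm alternates row and column normalization of the exponentiated scores until the matrix is approximately doubly stochastic.

```python
import math


def softmax_rows(scores):
    """Row-wise SoftMax: every row of the result sums to 1."""
    out = []
    for row in scores:
        m = max(row)  # subtract the row max for numerical stability
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def sinkhorn(scores, n_iters=50):
    """Sinkhorn normalization of exp(scores) for a square score matrix.

    Alternates row and column rescaling. The iteration count is an
    assumption chosen for this toy example, not a value from the paper.
    """
    n = len(scores)
    m = max(max(row) for row in scores)
    K = [[math.exp(s - m) for s in row] for row in scores]
    for _ in range(n_iters):
        # Row step: rescale each row so that it sums to 1.
        new_K = []
        for row in K:
            row_sum = sum(row)
            new_K.append([v / row_sum for v in row])
        K = new_K
        # Column step: rescale each column so that it sums to 1.
        col_sums = [sum(K[i][j] for i in range(n)) for j in range(n)]
        K = [[K[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return K


# Hypothetical 3x3 attention scores, for illustration only.
scores = [[0.0, 1.0, 2.0],
          [1.0, 0.5, 0.0],
          [2.0, 0.0, 1.0]]

P = softmax_rows(scores)  # row-stochastic
S = sinkhorn(scores)      # approximately doubly stochastic
```

In this sketch the rows of `P` sum to 1 but its columns generally do not, whereas both the rows and the columns of `S` sum to 1 up to numerical tolerance. Stopping the loop after a single row step gives the same matrix as `softmax_rows`.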