202410091134
Status: #idea
Tags: #ai #deep_learning #transformers #llm

# Differential attention reduces noise in the attention map

Differential attention computes *two* attention maps per head. It does this by creating two query and key matrices, $Q_1, Q_2$ and $K_1, K_2$, each of dimensions $n \times d$, where $n$ is the number of tokens and $d$ is the hidden dimension of the model. The value matrix has dimensions $n \times 2d$, so it has twice the hidden dimension of the query and key matrices.

Differential attention is then computed as:

$$
(\text{softmax}(Q_1 K_1^T / \sqrt{d}) - \lambda \cdot \text{softmax}(Q_2 K_2^T / \sqrt{d})) \cdot V
$$

where $\lambda$ is a learned scalar parameter.

The dynamics of gradient descent encourage the two attention maps to pay attention to (i.e. assign high scores to) *different tokens in the input sequence*. This occurs because, if the two attention maps paid attention to the same tokens, their scores would cancel out in the difference and the model would not be able to make any meaningful predictions. Thus, the relevant tokens should receive very different scores in the two maps: one map will pay attention to some subset of relevant tokens while the other map ignores that same subset.

On the flip side, the attention scores given to irrelevant tokens (i.e. noise) should be roughly the same in each attention map. This holds assuming that, conditional on the input, the noise is random or independent in nature. If this assumption holds, then the difference of the attention maps pushes the scores of irrelevant tokens towards $0$, as required. A code sketch of a single head is given below.

![[Pasted image 20241009114318.png]]

---
# References

[[Differential Transformer]]
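
---
# Code sketch

A minimal PyTorch sketch of a single differential-attention head, following the formula above. This is an illustration, not the paper's reference implementation: the class name `DiffAttentionHead`, the parameters `d_model`, `d_head`, and `lambda_init`, and the treatment of $\lambda$ as a plain learned scalar are all assumptions made here for clarity.

```python
# Minimal sketch of one differential-attention head (illustrative, not the
# official implementation). Lambda is modeled as a single learned scalar,
# matching the note's description.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiffAttentionHead(nn.Module):
    def __init__(self, d_model: int, d_head: int, lambda_init: float = 0.5):
        super().__init__()
        # Two query/key projections of size d_head each, packed into one
        # linear layer; the value projection has twice that hidden dimension.
        self.q_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.k_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.v_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        # Learned scalar weighting the second attention map.
        self.lmbda = nn.Parameter(torch.tensor(lambda_init))
        self.d_head = d_head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d_model)
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)  # each (batch, n, d_head)
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)  # each (batch, n, d_head)
        v = self.v_proj(x)                        # (batch, n, 2 * d_head)

        scale = 1.0 / math.sqrt(self.d_head)
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * scale, dim=-1)

        # Differential attention: subtract the two maps, then apply V.
        return (a1 - self.lmbda * a2) @ v


if __name__ == "__main__":
    # Hypothetical shapes, just to show the head runs end to end.
    head = DiffAttentionHead(d_model=64, d_head=16)
    x = torch.randn(2, 10, 64)   # (batch, tokens, d_model)
    out = head(x)
    print(out.shape)             # torch.Size([2, 10, 32])
```

Note that the paper additionally reparameterizes $\lambda$ and normalizes each head's output; those details are omitted here to keep the sketch focused on the subtraction of the two attention maps.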