202410091153
Status: #idea
Tags: #ai #deep_learning #transformers #llm

# Differential attention improves performance on long contexts

As the context length grows, the standard transformer accumulates noise from the increasing number of irrelevant tokens. Suppose the score for our relevant token is $x_i$, and all other token scores $x_j$ with $j \neq i$ are irrelevant (i.e. noise). Since $x_i$ is unchanged as we append tokens to the end of the sequence (only previous tokens affect $x_i$ in causal self-attention), as the sequence length $n \to \infty$ we have

$$\lim_{n \to \infty} \text{softmax}(x)_i = \lim_{n \to \infty} \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}} = 0$$

for every $x_i$ in $X$. The attention weight on the relevant token is therefore diluted towards zero, so the traditional transformer struggles to pick out relevant tokens on very long contexts.

By contrast, the differential transformer pushes the attention weight assigned to noise tokens towards $0$: it computes two softmax attention maps and subtracts them, so the (nearly equal) weight both maps place on irrelevant tokens cancels. Modelling this by assuming that only the relevant index sets $K_1$ and $K_2$ (the relevant tokens for attention maps 1 and 2, respectively) contribute appreciable mass to each denominator, and writing $x^{(1)}$ and $x^{(2)}$ for the scores under the two maps, if token $i$ is relevant we have:

$$
\begin{align*}
\lim_{n \to \infty} \left[\text{softmax}(x^{(1)})_i - \lambda \cdot \text{softmax}(x^{(2)})_i\right]
&= \lim_{n \to \infty} \left[\frac{e^{x^{(1)}_i}}{\sum_{j=1}^n e^{x^{(1)}_j}} - \lambda \cdot \frac{e^{x^{(2)}_i}}{\sum_{j=1}^n e^{x^{(2)}_j}}\right]\\
&= \lim_{n \to \infty} \left[\frac{e^{x^{(1)}_i}}{\sum_{j \in K_1} e^{x^{(1)}_j}} - \lambda \cdot \frac{e^{x^{(2)}_i}}{\sum_{j \in K_2} e^{x^{(2)}_j}}\right]\\
&= \frac{e^{x^{(1)}_i}}{\sum_{j \in K_1} e^{x^{(1)}_j}} - \lambda \cdot \frac{e^{x^{(2)}_i}}{\sum_{j \in K_2} e^{x^{(2)}_j}}
\end{align*}
$$

Since the contribution of the irrelevant tokens is pushed towards $0$, we are left with only the tokens considered relevant (those in $K_1$ or $K_2$). The final expression does not depend on $n$, so the attention on the relevant token is no longer diluted as we add irrelevant tokens to the end of the sequence. (A minimal code sketch of this construction follows the references.)

![[Pasted image 20241009145213.png]]

---
# References
[[Differential Transformer]]
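
---
# Appendix: minimal sketch

To make the construction concrete, here is a minimal single-head sketch of the differential attention map in PyTorch. This is not the paper's full implementation (which splits heads, learns a reparameterised $\lambda$, and normalises each head's output); the function name, the fixed scalar `lam`, and the random weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def differential_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
    """Single-head differential attention (simplified, illustrative sketch).

    x:   (n, d_model) token representations
    Wq1, Wk1, Wq2, Wk2: (d_model, d_head) projections for the two attention maps
    Wv:  (d_model, d_head) value projection
    lam: subtraction weight lambda (learned in the paper; a fixed scalar here)
    """
    n, _ = x.shape
    d_head = Wq1.shape[1]
    scale = d_head ** -0.5

    # Two independent score matrices, one per attention map.
    s1 = (x @ Wq1) @ (x @ Wk1).T * scale   # (n, n)
    s2 = (x @ Wq2) @ (x @ Wk2).T * scale   # (n, n)

    # Causal mask: token i only attends to tokens j <= i.
    mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    s1 = s1.masked_fill(mask, float("-inf"))
    s2 = s2.masked_fill(mask, float("-inf"))

    # The differential map: weight that both softmaxes place on noise tokens cancels.
    attn = F.softmax(s1, dim=-1) - lam * F.softmax(s2, dim=-1)   # (n, n)

    return attn @ (x @ Wv)   # (n, d_head)

# Usage with random weights, just to show the shapes.
d_model, d_head, n = 64, 16, 10
x = torch.randn(n, d_model)
Wq1, Wk1, Wq2, Wk2, Wv = (torch.randn(d_model, d_head) * d_model ** -0.5 for _ in range(5))
out = differential_attention(x, Wq1, Wk1, Wq2, Wk2, Wv)
print(out.shape)  # torch.Size([10, 16])
```

The only change from standard attention is the second score matrix and the `softmax(s1) - lam * softmax(s2)` line: each softmax still normalises over all $n$ tokens, but the subtraction cancels the common noise component, matching the cancellation argument above.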