202410091452
Status: #idea
Tags: #ai #deep_learning #transformers #llm #information_retrieval
# Differential attention improves needle-in-haystack performance
We know that [[Differential attention reduces noise in the attention map]] and [[Differential attention improves performance on long contexts]]. Together, these mean that even with a large number of distractor tokens between the "needle" and the user prompt, the attention weight on the "needle" token(s) remains largely unaffected by the noise tokens.
![[Pasted image 20241009212452.png]]
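As a rough illustration, the core operation from [[Differential Transformer]] subtracts a second softmax attention map from the first, so that noise common to both maps cancels out. A minimal NumPy sketch (weight shapes and the fixed `lam` value here are illustrative assumptions, not the paper's full parameterization):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(X, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
    """Differential attention: (softmax(Q1 K1^T / sqrt(d)) - lam * softmax(Q2 K2^T / sqrt(d))) V.

    Subtracting the second map cancels attention noise shared by both maps,
    which is what preserves the weight on a distant "needle" token.
    """
    d = Wq1.shape[1]
    A1 = softmax((X @ Wq1) @ (X @ Wk1).T / np.sqrt(d))
    A2 = softmax((X @ Wq2) @ (X @ Wk2).T / np.sqrt(d))
    return (A1 - lam * A2) @ (X @ Wv)
```

In the paper, `lam` is a learned scalar rather than a fixed constant, and the two maps come from splitting the query/key projections of each head.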
---
# References
[[Differential Transformer]]