202410091452
Status: #idea
Tags: #ai #deep_learning #transformers #llm #information_retrieval

# Differential attention improves needle-in-haystack performance

We know that [[Differential attention reduces noise in the attention map]] and [[Differential attention improves performance on long contexts]]. Together, these ensure that, even with a large number of tokens between the "needle" and the user prompt, the attention weight on the "needle" token(s) remains largely unaffected by the noise tokens.

![[Pasted image 20241009212452.png]]

---
# References

[[Differential Transformer]]
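A minimal sketch of the mechanism behind this claim, assuming the differential attention form from [[Differential Transformer]] (two softmax attention maps subtracted with a learned weight λ; the weight matrices and λ here are illustrative, not the paper's trained values):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(x, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
    """Differential attention: subtract two softmax attention maps.

    Common-mode noise that appears in both maps cancels, so
    irrelevant ("haystack") tokens end up with near-zero weight
    while the relevant ("needle") token keeps its weight.
    """
    d = Wq1.shape[1]
    a1 = softmax((x @ Wq1) @ (x @ Wk1).T / np.sqrt(d))
    a2 = softmax((x @ Wq2) @ (x @ Wk2).T / np.sqrt(d))
    attn = a1 - lam * a2  # noise terms shared by a1 and a2 cancel
    return attn @ (x @ Wv)

# Toy usage with random weights (hypothetical shapes).
rng = np.random.default_rng(0)
n, d = 8, 16
x = rng.standard_normal((n, d))
W = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(5)]
out = diff_attention(x, *W)
print(out.shape)  # (8, 16)
```

Because the subtraction cancels attention mass that both maps assign to filler tokens, the needle's weight is set by its content match rather than diluted by context length.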