Published On Oct 8, 2024
Title: Differential Transformer
Link: https://arxiv.org/abs/2410.05258
Date: 7 Oct 2024
Authors: Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei
Summary
This paper introduces the Differential Transformer, a new architecture for large language models (LLMs) that addresses the issue of attention noise, where Transformers overallocate attention to irrelevant information. The authors propose a differential attention mechanism, which uses the difference between two softmax attention maps to cancel out noise and encourage models to focus on critical information. Experimental results demonstrate that the Differential Transformer outperforms traditional Transformers in various tasks, including language modelling, long-context modelling, information retrieval, hallucination mitigation, and in-context learning. Notably, the Differential Transformer also reduces activation outliers, which can be beneficial for model quantization. The paper concludes by highlighting the promising potential of the Differential Transformer as a foundation architecture for future advancements in LLMs.
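The differential attention idea described above can be sketched in a few lines: two independently projected attention maps are computed and subtracted, so noise that appears in both maps cancels. This is a minimal single-head NumPy sketch, not the paper's full formulation — the projection matrices, the fixed scalar `lam` (the paper learns its λ), and the omission of multi-head structure and normalization are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(X, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
    """Single-head differential attention sketch.

    Computes two softmax attention maps from separate query/key
    projections and uses their difference (weighted by lam) to
    attend over the values, cancelling common-mode attention noise.
    """
    d = Wq1.shape[1]  # head dimension for the 1/sqrt(d) scaling
    Q1, K1 = X @ Wq1, X @ Wk1
    Q2, K2 = X @ Wq2, X @ Wk2
    V = X @ Wv
    A1 = softmax(Q1 @ K1.T / np.sqrt(d))
    A2 = softmax(Q2 @ K2.T / np.sqrt(d))
    # Differential attention map: shared (noisy) mass cancels out.
    return (A1 - lam * A2) @ V

# Toy usage: sequence of 4 tokens with model dimension 8.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
Ws = [rng.standard_normal((8, 8)) for _ in range(5)]
out = diff_attention(X, *Ws)
```

Because each map is a proper softmax, subtracting them yields signed attention weights; the paper argues this differencing is what suppresses attention to irrelevant context.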
Key Topics
Differential Transformer, Attention Noise, Long-context Modelling, Key Information Retrieval, Contextual Hallucination