Published On Oct 8, 2024
Title: Differential Transformer
Link: https://arxiv.org/abs/2410.05258
Date: 7 Oct 2024
Authors: Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei
Summary
This paper introduces the Differential Transformer, a new architecture for large language models (LLMs) that addresses the issue of attention noise, where Transformers overallocate attention to irrelevant information. The authors propose a differential attention mechanism, which uses the difference between two softmax attention maps to cancel out noise and encourage models to focus on critical information. Experimental results demonstrate that the Differential Transformer outperforms traditional Transformers in various tasks, including language modelling, long-context modelling, information retrieval, hallucination mitigation, and in-context learning. Notably, the Differential Transformer also reduces activation outliers, which can be beneficial for model quantization. The paper concludes by highlighting the promising potential of the Differential Transformer as a foundation architecture for future advancements in LLMs.
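The differential attention idea described above can be sketched in a few lines: two independently projected attention maps are computed and subtracted, so noise that appears in both maps cancels. This is a minimal single-head NumPy sketch, not the paper's full formulation — the projection matrices, the fixed scalar `lam` (the paper learns its λ), and the omission of multi-head structure and normalization are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(X, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.5):
    """Single-head differential attention sketch.

    Computes two softmax attention maps from separate query/key
    projections and uses their difference (weighted by lam) to
    attend over the values, cancelling common-mode attention noise.
    """
    d = Wq1.shape[1]  # head dimension for the 1/sqrt(d) scaling
    Q1, K1 = X @ Wq1, X @ Wk1
    Q2, K2 = X @ Wq2, X @ Wk2
    V = X @ Wv
    A1 = softmax(Q1 @ K1.T / np.sqrt(d))
    A2 = softmax(Q2 @ K2.T / np.sqrt(d))
    # Differential attention map: shared (noisy) mass cancels out.
    return (A1 - lam * A2) @ V

# Toy usage: sequence of 4 tokens with model dimension 8.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
Ws = [rng.standard_normal((8, 8)) for _ in range(5)]
out = diff_attention(X, *Ws)
```

Because each map is a proper softmax, subtracting them yields signed attention weights; the paper argues this differencing is what suppresses attention to irrelevant context.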
Key Topics
Differential Transformer, Attention Noise, Long-context Modelling, Key Information Retrieval, Contextual Hallucination