- Title
SWattention: designing fast and memory-efficient attention for a new Sunway Supercomputer.
- Authors
Wu, Ruohan; Zhu, Xianyu; Chen, Junshi; Liu, Sha; Zheng, Tianyu; Liu, Xin; An, Hong
- Abstract
In the past few years, Transformer-based large language models (LLMs) have become the dominant technology in a range of applications. To scale up the sequence length of the Transformer, FlashAttention was proposed to compute exact attention with reduced memory requirements and faster execution. However, implementing the FlashAttention algorithm on the new-generation Sunway Supercomputer faces constraints such as the unique heterogeneous architecture and the limited memory bandwidth. This work proposes SWattention, a highly efficient method for computing exact attention on the SW26010pro processor. To fully utilize the 6 core groups (CGs) and 64 cores per CG on the processor, we design a two-level parallel task-partition strategy. Asynchronous memory access is employed to ensure that memory access overlaps with computation. Additionally, a tiling strategy is introduced to determine optimal SRAM block sizes. Compared with standard attention, SWattention achieves around 2.0x speedup for FP32 training and 2.5x speedup for mixed-precision training. Sequence lengths range from 1k to 8k and scale up to 16k without running out of memory. As for end-to-end performance, SWattention achieves up to 1.26x speedup for training GPT-style models, demonstrating that SWattention enables longer sequence lengths for LLM training.
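The abstract's core idea, tiling exact attention so the full score matrix never materializes, follows the FlashAttention approach. A minimal NumPy sketch of that tiling with an online softmax is shown below; this is an illustrative reconstruction of the general technique, not the paper's SWattention implementation (the block size and function names here are assumptions, and the real method targets SW26010pro SRAM and asynchronous memory access):

```python
import numpy as np

def naive_attention(Q, K, V):
    # Reference implementation: materializes the full N x N score matrix.
    S = (Q @ K.T) / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=64):
    # FlashAttention-style tiling: process K/V in blocks while keeping a
    # running row max `m` and normalizer `l` (online softmax), so only an
    # N x block score tile is ever held in fast memory at a time.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(Q)            # unnormalized output accumulator
    m = np.full(N, -np.inf)         # running per-row max of scores
    l = np.zeros(N)                 # running per-row softmax normalizer
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale                  # N x block score tile
        m_new = np.maximum(m, S.max(axis=-1))
        P = np.exp(S - m_new[:, None])
        alpha = np.exp(m - m_new)               # rescales earlier blocks
        l = alpha * l + P.sum(axis=-1)
        O = alpha[:, None] * O + P @ Vj
        m = m_new
    return O / l[:, None]
```

Because the rescaling factor `alpha` corrects the earlier partial sums whenever a new block raises the row maximum, the tiled result matches the naive computation exactly (up to floating-point error), which is what makes this an *exact* rather than approximate attention.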
- Subjects
LANGUAGE models; SUPERCOMPUTERS; ARTIFICIAL intelligence
- Publication
Journal of Supercomputing, 2024, Vol 80, Issue 10, p13657
- ISSN
0920-8542
- Publication type
Article
- DOI
10.1007/s11227-024-05890-8