- Title
A high-performance batched matrix multiplication framework for GPUs under unbalanced input distribution.
- Authors
Wang, Ruimin; Yang, Zhiwei; Xu, Hao; Lu, Lu
- Abstract
In the past few decades, general matrix multiplication (GEMM), a core component of the Basic Linear Algebra Subprograms (BLAS) library, has played a vital role in fields such as machine learning, image processing, and fluid dynamics. Because these fields tend to decompose a problem into many smaller sub-problems, today's BLAS libraries provide batched GEMM routines to achieve high performance in this scenario. MAGMA offers a vbatch routine that computes batched GEMM with variable sizes on the GPU, but unbalanced input causes some workgroups and threads to sit idle, degrading performance. Unbalanced input also disturbs load balancing across the GPU's Compute Units, and extreme inputs lead to underutilization of hardware resources. In this paper, we propose a high-performance batched GEMM computing framework for GPUs. For a large batch of small matrices with variable sizes and an unbalanced distribution, the proposed framework takes the hardware architecture and the likely data distribution into account and adopts three methods (flexible tile, sort-up, and split-down) to improve hardware utilization and achieve better load balancing. Experimental results show that our framework achieves a 3.02× performance improvement over the latest MAGMA implementation on an AMD Radeon Instinct MI50 GPU, and a 3.14× speedup on the MI100.
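To make the workload concrete, the following is a minimal CPU sketch of variable-size batched GEMM, the problem class the abstract describes. The naive kernel, the list-of-lists matrix representation, and the sort-by-work step (loosely analogous to the paper's "sort-up" idea of grouping similarly sized problems so concurrently launched workgroups do similar amounts of work) are illustrative assumptions, not the authors' implementation.

```python
def gemm(A, B):
    """Naive single GEMM: A is m x k, B is k x n (lists of lists)."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a = A[i][p]
            for j in range(n):
                C[i][j] += a * B[p][j]
    return C

def batched_gemm(problems):
    """Compute C_i = A_i @ B_i for a batch of independently sized problems.

    Processing problems in order of decreasing work (m*n*k) mimics grouping
    similar sizes together, so that on a GPU, workgroups launched side by
    side would perform comparable amounts of work (a load-balancing idea;
    hypothetical stand-in for the paper's sort-up method).
    """
    work = lambda i: (len(problems[i][0]) *          # m
                      len(problems[i][1]) *          # k
                      len(problems[i][1][0]))        # n
    order = sorted(range(len(problems)), key=work, reverse=True)
    results = [None] * len(problems)
    for i in order:
        A, B = problems[i]
        results[i] = gemm(A, B)
    return results   # results stay in the caller's original batch order
```

On a real GPU the batch would be dispatched as one kernel launch with per-problem tile sizes; this sketch only captures the variable-size semantics.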
- Subjects
MATRIX multiplications; GRAPHICS processing units; LINEAR algebra; FLUID dynamics; IMAGE processing; DATA distribution
- Publication
Journal of Supercomputing, 2022, Vol 78, Issue 2, p1741
- ISSN
0920-8542
- Publication type
Academic Journal
- DOI
10.1007/s11227-021-03936-9