- Title
Hierarchical parallelization and optimization of high-order stencil computations on multicore clusters.
- Authors
Dursun, Hikmet; Kunaseth, Manaschai; Nomura, Ken-ichi; Chame, Jacqueline; Lucas, Robert; Chen, Chun; Hall, Mary; Kalia, Rajiv; Nakano, Aiichiro; Vashishta, Priya
- Abstract
We present a scalable parallelization scheme for high-order stencil computations that also optimizes memory behavior on multicore clusters. Our multilevel approach combines: (i) inter-node parallelization via spatial decomposition; (ii) inter-core parallelization via multithreading and explicit non-uniform memory access (NUMA) control; (iii) data locality optimizations through auto-tuned tiling for efficient use of hierarchical memory; and (iv) register blocking and data parallelism via single-instruction multiple-data techniques to utilize registers and exploit data locality. The scheme is applied to a sixth-order stencil-based finite-difference time-domain code. Weak-scaling parallel efficiency is over 98% on 32,768 BlueGene/P processors. Multithreading with explicit NUMA control attains a 9.9-fold speedup on a dual 12-core AMD Opteron system. Data locality optimizations achieve a 7.7-fold reduction of the last-level cache miss rate on Intel Nehalem, whereas register blocking increases data parallelism and thereby achieves 5.9 Gflops on a single core. Combined register blocking and multithreading optimizations achieve a 5.8-fold speedup on a single quad-core Nehalem.
- Subjects
HIERARCHICAL storage management (Computers); MULTICORE processors; NON-uniform memory access; THREADS (Computer programs); RANDOM access memory; FINITE difference time domain method
- Publication
Journal of Supercomputing, 2012, Vol 62, Issue 2, p946
- ISSN
0920-8542
- Publication type
Article
- DOI
10.1007/s11227-012-0764-z
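- Illustrative sketch
The abstract refers to a sixth-order stencil and to tiling for data locality. The following is a minimal sketch of those two ideas only, not the authors' code: it assumes a one-dimensional periodic grid, uses the textbook sixth-order central-difference weights for the second derivative, and the names `second_derivative`, `second_derivative_tiled`, and `tile` are hypothetical. The tiled variant only changes the order in which grid points are visited; in pure Python it shows the loop structure rather than any real cache benefit.

```python
import math

# Textbook sixth-order central-difference weights for the second derivative,
# for offsets -3..3. These are an assumption for illustration; the record
# does not list the coefficients used in the paper.
COEFFS = (1 / 90, -3 / 20, 3 / 2, -49 / 18, 3 / 2, -3 / 20, 1 / 90)


def second_derivative(u, h):
    """Apply the 7-point stencil to a periodic 1D grid, point by point."""
    n = len(u)
    out = [0.0] * n
    for i in range(n):
        acc = 0.0
        for k, c in enumerate(COEFFS):
            acc += c * u[(i + k - 3) % n]  # wrap around at the boundaries
        out[i] = acc / (h * h)
    return out


def second_derivative_tiled(u, h, tile=16):
    """Same stencil, but the grid is visited in blocks of `tile` points."""
    n = len(u)
    out = [0.0] * n
    for start in range(0, n, tile):
        for i in range(start, min(start + tile, n)):
            acc = 0.0
            for k, c in enumerate(COEFFS):
                acc += c * u[(i + k - 3) % n]
            out[i] = acc / (h * h)
    return out


if __name__ == "__main__":
    n = 64
    h = 2 * math.pi / n
    u = [math.sin(i * h) for i in range(n)]
    d2u = second_derivative_tiled(u, h)
    # The exact second derivative of sin(x) is -sin(x).
    print(max(abs(a + b) for a, b in zip(d2u, u)))
```

In a compiled implementation the tile size would typically be chosen by measurement on the target machine, which is the role the abstract assigns to auto-tuning.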