Research Article · SC Conference Proceedings
DOI: 10.1145/3581784.3607051

DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector Multiplication

Published: 11 November 2023

Abstract

Sparse matrix-vector multiplication (SpMV) plays a key role in computational science and engineering, graph processing, and machine learning applications. Much prior work on SpMV has been devoted to problems such as random accesses to the vector x and load imbalance. However, we experimentally find that the inner-product computation itself still accounts for a large share of the overhead of the SpMV operation, a cost that has been largely ignored in existing work.
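To make this cost concrete, the sketch below shows a conventional scalar CSR SpMV kernel (one thread per row). It is an illustrative baseline written under common CUDA conventions, not code from DASP; the serial multiply-accumulate loop over each row's nonzeros is the inner-product computation whose overhead the paper highlights.

```cuda
// Illustrative baseline, not DASP: scalar CSR SpMV in FP64, one thread per row.
#include <cuda_runtime.h>

__global__ void csr_spmv_scalar(int n_rows,
                                const int    *row_ptr,  // CSR row pointers, length n_rows + 1
                                const int    *col_idx,  // column index of each nonzero
                                const double *val,      // value of each nonzero
                                const double *x,        // dense input vector
                                double       *y)        // dense output vector
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;

    double dot = 0.0;
    // The inner product of sparse row `row` with x. This loop runs serially on
    // the CUDA cores; DASP's idea is to reorganize rows into dense blocks so
    // that the MMA (tensor core) units perform these multiply-accumulates.
    for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
        dot += val[j] * x[col_idx[j]];
    y[row] = dot;
}
```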
In this paper, we propose DASP, a new algorithm that uses the dense matrix multiply-accumulate (MMA) units of modern GPUs to accelerate the compute part of general SpMV. We analyze the row-wise distribution of nonzeros and group the rows into three categories: long, medium, and short. We then organize them into small blocks whose sizes match the shapes required by MMA computation. For each of the three categories, DASP offers a dedicated strategy to complete SpMV while efficiently utilizing the MMA units.
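As a rough illustration of the grouping step, the host-side sketch below partitions CSR rows into long, medium, and short buckets by nonzero count. The thresholds LONG_MIN_NNZ and SHORT_MAX_NNZ are hypothetical placeholders; the paper derives its block sizes from the shapes the MMA units accept, which this sketch does not model.

```cuda
// A minimal host-side sketch of row grouping by nonzero count. The two
// thresholds are hypothetical; DASP chooses block sizes from the MMA shapes,
// which is not modeled here.
#include <vector>

struct RowGroups {
    std::vector<int> long_rows;    // many nonzeros: one row spans several MMA blocks
    std::vector<int> medium_rows;  // moderate nonzeros: one row per (partial) MMA block
    std::vector<int> short_rows;   // few nonzeros: several rows packed into one MMA block
};

RowGroups group_rows_by_nnz(const std::vector<int> &row_ptr)
{
    const int LONG_MIN_NNZ  = 512;  // hypothetical threshold
    const int SHORT_MAX_NNZ = 4;    // hypothetical threshold

    RowGroups g;
    const int n_rows = static_cast<int>(row_ptr.size()) - 1;
    for (int r = 0; r < n_rows; ++r) {
        const int nnz = row_ptr[r + 1] - row_ptr[r];  // nonzeros in row r
        if (nnz >= LONG_MIN_NNZ)       g.long_rows.push_back(r);
        else if (nnz <= SHORT_MAX_NNZ) g.short_rows.push_back(r);
        else                           g.medium_rows.push_back(r);
    }
    return g;
}
```

Each bucket would then be tiled into MMA-shaped blocks and handled by the corresponding DASP strategy.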
Experimental results on two recent NVIDIA GPUs, the A100 and H800, show that DASP in FP64 precision outperforms five state-of-the-art SpMV methods (CSR5, TileSpMV, LSRB-CSR, and the cuSPARSE BSR and CSR formats) by factors of 1.46x, 2.09x, 3.29x, 2.08x, and 1.52x on average (up to 12.64x, 17.48x, 90.59x, 283.92x, and 6.94x) on the A100, respectively. For SpMV in FP16 precision, DASP outperforms cuSPARSE by factors of 1.70x and 1.75x on average (up to 26.47x and 65.94x) on the A100 and H800, respectively.

References

[1]
C. Alappat, A. Basermann, A. R. Bishop, H. Fehske, G. Hager, O. Schenk, J. Thies, and G. Wellein. A recursive algebraic coloring technique for hardware-efficient symmetric sparse matrix-vector multiplication. ACM Transactions on Parallel Computing, 7(3), 2020.
[2]
J. I. Aliaga, H. Anzt, T. Grützmacher, E. S. Quintana-Ortí, and A. E. Tomás. Compression and load balancing for efficient sparse matrix-vector product on multicore processors and graphics processing units. Concurrency and Computation: Practice and Experience, 34(14), 2022.
[3]
H. Anzt, T. Cojean, G. Flegar, F. Göbel, T. Grützmacher, P. Nayak, T. Ribizel, Y. M. Tsai, and E. S. Quintana-Ortí. Ginkgo: A modern linear operator algebra framework for high performance computing. ACM Transactions on Mathematical Software, 48(1), 2022.
[4]
H. Anzt, T. Cojean, C. Yen-Chen, J. Dongarra, G. Flegar, P. Nayak, S. Tomov, Y. M. Tsai, and W. Wang. Load-balancing sparse matrix vector product kernels on gpus. ACM Transactions on Parallel Computing, 7(1), 2020.
[5]
H. Anzt, S. Tomov, and J. Dongarra. On the performance and energy efficiency of sparse linear algebra on gpus. The International Journal of High Performance Computing Applications, 31(5), 2017.
[6]
A. Ashari, N. Sedaghati, J. Eisenlohr, S. Parthasarathy, and P. Sadayappan. Fast sparse matrix-vector multiplication on gpus for graph applications. In SC '14, 2014.
[7]
A. Ashari, N. Sedaghati, J. Eisenlohr, and P. Sadayappan. A model-driven blocking strategy for load balanced sparse matrix-vector multiplication on gpus. Journal of Parallel and Distributed Computing, 76, 2015.
[8]
N. Bell and M. Garland. Implementing sparse matrix-vector multiplication on throughput-oriented processors. In SC '09, 2009.
[9]
A. Benatia, W. Ji, Y. Wang, and F. Shi. Sparse matrix format selection with multiclass svm for spmv on gpu. In ICPP '16, 2016.
[10]
H. Bian, J. Huang, L. Liu, D. Huang, and X. Wang. Albus: A method for efficiently processing spmv using simd and load balancing. Future Generation Computer Systems, 116, 2021.
[11]
P. Blanchard, N. J. Higham, F. Lopez, T. Mary, and S. Pranesh. Mixed precision block fused multiply-add: Error analysis and application to gpu tensor cores. SIAM Journal on Scientific Computing, 42(3), 2020.
[12]
A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson. Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks. In SPAA '09, 2009.
[13]
A. Buluç, S. Williams, L. Oliker, and J. Demmel. Reduced-bandwidth multithreaded algorithms for sparse matrix-vector multiplication. In IPDPS '11, 2011.
[14]
A. Buttari, V. Eijkhout, J. Langou, and S. Filippone. Performance optimization and modeling of blocked sparse kernels. The International Journal of High Performance Computing Applications, 21(4), 2007.
[15]
Z. Chen, Z. Qu, L. Liu, Y. Ding, and Y. Xie. Efficient tensor core-based gpu kernels for structured sparsity under reduced precision. In SC '21, 2021.
[16]
J. W. Choi, A. Singh, and R. W. Vuduc. Model-driven autotuning of sparse matrix-vector multiply on gpus. In PPoPP '10, 2010.
[17]
J. Choquette, W. Gandhi, O. Giroux, N. Stam, and R. Krashinsky. Nvidia a100 tensor core gpu: Performance and innovation. IEEE Micro, 41(2), 2021.
[18]
R. Chowdhury, F. Silvestri, and F. Vella. A computational model for tensor core units. In SPAA '20, 2020.
[19]
R. Chowdhury, F. Silvestri, and F. Vella. Algorithm design for tensor units. In Euro-Par '21, 2021.
[20]
Y.-H. Chung, C.-J. Shih, and S.-H. Hung. Accelerating simulated quantum annealing with gpu and tensor cores. In ISC '22, 2022.
[21]
M. Daga and J. L. Greathouse. Structural agnostic spmv: Adapting csr-adaptive for irregular matrices. In HiPC '15, 2015.
[22]
A. Dakkak, C. Li, J. Xiong, I. Gelado, and W.-m. Hwu. Accelerating reduction and scan using tensor core units. In ICS '19, 2019.
[23]
S. Dalton, L. Olson, and N. Bell. Optimizing sparse matrix-matrix multiplication for the gpu. ACM Transactions on Mathematical Software, 41(4), 2015.
[24]
T. A. Davis and Y. Hu. The university of florida sparse matrix collection. ACM Transactions on Mathematical Software, 38(1), 2011.
[25]
J. Demmel, J. Dongarra, V. Eijkhout, E. Fuentes, A. Petitet, R. Vuduc, R. C. Whaley, and K. Yelick. Self-adapting linear algebra algorithms and software. Proceedings of the IEEE, 93(2), 2005.
[26]
J. Domke, E. Vatai, A. Drozd, P. Chen, Y. Oyama, L. Zhang, S. Salaria, D. Mukunoki, A. Podobas, M. Wahib, and S. Matsuoka. Matrix engines for high performance computing: A paragon of performance or grasping at straws? In IPDPS '21, 2021.
[27]
Z. Du, J. Li, Y. Wang, X. Li, G. Tan, and N. Sun. Alphasparse: Generating high performance spmv codes directly from sparse matrices. In SC '22, 2022.
[28]
S. Durrani, M. S. Chughtai, M. Hidayetoglu, R. Tahir, A. Dakkak, L. Rauchwerger, F. Zaffar, and W.-m. Hwu. Accelerating fourier and number theoretic transforms using tensor cores and warp shuffles. In PACT '21, 2021.
[29]
A. Elafrou, G. Goumas, and N. Koziris. Performance analysis and optimization of sparse matrix-vector multiplication on modern multi- and many-core processors. In ICPP '17, 2017.
[30]
A. Elafrou, G. Goumas, and N. Koziris. Conflict-free symmetric sparse matrix-vector multiplication on multicore architectures. In SC '19, 2019.
[31]
A. Elafrou, V. Karakasis, T. Gkountouvas, K. Kourtis, G. Goumas, and N. Koziris. Sparsex: A library for high-performance sparse matrix-vector multiplication on multicore platforms. ACM Transactions on Mathematical Software, 44(3), 2018.
[32]
B. Feng, Y. Wang, T. Geng, A. Li, and Y. Ding. Apnn-tc: Accelerating arbitrary precision neural networks on ampere gpu tensor cores. In SC '21, 2021.
[33]
S. Filippone, V. Cardellini, D. Barbieri, and A. Fanfarillo. Sparse matrix-vector multiplication on gpgpus. ACM Transactions on Mathematical Software, 43(4), 2017.
[34]
J. Finkelstein, J. S. Smith, S. M. Mniszewski, K. Barros, C. F. A. Negre, E. H. Rubensson, and A. M. N. Niklasson. Quantum-based molecular dynamics simulations using tensor cores. Journal of Chemical Theory and Computation, 17(10), 2021.
[35]
J. S. Firoz, A. Li, J. Li, and K. Barker. On the feasibility of using reduced-precision tensor core operations for graph analytics. In HPEC '20, 2020.
[36]
J. Gao, W. Ji, Z. Tan, Y. Wang, and F. Shi. Taichi: A hybrid compression format for binary sparse matrix-vector multiplication on gpu. IEEE Transactions on Parallel and Distributed Systems, 33(12), 2022.
[37]
C. Gómez, F. Mantovani, E. Focht, and M. Casas. Efficiently running spmv on long vector architectures. In PPoPP '21, 2021.
[38]
G. Goumas, K. Kourtis, N. Anastopoulos, V. Karakasis, and N. Koziris. Performance evaluation of the sparse matrix-vector multiplication on modern architectures. The Journal of Supercomputing, 50(1), 2009.
[39]
J. L. Greathouse and M. Daga. Efficient sparse matrix-vector multiplication on gpus using the csr storage format. In SC '14, 2014.
[40]
A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham. Harnessing gpu tensor cores for fast fp16 arithmetic to speed up mixed-precision iterative refinement solvers. In SC '18, 2018.
[41]
K. Ho, H. Zhao, A. Jog, and S. Mohanty. Improving gpu throughput through parallel execution using tensor cores and cuda cores. In ISVLSI '22, 2022.
[42]
N.-M. Ho and W.-F. Wong. Tensorox: Accelerating gpu applications via neural approximation on unused tensor cores. IEEE Transactions on Parallel and Distributed Systems, 33(2), 2022.
[43]
G. Huang, H. Li, M. Qin, F. Sun, Y. Ding, and Y. Xie. Shfl-bw: Accelerating deep neural network inference with tensor-core aware weight pruning. In DAC '22, 2022.
[44]
E.-J. Im and K. Yelick. Optimizing sparse matrix computations for register reuse in sparsity. In ICCS '01, 2001.
[45]
E.-J. Im, K. Yelick, and R. Vuduc. Sparsity: Optimization framework for sparse matrix kernels. The International Journal of High Performance Computing Applications, 18(1), 2004.
[46]
H. Ji, H. Song, S. Lu, Z. Jin, G. Tan, and W. Liu. Tilespmspv: A tiled algorithm for sparse matrix-sparse vector multiplication on gpus. In ICPP '22, 2022.
[47]
Z. Ji and C.-L. Wang. Efficient exact k-nearest neighbor graph construction for billion-scale datasets using gpus with tensor cores. In ICS '22, 2022.
[48]
E. Karimi, N. B. Agostini, S. Dong, and D. Kaeli. Vcsr: An efficient gpu memory-aware sparse format. IEEE Transactions on Parallel and Distributed Systems, 33(12), 2022.
[49]
H. Kim, S. Ahn, Y. Oh, B. Kim, W. W. Ro, and W. J. Song. Duplo: Lifting redundant memory accesses of deep neural networks for gpu tensor cores. In MICRO '20, 2020.
[50]
K. Kourtis, V. Karakasis, G. Goumas, and N. Koziris. Csx: An extended compression format for spmv on shared memory systems. In PPoPP '11, 2011.
[51]
M. Kreutzer, G. Hager, G. Wellein, H. Fehske, and A. R. Bishop. A unified sparse matrix data format for efficient general sparse matrix-vector multiplication on modern processors with wide simd units. SIAM Journal on Scientific Computing, 36(5), 2014.
[52]
D. Langr and P. Tvrdík. Evaluation criteria for sparse matrix storage formats. IEEE Transactions on Parallel and Distributed Systems, 27(2), 2016.
[53]
S. Lee, S. Hwang, M. J. Kim, J. Choi, and J. H. Ahn. Future scaling of memory hierarchy for tensor cores and eliminating redundant shared memory traffic using inter-warp multicasting. IEEE Transactions on Computers, 71(12), 2022.
[54]
A. Li, T. Geng, T. Wang, M. Herbordt, S. L. Song, and K. Barker. Bstc: A novel binarized-soft-tensor-core design for accelerating bit-based approximated neural nets. In SC '19, 2019.
[55]
A. Li and S. Su. Accelerating binarized neural networks via bit-tensor-cores in turing gpus. IEEE Transactions on Parallel and Distributed Systems, 32(7), 2021.
[56]
B. Li, S. Cheng, and J. Lin. tcfft: A fast half-precision fft library for nvidia tensor cores. In CLUSTER '21, 2021.
[57]
G. Li, J. Xue, L. Liu, X. Wang, X. Ma, X. Dong, J. Li, and X. Feng. Unleashing the low-precision computation potential of tensor cores on gpus. In CGO '21, 2021.
[58]
J. Li, G. Tan, M. Chen, and N. Sun. Smat: An input adaptive auto-tuner for sparse matrix-vector multiplication. In PLDI '13, 2013.
[59]
K. Li, W. Yang, and K. Li. Performance analysis and optimization for spmv on gpu using probabilistic modeling. IEEE Transactions on Parallel and Distributed Systems, 26(1), 2014.
[60]
S. Li, K. Osawa, and T. Hoefler. Efficient quantized sparse matrix operations on tensor cores. In SC '22, 2022.
[61]
W. Li, H. Cheng, Z. Lu, Y. Lu, and W. Liu. Haspmv: Heterogeneity-aware sparse matrix-vector multiplication on modern asymmetric multicore processors. In CLUSTER '23, 2023.
[62]
C. Liu, B. Xie, X. Liu, W. Xue, H. Yang, and X. Liu. Towards efficient spmv on sunway manycore architectures. In ICS '18, 2018.
[63]
L. Liu, M. Liu, C. Wang, and J. Wang. Lsrb-csr: A low overhead storage format for spmv on the gpu systems. In ICPADS '15, 2015.
[64]
W. Liu and B. Vinter. Csr5: An efficient storage format for cross-platform sparse matrix-vector multiplication. In ICS '15, 2015.
[65]
W. Liu and B. Vinter. Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors. Parallel Computing, 49(C), 2015.
[66]
X. Liu, Y. Liu, H. Yang, J. Liao, M. Li, Z. Luan, and D. Qian. Toward accelerated stencil computation by adapting tensor core unit on gpu. In ICS '22, 2022.
[67]
X. Liu, M. Smelyanskiy, E. Chow, and P. Dubey. Efficient sparse matrix-vector multiplication on x86-based many-core processors. In ICS '13, 2013.
[68]
Z. Lu and W. Liu. Tilesptrsv: A tiled algorithm for parallel sparse triangular solve on gpus. CCF Transactions on High Performance Computing, 5, 2023.
[69]
S. Markidis, S. W. D. Chien, E. Laure, I. B. Peng, and J. S. Vetter. Nvidia tensor core programmability, performance & precision. In IPDPSW '18, 2018.
[70]
M. Martineau, P. Atkinson, and S. McIntosh-Smith. Benchmarking the nvidia v100 gpu and tensor cores. In Euro-Par '18, 2019.
[71]
M. Martone. Efficient multithreaded untransposed, transposed or symmetric sparse matrix-vector multiplication with the recursive sparse blocks format. Parallel Computing, 40(7), 2014.
[72]
J. D. McCalpin. Memory bandwidth and machine balance in current high performance computers. IEEE Computer Society Technical Committee on Computer Architecture Newsletter, 19--25, 1995.
[73]
D. Merrill and M. Garland. Merge-based parallel sparse matrix-vector multiplication. In SC '16, 2016.
[74]
H. Mi, X. Yu, X. Yu, S. Wu, and W. Liu. Balancing computation and communication in distributed sparse matrix-vector multiplication. In CCGrid '23, 2023.
[75]
D. Mukunoki, K. Ozaki, T. Ogita, and T. Imamura. Dgemm using tensor cores, and its accurate and reproducible versions. In ISC '20, 2020.
[76]
Y. Niu, Z. Lu, M. Dong, Z. Jin, W. Liu, and G. Tan. Tilespmv: A tiled algorithm for sparse matrix-vector multiplication on gpus. In IPDPS '21, 2021.
[77]
Y. Niu, Z. Lu, H. Ji, S. Song, Z. Jin, and W. Liu. Tilespgemm: A tiled algorithm for parallel sparse general matrix-matrix multiplication on gpus. In PPoPP '22, 2022.
[78]
R. Nobre, A. Ilic, S. Santander-Jiménez, and L. Sousa. Exploring the binary precision capabilities of tensor cores for epistasis detection. In IPDPS '20, 2020.
[79]
H. Ootomo and R. Yokota. Recovering single precision accuracy from tensor cores while surpassing the fp32 theoretical peak performance. The International Journal of High Performance Computing Applications, 36(4), 2022.
[80]
L. Pisha and Ł. Ligowski. Accelerating non-power-of-2 size fourier transforms with gpu tensor cores. In IPDPS '21, 2021.
[81]
F. A. Quezada, C. A. Navarro, N. Hitschfeld, and B. Bustos. Squeeze: Efficient compact fractals for tensor core gpus. Future Generation Computer Systems, 135, 2022.
[82]
N. Sedaghati, T. Mu, L. Pouchet, S. Parthasarathy, and P. Sadayappan. Automatic selection of sparse matrix representation on gpus. In ICS '15, 2015.
[83]
Z. Song, J. Wang, T. Li, L. Jiang, J. Ke, X. Liang, and N. Jing. Gpnpu: Enabling efficient hardware-based direct convolution with multi-precision support in gpu tensor cores. In DAC '20, 2020.
[84]
M. Steinberger, R. Zayer, and H. Seidel. Globally homogeneous, locally adaptive sparse matrix-vector multiplication on the gpu. In ICS '17, 2017.
[85]
W. Sun, A. Li, T. Geng, S. Stuijk, and H. Corporaal. Dissecting tensor cores via microbenchmarks: Latency, throughput and numeric behaviors. IEEE Transactions on Parallel and Distributed Systems, 34(1), 2023.
[86]
W. Sun, S. Sioutas, S. Stuijk, A. Nelson, and H. Corporaal. Efficient tensor cores support in tvm for low-latency deep learning. In DATE '21, 2021.
[87]
Y. Sun, L. Zheng, Q. Wang, X. Ye, Y. Huang, P. Yao, X. Liao, and H. Jin. Accelerating sparse deep neural network inference using gpu tensor cores. In HPEC '22, 2022.
[88]
G. Tan, J. Liu, and J. Li. Design and implementation of adaptive spmv library for multicore and many-core architecture. ACM Transactions on Mathematical Software, 44(4), 2018.
[89]
J. Tu, M. A. Clark, C. Jung, and R. D. Mawhinney. Solving dwf dirac equation using multi-splitting preconditioned conjugate gradient with tensor cores on nvidia gpus. In PASC '21, 2021.
[90]
N. Tukanov, R. Srinivasaraghavan, J. E. Moreira, and T. M. Low. Modeling matrix engines for portability and performance. In IPDPS '22, 2022.
[91]
R. Vuduc, J. W. Demmel, and K. A. Yelick. Oski: A library of automatically tuned sparse matrix kernels. Journal of Physics: Conference Series, 16(1), 2005.
[92]
R. Vuduc, J. W. Demmel, K. A. Yelick, S. Kamil, R. Nishtala, and B. Lee. Performance optimizations and bounds for sparse matrix-vector multiply. In SC '02, 2002.
[93]
R. W. Vuduc and H.-J. Moon. Fast sparse matrix-vector multiplication by exploiting variable block structure. In HPCC '05, 2005.
[94]
Y. Wang, B. Feng, and Y. Ding. Qgtc: Accelerating quantized graph neural networks via gpu tensor core. In PPoPP '22, 2022.
[95]
S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Computing, 35(3), 2009.
[96]
B. Xie, J. Zhan, X. Liu, W. Gao, Z. Jia, X. He, and L. Zhang. Cvr: Efficient vectorization of spmv on x86 processors. In CGO '18, 2018.
[97]
D. Yan, W. Wang, and X. Chu. Demystifying tensor cores to optimize half-precision matrix multiply. In IPDPS '20, 2020.
[98]
S. Yan, C. Li, Y. Zhang, and H. Zhou. yaspmv: Yet another spmv framework on gpus. In PPoPP '14, 2014.
[99]
W. Yang, K. Li, Z. Mo, and K. Li. Performance optimization using partitioned spmv on gpus and multicore cpus. IEEE Transactions on Computers, 64(9), 2014.
[100]
X. Yang, S. Parthasarathy, and P. Sadayappan. Fast sparse matrix-vector multiplication on gpus: Implications for graph mining. Proceedings of the VLDB Endowment, 4(4), 2011.
[101]
S. Yesil, A. Heidarshenas, A. Morrison, and J. Torrellas. Wise: Predicting the performance of sparse matrix vector multiplication with machine learning. In PPoPP '23, 2023.
[102]
X. You, C. Liu, H. Yang, P. Wang, Z. Luan, and D. Qian. Vectorizing spmv by exploiting dynamic regular patterns. In ICPP '22, 2022.
[103]
A. N. Yzelman and R. H. Bisseling. Cache-oblivious sparse matrix-vector multiplication by using sparse matrix partitioning methods. SIAM Journal on Scientific Computing, 31(4), 2009.
[104]
A. N. Yzelman and R. H. Bisseling. Two-dimensional cache-oblivious sparse matrix-vector multiplication. Parallel Computing, 37(12), 2011.
[105]
A. N. Yzelman and D. Roose. High-level strategies for parallel shared-memory sparse matrix-vector multiplication. IEEE Transactions on Parallel and Distributed Systems, 25(1), 2014.
[106]
O. Zachariadis, N. Satpute, J. Gómez-Luna, and J. Olivares. Accelerating sparse matrix-matrix multiplication with gpu tensor cores. Computers & Electrical Engineering, 88, 2020.
[107]
Y. Zhang, S. Li, F. Yuan, D. Dong, X. Yang, T. Li, and Z. Wang. Memory-aware optimization for sequences of sparse matrix-vector multiplications. In IPDPS '23, 2023.
[108]
Y. Zhao, J. Li, C. Liao, and X. Shen. Bridging the gap between deep learning and sparse matrix format selection. In PPoPP '18, 2018.
[109]
Y. Zhao, W. Zhou, X. Shen, and G. Yiu. Overhead-conscious format selection for spmv-based applications. In IPDPS '18, 2018.

Cited By

  • (2024) Bitmap-Based Sparse Matrix-Vector Multiplication with Tensor Cores. Proceedings of the 53rd International Conference on Parallel Processing, 1135-1144. DOI: 10.1145/3673038.3673055. Online publication date: 12-Aug-2024.
  • (2024) CAMLB-SpMV: An Efficient Cache-Aware Memory Load-Balancing SpMV on CPU. Proceedings of the 53rd International Conference on Parallel Processing, 640-649. DOI: 10.1145/3673038.3673042. Online publication date: 12-Aug-2024.
  • (2023) HASpMV: Heterogeneity-Aware Sparse Matrix-Vector Multiplication on Modern Asymmetric Multicore Processors. 2023 IEEE International Conference on Cluster Computing (CLUSTER), 209-220. DOI: 10.1109/CLUSTER52292.2023.00025. Online publication date: 31-Oct-2023.

Published In

SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2023, 1428 pages
ISBN: 9798400701092
DOI: 10.1145/3581784
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPU
  2. tensor core
  3. matrix multiply-accumulate
  4. sparse matrix-vector multiplication

Conference

SC '23
Acceptance Rates

Overall acceptance rate: 1,516 of 6,373 submissions (24%)
