DOI: 10.1145/3458817.3476177
Research article · Open access

Accelerating applications using edge tensor processing units

Published: 13 November 2021

Abstract

Neural network (NN) accelerators have been integrated into a wide spectrum of computer systems to accommodate the rapidly growing demand for artificial intelligence (AI) and machine learning (ML) applications. NN accelerators share the idea of providing native hardware support for operations on multidimensional tensor data. Therefore, NN accelerators are in principle tensor processors that can improve system performance for any problem that takes tensors as inputs/outputs. Unfortunately, commercially available NN accelerators expose their computation capabilities only through AI/ML-specific interfaces. Furthermore, NN accelerators reveal very few hardware design details, so applications cannot easily leverage the tensor operations these accelerators provide.
This paper introduces General-Purpose Computing on Tensor Processing Units (GPTPU), an open-source, open-architecture framework that allows the developer and research communities to discover opportunities that NN accelerators enable for applications. GPTPU includes a powerful programming interface with efficient runtime system-level support---similar to that of CUDA/OpenCL in GPGPU computing---to bridge the gap between application demands and mismatched hardware/software interfaces.
We built a GPTPU prototype machine using Edge Tensor Processing Units (Edge TPUs), which are widely available and representative of many commercial NN accelerators. We identified several novel use cases and revisited their algorithms accordingly. By leveraging the underlying Edge TPUs to perform tensor-algorithm-based compute kernels, our results reveal that GPTPU can achieve a 2.46× speedup over high-end CPUs and reduce energy consumption by 40%.
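The abstract alludes to offloading tensor-algorithm-based kernels onto an Edge TPU through a CUDA/OpenCL-like interface. As a loose illustration only (this is not the actual GPTPU API), the NumPy sketch below mimics the offload pattern such a framework must manage: Edge TPUs compute on 8-bit quantized tensors, so a kernel like matrix multiplication is quantized to int8, executed with integer accumulation, and rescaled back to floating point. The names `quantize` and `gptpu_style_matmul` are hypothetical.

```python
import numpy as np

def quantize(x):
    """Symmetric int8 quantization (hypothetical helper; Edge TPUs
    operate on 8-bit quantized tensors)."""
    m = float(np.abs(x).max())
    scale = m / 127.0 if m > 0 else 1.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def gptpu_style_matmul(a, b):
    """Hypothetical offload wrapper: quantize both operands, run the
    integer matmul (NumPy stands in for the Edge TPU systolic array),
    then rescale the int32 accumulator back to floating point."""
    qa, sa = quantize(a)
    qb, sb = quantize(b)
    acc = qa.astype(np.int32) @ qb.astype(np.int32)  # int32 accumulation
    return acc.astype(np.float32) * (sa * sb)

rng = np.random.default_rng(0)
a = rng.random((64, 64), dtype=np.float32)
b = rng.random((64, 64), dtype=np.float32)
approx = gptpu_style_matmul(a, b)
exact = a @ b
rel_err = float(np.abs(approx - exact).max() / np.abs(exact).max())
```

The quantize/compute/rescale round trip is what lets a general kernel run on an NN accelerator at all, and it makes results approximate; the reported speedup and energy savings therefore come with an accuracy trade-off that a framework like GPTPU must manage.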

Supplementary Material

MP4 File (Accelerating Applications using Edge Tensor Processing Units.mp4)
Presentation video




          Published In

          SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
          November 2021
          1493 pages
          ISBN:9781450384421
          DOI:10.1145/3458817
          This work is licensed under a Creative Commons Attribution 4.0 International License.

          In-Cooperation

          • IEEE CS

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Acceptance Rates

          Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

          Cited By

          • (2024) "Object Detection with Hyperparameter and Image Enhancement Optimisation for a Smart and Lean Pick-and-Place Solution," Signals, vol. 5, no. 1, pp. 87-104, DOI: 10.3390/signals5010005. Online publication date: 26-Feb-2024.
          • (2024) "An advanced multimodal driver-assistance prototype for emergency-vehicle detection," Integrated Computer-Aided Engineering, vol. 31, no. 4, pp. 381-399, DOI: 10.3233/ICA-240733. Online publication date: 1-Jan-2024.
          • (2024) "Simultaneous and Heterogenous Multithreading: Exploiting Simultaneous and Heterogeneous Parallelism in Accelerator-Rich Architectures," IEEE Micro, vol. 44, no. 4, pp. 11-19, DOI: 10.1109/MM.2024.3414941. Online publication date: Jul-2024.
          • (2024) "Accel-Bench: Exploring the Potential of Programming Using Hardware-Accelerated Functions," 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 301-303, DOI: 10.1109/ISPASS61541.2024.00038. Online publication date: 5-May-2024.
          • (2024) "Accelerate Large Language Model Inference on Edge TPU with OpenVX framework," 2024 IEEE 6th International Conference on AI Circuits and Systems (AICAS), pp. 502-506, DOI: 10.1109/AICAS59952.2024.10595950. Online publication date: 22-Apr-2024.
          • (2023) "Advancements in Artificial Intelligence Circuits and Systems (AICAS)," Electronics, vol. 13, no. 1, p. 102, DOI: 10.3390/electronics13010102. Online publication date: 26-Dec-2023.
          • (2023) "Exposing Reliability Degradation and Mitigation in Approximate DNNs Under Permanent Faults," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 31, no. 4, pp. 555-566, DOI: 10.1109/TVLSI.2023.3238907. Online publication date: 1-Apr-2023.
          • (2023) "FPGA based Real-Time simulation of FlyBack converter using graphical programming tools," 2023 10th International Conference on Modern Power Systems (MPS), pp. 1-8, DOI: 10.1109/MPS58874.2023.10187573. Online publication date: 21-Jun-2023.
          • (2023) "APPEND: Rethinking ASIP Synthesis in the Era of AI," 2023 60th ACM/IEEE Design Automation Conference (DAC), pp. 1-6, DOI: 10.1109/DAC56929.2023.10247872. Online publication date: 9-Jul-2023.
          • (2023) "Opportunities, Applications, and Challenges of Edge-AI Enabled Video Analytics in Smart Cities: A Systematic Review," IEEE Access, vol. 11, pp. 80543-80572, DOI: 10.1109/ACCESS.2023.3300658. Online publication date: 2023.
