research-article

Analytical Performance Estimation for Large-Scale Reconfigurable Dataflow Platforms

Published: 12 August 2021

Abstract

Next-generation high-performance computing platforms will handle extremely data- and compute-intensive problems that are intractable with today’s technology. A promising path to the next leap in high-performance computing is to embrace heterogeneity and specialised computing in the form of reconfigurable accelerators such as FPGAs, which have been shown to speed up compute-intensive tasks with reduced power consumption. However, assessing the feasibility of large-scale heterogeneous systems requires fast and accurate performance prediction. This article proposes Performance Estimation for Reconfigurable Kernels and Systems (PERKS), a novel performance estimation framework for reconfigurable dataflow platforms. PERKS uses an analytical model with machine and application parameters to predict the performance of multi-accelerator systems and detect their bottlenecks. Model calibration is automatic, making the model flexible and usable for different machine configurations and applications, including hypothetical ones. Our experimental results show that PERKS can predict the performance of current workloads on reconfigurable dataflow platforms with an accuracy above 91%. The results also illustrate how the modelling scales to large workloads, and how the performance impact of architectural features can be estimated in seconds.


      Published In

      ACM Transactions on Reconfigurable Technology and Systems, Volume 14, Issue 3
      September 2021
      137 pages
      ISSN: 1936-7406
      EISSN: 1936-7414
      DOI: 10.1145/3472296
      Editor: Deming Chen
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 12 August 2021
      Accepted: 01 February 2021
      Revised: 01 December 2020
      Received: 01 June 2020
      Published in TRETS Volume 14, Issue 3

      Author Tags

      1. Performance modelling
      2. heterogeneous systems
      3. reconfigurable dataflow platforms
      4. FPGAs

      Qualifiers

      • Research-article
      • Refereed

      Funding Sources

      • EU H2020 Research and Innovation Programme
      • UK EPSRC
      • JST/CREST program “Research and Development on Unified Environment of Accelerated Computing and Interconnection for Post-Petascale Era”
      • JSPS KAKENHI
