DOI: 10.1145/3168832
Transforming loop chains via macro dataflow graphs

Published: 24 February 2018

Abstract

This paper describes an approach to performance optimization using modified macro dataflow graphs, which contain nodes representing the loops and data involved in a stencil computation. The targeted applications include existing scientific applications that contain a series of stencil computations that share data, i.e., loop chains. The performance of stencil applications can be improved by modifying their execution schedules. However, modern architectures are increasingly constrained by memory subsystem bandwidth. To fully realize the locality benefits of the schedule changes, temporary storage allocation must also be minimized.
We present a macro dataflow graph variant that includes dataset nodes, a cost model that quantifies the memory interactions required by a given graph, a set of transformations that can be performed on the graphs such as fusion and tiling, and an approach for generating code to implement the transformed graph. We include a performance comparison with Halide and PolyMage implementations of the benchmark. Our fastest variant outperforms the auto-tuned variants produced by both frameworks.
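To make the link between schedule changes and temporary storage concrete, the following is a minimal sketch of a two-loop chain in one dimension. It is not taken from the paper: the function names `chain_unfused` and `chain_fused`, the averaging and differencing stencils, and the returned count of live temporary values are all assumptions chosen for illustration. The unfused version materializes a full temporary array between the loops, while the fused version carries only two temporary values.

```python
def chain_unfused(u):
    """Two-loop chain: loop 1 fills a full temporary, loop 2 consumes it."""
    n = len(u)
    flux = [0.0] * (n - 1)              # temporary dataset with n - 1 elements
    for i in range(n - 1):              # loop 1: average of neighbouring cells
        flux[i] = 0.5 * (u[i] + u[i + 1])
    out = [0.0] * (n - 2)
    for i in range(n - 2):              # loop 2: difference of adjacent temporaries
        out[i] = flux[i + 1] - flux[i]
    return out, len(flux)               # result and temporary element count


def chain_fused(u):
    """Same chain after fusion: the temporary shrinks to two scalars."""
    n = len(u)
    out = [0.0] * (n - 2)
    prev = 0.5 * (u[0] + u[1])          # plays the role of flux[0]
    for i in range(n - 2):
        cur = 0.5 * (u[i + 1] + u[i + 2])   # plays the role of flux[i + 1]
        out[i] = cur - prev
        prev = cur                      # reuse the value in the next iteration
    return out, 2                       # result and live temporary values


u = [1.0, 2.0, 4.0, 7.0, 11.0]
unfused_out, unfused_tmp = chain_unfused(u)
fused_out, fused_tmp = chain_fused(u)
print(unfused_out == fused_out, unfused_tmp, fused_tmp)
```

Both versions compute the same output, but the fused schedule only pays off in memory traffic because the temporary is also reduced. Keeping the full `flux` array inside the fused loop would preserve the extra reads and writes that the abstract identifies as the bottleneck.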

Published In

CGO 2018: Proceedings of the 2018 International Symposium on Code Generation and Optimization
February 2018
377 pages
ISBN: 9781450356176
DOI: 10.1145/3179541
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dataflow
  2. loop chain
  3. stencil
  4. storage optimizations

Qualifiers

  • Research-article

Conference

CGO '18
Acceptance Rates

Overall Acceptance Rate 312 of 1,061 submissions, 29%

Cited By

  • (2022) Memory optimizations in an array language. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.5555/3571885.3571926. Online publication date: 13-Nov-2022.
  • (2022) Memory Optimizations in an Array Language. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.1109/SC41404.2022.00036. Online publication date: Nov-2022.
  • (2022) Techniques for Managing Polyhedral Dataflow Graphs. Languages and Compilers for Parallel Computing, 134-150. DOI: 10.1007/978-3-030-99372-6_9. Online publication date: 24-Mar-2022.
  • (2021) A Structured Grid Solver with Polyhedral+Dataflow Representation. Languages and Compilers for Parallel Computing, 127-146. DOI: 10.1007/978-3-030-72789-5_10. Online publication date: 26-Mar-2021.
  • (2020) An Effective Fusion and Tile Size Model for PolyMage. ACM Transactions on Programming Languages and Systems 42:3, 1-27. DOI: 10.1145/3404846. Online publication date: 8-Nov-2020.
  • (2019) Flextended Tiles. ACM Transactions on Architecture and Code Optimization 16:4, 1-25. DOI: 10.1145/3369382. Online publication date: 17-Dec-2019.
  • (2019) POSTER: A Polyhedral+Dataflow Intermediate Language for Performance Exploration. 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT), 499-500. DOI: 10.1109/PACT.2019.00064. Online publication date: Sep-2019.
