Transforming loop chains via macro dataflow graphs

Authors:

Eddie C. Davis,

Michelle Mills Strout,

Catherine Olschanowsky

CGO 2018: Proceedings of the 2018 International Symposium on Code Generation and Optimization

Pages 265 - 277

https://doi.org/10.1145/3168832

Published: 24 February 2018

Abstract

This paper describes an approach to performance optimization using modified macro dataflow graphs, which contain nodes representing the loops and data involved in a stencil computation. The targeted applications include existing scientific applications that contain a series of stencil computations that share data, i.e., loop chains. The performance of stencil applications can be improved by modifying the execution schedules. However, modern architectures are increasingly constrained by memory subsystem bandwidth. To fully realize the benefits of schedule changes for improved locality, temporary storage allocation must also be minimized.
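To make the idea of a loop chain concrete, the following is a minimal sketch and is not taken from the paper. It assumes a hypothetical one-dimensional chain of two stencil loops (`chain_unfused`), where the second loop consumes a full temporary array produced by the first, and a fused schedule (`chain_fused`) in which the temporary shrinks to two scalars. The function names and the stencils themselves are illustrative assumptions.

```python
def chain_unfused(a):
    """Two loops in a chain; the second reads the whole temporary array."""
    n = len(a)
    # Loop 1: two-point average, producing a temporary of n - 1 values.
    tmp = [0.5 * (a[i] + a[i + 1]) for i in range(n - 1)]
    # Loop 2: difference of neighbouring temporaries, producing n - 2 values.
    return [tmp[i + 1] - tmp[i] for i in range(n - 2)]


def chain_fused(a):
    """Same result with the loops fused; temporary storage is two scalars."""
    n = len(a)
    out = []
    if n < 3:
        return out
    prev = 0.5 * (a[0] + a[1])  # first temporary value
    for i in range(n - 2):
        cur = 0.5 * (a[i + 1] + a[i + 2])  # next temporary value
        out.append(cur - prev)
        prev = cur  # slide the two-value window forward
    return out


if __name__ == "__main__":
    data = [1.0, 2.0, 4.0, 7.0, 11.0]
    print(chain_unfused(data))  # [1.5, 2.5, 3.5]
    print(chain_fused(data))    # [1.5, 2.5, 3.5]
```

Both schedules compute the same values; the fused version only illustrates why a schedule change and a storage reduction have to go together to improve locality.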

We present a macro dataflow graph variant that includes dataset nodes, a cost model that quantifies the memory interactions required by a given graph, a set of graph transformations such as fusion and tiling, and an approach for generating code to implement the transformed graph. We include a performance comparison with Halide and PolyMage implementations of the benchmark. Our fastest variant outperforms the auto-tuned variants produced by both frameworks.
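The abstract does not give the graph representation or the cost model in detail, so the following is only a rough sketch under stated assumptions. It assumes a hypothetical `Graph` class in which statement nodes read and write named dataset nodes, a toy cost that sums the sizes of all datasets touched by read and write edges, and a `fuse` operation that merges a producer and a consumer and shrinks their shared temporary. None of these names or formulas come from the paper, and no dependence or legality analysis is performed.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Toy graph: statement nodes connected to dataset nodes (hypothetical)."""
    sizes: dict                                # dataset name -> element count
    stmts: dict = field(default_factory=dict)  # statement -> (reads, writes)

    def add_stmt(self, name, reads, writes):
        self.stmts[name] = (tuple(reads), tuple(writes))

    def cost(self):
        """Assumed cost: total elements moved over all read and write edges."""
        return sum(self.sizes[d] for r, w in self.stmts.values() for d in r + w)

    def fuse(self, producer, consumer, temp, reduced_size=1):
        """Return a new graph with producer and consumer merged.

        Assumes `temp` is written only by `producer` and read only by
        `consumer`, so it becomes internal to the fused node.
        """
        p_reads, p_writes = self.stmts[producer]
        c_reads, c_writes = self.stmts[consumer]
        if temp not in p_writes or temp not in c_reads:
            raise ValueError("temp must flow from producer to consumer")
        for name, (r, w) in self.stmts.items():
            if name not in (producer, consumer) and temp in r + w:
                raise ValueError("temp is used by another statement")
        # Drop the temporary from the external edges, keeping edge order.
        reads = tuple(d for d in dict.fromkeys(p_reads + c_reads) if d != temp)
        writes = tuple(d for d in dict.fromkeys(p_writes + c_writes) if d != temp)
        sizes = dict(self.sizes)
        sizes[temp] = reduced_size  # storage reduction for the temporary
        fused = Graph(sizes)
        for name, (r, w) in self.stmts.items():
            if name == producer:
                fused.add_stmt(producer + "+" + consumer, reads, writes)
            elif name != consumer:
                fused.add_stmt(name, r, w)
        return fused


if __name__ == "__main__":
    g = Graph({"a": 1000, "tmp": 999, "out": 998})
    g.add_stmt("avg", ["a"], ["tmp"])
    g.add_stmt("diff", ["tmp"], ["out"])
    print(g.cost())                             # 3996
    print(g.fuse("avg", "diff", "tmp").cost())  # 1998
```

In this toy model, fusion lowers the cost only because the temporary no longer appears on any edge between nodes, which is one simple way to see why dataset nodes are useful when comparing schedules.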

Index Terms

Transforming loop chains via macro dataflow graphs
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
    2. Context specific languages
      1. Macro languages

Published In

CGO 2018: Proceedings of the 2018 International Symposium on Code Generation and Optimization

February 2018

377 pages

ISBN:9781450356176

DOI:10.1145/3179541

General Chairs: Jens Knoop (Vienna University of Technology, Austria), Markus Schordan (Lawrence Livermore National Laboratory, USA)

Program Chairs: Teresa Johnson (Google, USA), Michael O'Boyle (University of Edinburgh, UK)

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

DOE
NSF

Conference

CGO '18

CGO '18: 16th Annual IEEE/ACM International Symposium on Code Generation and Optimization

February 24 - 28, 2018

Vienna, Austria

Acceptance Rates

Overall Acceptance Rate 312 of 1,061 submissions, 29%

Cited By

Munksgaard P, Henriksen T, Sadayappan P, Oancea C, Wolf F, Shende S, Culhane C, Alam S, Jagode H. (2022). Memory optimizations in an array language. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 1-15. Online publication date: 13-Nov-2022.
https://dl.acm.org/doi/10.5555/3571885.3571926
Munksgaard P, Henriksen T, Sadayappan P, Oancea C. (2022). Memory Optimizations in an Array Language. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15. Online publication date: Nov-2022.
https://doi.org/10.1109/SC41404.2022.00036
Shankar R, Orenstein A, Rift A, Popoola T, Lowe M, Yang S, Mikesell T, Olschanowsky C. (2022). Techniques for Managing Polyhedral Dataflow Graphs. Languages and Compilers for Parallel Computing, 134-150. Online publication date: 24-Mar-2022.
https://doi.org/10.1007/978-3-030-99372-6_9
Davis E, Olschanowsky C, Van Straalen B. (2021). A Structured Grid Solver with Polyhedral+Dataflow Representation. Languages and Compilers for Parallel Computing, 127-146. Online publication date: 26-Mar-2021.
https://doi.org/10.1007/978-3-030-72789-5_10
Jangda A, Bondhugula U. (2020). An Effective Fusion and Tile Size Model for PolyMage. ACM Transactions on Programming Languages and Systems 42:3, 1-27. Online publication date: 8-Nov-2020.
https://dl.acm.org/doi/10.1145/3404846
Zhao J, Cohen A. (2019). Flextended Tiles. ACM Transactions on Architecture and Code Optimization 16:4, 1-25. Online publication date: 17-Dec-2019.
https://dl.acm.org/doi/10.1145/3369382
Davis E, Olschanowsky C. (2019). POSTER: A Polyhedral+Dataflow Intermediate Language for Performance Exploration. 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT), 499-500. Online publication date: Sep-2019.
https://doi.org/10.1109/PACT.2019.00064
