research-article

Just Accepted

Canalis: A Throughput-Optimized Framework for Real-Time Stream Processing of Wireless Communication

Authors: Kuan-Yu Chen, Thomas Mason Nelson, Alireza Khadem, Morteza Fayazi, Sanjay Sri Vallabh Singapuram, Ronald Dreslinski, Nishil Talati, Hun-Seok Kim, David Blaauw

ACM Transactions on Reconfigurable Technology and Systems

Accepted on 28 August 2024

https://doi.org/10.1145/3695880

Online AM: 18 September 2024

Abstract

Stream processing, the real-time computation of data as it is created or received, is vital for many applications, particularly wireless communication. Evolving protocols, high throughput requirements, and diverse processing patterns make it demanding. Traditional platforms struggle to meet real-time throughput and latency requirements because of large data volumes, sequential and nondeterministic data arrival, and variable data rates, all of which lead to inefficient memory access and parallel processing. We present Canalis, a throughput-optimized framework designed to address these challenges, delivering high performance at low energy consumption. Canalis is a hardware-software co-designed system. It includes FluxSPU (Flux Stream Processing Unit), a programmable spatial architecture proposed in this work to improve data throughput and energy efficiency. FluxSPU is accompanied by a software stack that eases programming. We evaluate Canalis on eight distinct benchmarks. To demonstrate the effectiveness of domain specialization, we compare it against the CPU and GPU of a mobile SoC: Canalis achieves average speedups of 13.4× and 6.6× and energy savings of 189.8× and 283.9×, respectively. Compared with equivalent ASICs for the benchmarks, the average energy overhead of Canalis is within 2.4×, so it retains generality without incurring significant overhead.

Published In

ACM Transactions on Reconfigurable Technology and Systems Just Accepted

EISSN:1936-7414

Copyright © 2024 Copyright held by the owner/author(s).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Online AM: 18 September 2024

Accepted: 28 August 2024

Revised: 18 July 2024

Received: 20 January 2024
