PartLy: learning data partitioning for distributed data stream processing

Authors: Ahmed S. Abdelhamid, Walid G. Aref

aiDM '20: Proceedings of the Third International Workshop on Exploiting Artificial Intelligence Techniques for Data Management

Article No.: 6, Pages 1 - 4

https://doi.org/10.1145/3401071.3401660

Published: 14 June 2020

Abstract

Data partitioning plays a critical role in data stream processing. Current data partitioning techniques rely on simple, static heuristics that do not incorporate feedback about the quality of the partitioning decision (i.e., a fire-and-forget strategy). Hence, the data partitioner often repeats the same decision regardless of its outcome. In this paper, we argue that reinforcement learning techniques can be applied to address this problem, and that artificial neural networks can facilitate the learning of efficient partitioning policies. We identify the challenges that emerge when applying machine learning techniques to the data partitioning problem for distributed data stream processing. Furthermore, we introduce PartLy, a proof-of-concept data partitioner, and present preliminary results that indicate PartLy's potential to match the performance of state-of-the-art techniques in terms of partitioning quality, while minimizing storage and processing overheads.
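To make the contrast in the abstract concrete, the following is a minimal sketch of a static, fire-and-forget hash partitioner next to a partitioner that uses feedback about its own decisions. It is not PartLy's method: the abstract describes neural-network policies, whereas this toy substitutes a tabular epsilon-greedy learner. The names `FeedbackPartitioner`, `hash_partition`, `imbalance`, and `run_demo`, the reward (negative load of the chosen partition), the synthetic skewed stream, and all parameter values are assumptions made for illustration only. The toy learner also ignores keys entirely, which a real stream partitioner could not do for free.

```python
import random
import zlib
from collections import Counter


def hash_partition(key: str, num_partitions: int) -> int:
    """Static heuristic: a key always maps to the same partition, with no feedback."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class FeedbackPartitioner:
    """Toy epsilon-greedy learner (hypothetical stand-in, not PartLy's neural policy)."""

    def __init__(self, num_partitions, epsilon=0.1, step_size=0.5, seed=0):
        self.n = num_partitions
        self.epsilon = epsilon          # probability of a random exploratory choice
        self.step_size = step_size      # how fast value estimates track new rewards
        self.rng = random.Random(seed)  # seeded so the run is reproducible
        self.value = [0.0] * num_partitions  # estimated reward per partition
        self.load = [0] * num_partitions     # tuples assigned to each partition

    def choose(self) -> int:
        """Pick a partition: explore at random, otherwise take the best estimate."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(self.n)
        return max(range(self.n), key=lambda p: self.value[p])

    def feedback(self, partition: int) -> None:
        """Record the assignment and update the estimate from the observed reward."""
        self.load[partition] += 1
        reward = -float(self.load[partition])  # assumed reward: heavier is worse
        self.value[partition] += self.step_size * (reward - self.value[partition])


def imbalance(loads) -> float:
    """Maximum partition load divided by mean load (1.0 is perfectly balanced)."""
    return max(loads) / (sum(loads) / len(loads))


def run_demo(num_tuples=1000, num_partitions=4):
    # Synthetic skewed stream (assumption): 70% of tuples share the key "hot".
    keys = ["hot" if i % 10 < 7 else f"k{i}" for i in range(num_tuples)]

    hash_counts = Counter(hash_partition(k, num_partitions) for k in keys)
    hash_loads = [hash_counts.get(p, 0) for p in range(num_partitions)]

    learner = FeedbackPartitioner(num_partitions)
    for _ in keys:
        learner.feedback(learner.choose())

    return imbalance(hash_loads), imbalance(learner.load)


if __name__ == "__main__":
    hash_imb, learned_imb = run_demo()
    print(f"hash imbalance:     {hash_imb:.2f}")
    print(f"feedback imbalance: {learned_imb:.2f}")
```

Because every "hot" tuple hashes to the same partition, the static partitioner cannot do better than an imbalance of 2.8 on this stream, while the feedback loop steers later tuples away from heavily loaded partitions. Spreading one key across partitions has its own cost (for example, partial results per key must be merged later), which this sketch deliberately leaves out.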

Published In

aiDM '20: Proceedings of the Third International Workshop on Exploiting Artificial Intelligence Techniques for Data Management

June 2020

33 pages

ISBN:9781450380294

DOI:10.1145/3401071

Conference Chairs: Rajesh Bordawekar (IBM T. J. Watson Research Center), Oded Shmueli (Technion), Nesime Tatbul (MIT and Intel Labs), Tin Kam Ho (IBM)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Science Foundation

Conference

SIGMOD/PODS '20

Sponsor:

SIGMOD

SIGMOD/PODS '20: International Conference on Management of Data

June 14 - 20, 2020

Portland, Oregon

Acceptance Rates

aiDM '20 Paper Acceptance Rate 6 of 6 submissions, 100%;

Overall Acceptance Rate 19 of 26 submissions, 73%

Article Metrics

Total Citations: 4

Total Downloads: 346

Downloads (Last 12 months): 103

Downloads (Last 6 weeks): 19

Reflects downloads up to 15 Sep 2024

Cited By

Liu G, Wang Z, Zhou A, Mao R (2024). Adaptive key partitioning in distributed stream processing. CCF Transactions on High Performance Computing 6:2 (164-178). Online publication date: 12-Jan-2024.
https://doi.org/10.1007/s42514-023-00179-3

Aslam A, Simonini G, Gagliardelli L, Mozzillo A, Bergamaschi S (2023). HKS: Efficient Data Partitioning for Stateful Streaming. Big Data Analytics and Knowledge Discovery (386-391). Online publication date: 10-Aug-2023.
https://doi.org/10.1007/978-3-031-39831-5_35

Zapridou E, Mytilinis I, Ailamaki A (2022). Dalton. Proceedings of the VLDB Endowment 16:3 (491-504). Online publication date: 1-Nov-2022.
https://dl.acm.org/doi/10.14778/3570690.3570699

Aslam A, Chen H, Jin H (2021). Pre-filtering based summarization for data partitioning in distributed stream processing. Concurrency and Computation: Practice and Experience 33:20. Online publication date: 30-Apr-2021.
https://doi.org/10.1002/cpe.6338
