
Towards HPC I/O Performance Prediction through Large-scale Log Analysis

Published: 23 June 2020

Abstract

Large-scale high performance computing (HPC) systems typically consist of many thousands of CPUs and storage units and are used by hundreds to thousands of users at the same time. Applications from this large user base have diverse characteristics, such as varying compute, communication, memory, and I/O intensity. A good understanding of the performance characteristics of each user application is important for job scheduling and resource provisioning. Among these characteristics, I/O performance is difficult to predict because the I/O system software is complex, the I/O system is shared among all users, and I/O operations also rely heavily on the networking systems. To improve I/O performance prediction on HPC systems, we propose to integrate information from a number of different system logs and develop a regression-based approach that dynamically selects the most relevant features from the most recent log entries and automatically selects the best regression algorithm for the prediction task. Evaluation results show that our proposed scheme can predict the I/O performance with up to 84% prediction accuracy for I/O-intensive applications, using the logs from the Cori supercomputer at NERSC.
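
As a rough illustration (not the authors' implementation), the sketch below shows one way the approach described above could look in practice: features drawn from recent log entries are filtered by their correlation with the target, several candidate regressors are evaluated, and the best-scoring one is selected automatically. The input file, column names, candidate models, and the use of scikit-learn are all assumptions made for illustration.

```python
# Hypothetical sketch of dynamic feature selection plus automatic
# regressor selection over merged HPC log data; all names are illustrative.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline

# Assume recent entries from several system logs have been merged into one
# table, one row per job, with candidate features (e.g., bytes written,
# node count, file-system load) and the observed I/O throughput as target.
logs = pd.read_csv("merged_job_logs.csv")      # hypothetical input file
X = logs.drop(columns=["io_throughput"])       # candidate features
y = logs["io_throughput"]                      # prediction target

candidates = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

best_name, best_score = None, float("-inf")
for name, model in candidates.items():
    # Keep only the features most correlated with the target, then fit the model.
    pipeline = make_pipeline(SelectKBest(f_regression, k=min(10, X.shape[1])), model)
    score = cross_val_score(pipeline, X, y, cv=5, scoring="r2").mean()
    if score > best_score:
        best_name, best_score = name, score

print(f"Selected regressor: {best_name} (mean cross-validated R^2 = {best_score:.2f})")
```

In this sketch, re-running the script as new log entries arrive re-ranks the features and may pick a different regressor, mirroring the dynamic selection the abstract describes.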

Supplementary Material

MP4 File (3369583.3392678.mp4)
Presentation video

Published In

HPDC '20: Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing
June 2020
246 pages
ISBN:9781450370523
DOI:10.1145/3369583
© 2020 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. I/O performance prediction
  2. distributed file system
  3. high performance computing
  4. log analysis

Qualifiers

  • Research-article

Funding Sources

  • the Office of Advanced Scientific Computing Research, Office of Science, of the U.S. Department of Energy under Contract
  • the National Research Foundation of Korea (NRF)

Conference

HPDC '20

Acceptance Rates

Overall Acceptance Rate 166 of 966 submissions, 17%

Cited By

  • (2024) Tarazu: An Adaptive End-to-end I/O Load-balancing Framework for Large-scale Parallel File Systems. ACM Transactions on Storage 20(2), 1-42. https://doi.org/10.1145/3641885
  • (2024) A2FL: Autonomous and Adaptive File Layout in HPC through Real-time Access Pattern Analysis. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 506-518. https://doi.org/10.1109/IPDPS57955.2024.00051
  • (2024) Relative Performance Prediction Using Few-Shot Learning. 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), 1764-1769. https://doi.org/10.1109/COMPSAC61105.2024.00278
  • (2024) Olsync: Object-level tiering and coordination in tiered storage systems based on software-defined network. Future Generation Computer Systems, 107521. https://doi.org/10.1016/j.future.2024.107521
  • (2023) Design and implementation of I/O performance prediction scheme on HPC systems through large-scale log analysis. Journal of Big Data 10(1). https://doi.org/10.1186/s40537-023-00741-4
  • (2023) I/O Access Patterns in HPC Applications: A 360-Degree Survey. ACM Computing Surveys 56(2), 1-41. https://doi.org/10.1145/3611007
  • (2023) I/O Burst Prediction for HPC Clusters Using Darshan Logs. 2023 IEEE 19th International Conference on e-Science (e-Science), 1-10. https://doi.org/10.1109/e-Science58273.2023.10254871
  • (2023) IO-Sets: Simple and Efficient Approaches for I/O Bandwidth Management. IEEE Transactions on Parallel and Distributed Systems 34(10), 2783-2796. https://doi.org/10.1109/TPDS.2023.3305028
  • (2023) Optimizing Logging and Monitoring in Heterogeneous Cloud Environments for IoT and Edge Applications. IEEE Internet of Things Journal 10(24), 22611-22622. https://doi.org/10.1109/JIOT.2023.3304373
  • (2023) IOScout: an I/O Characteristics Prediction Method for the Supercomputer Jobs. 2023 IEEE 3rd International Conference on Computer Communication and Artificial Intelligence (CCAI), 205-210. https://doi.org/10.1109/CCAI57533.2023.10201270
