
Horus: Interference-Aware and Prediction-Based Scheduling in Deep Learning Systems

Published: 01 January 2022

Abstract

To accelerate the training of Deep Learning (DL) models, clusters of machines equipped with hardware accelerators such as GPUs are leveraged to reduce execution time. State-of-the-art resource managers are needed to increase GPU utilization and maximize throughput. While co-locating DL jobs on the same GPU has been shown to be effective, co-location can incur interference that slows jobs down. In this article, we propose Horus: an interference-aware and prediction-based resource manager for DL systems. Horus proactively predicts the GPU utilization of heterogeneous DL jobs from features of the DL model’s computation graph, removing the need for online profiling and isolated reserved GPUs. Through micro-benchmarks and job co-location combinations across heterogeneous GPU hardware, we identify GPU utilization as a general proxy metric for making good placement decisions, in contrast to current approaches that reserve isolated GPUs to profile each submitted job online and measure its GPU utilization directly. Our approach promotes high resource utilization and makespan reduction; through real-world experimentation and large-scale trace-driven simulation, we demonstrate that Horus outperforms other DL resource managers by up to 61.5 percent in GPU resource utilization, 23.7–30.7 percent in makespan reduction, and 68.3 percent in job wait-time reduction.
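As a brief illustration of the mechanism the abstract describes, the Python sketch below shows the general shape of prediction-based, interference-aware placement: a regressor trained offline maps computation-graph features to predicted GPU utilization, and a job is co-located only on a GPU where the combined predicted utilization stays within capacity. This is a minimal sketch under stated assumptions, not the paper's implementation: the feature set, the gradient-boosting regressor, the toy training data, and the 100 percent capacity threshold are all illustrative.

    # Sketch only: feature names, model choice, and thresholds are assumptions,
    # not the Horus implementation.
    from dataclasses import dataclass
    from typing import List
    from sklearn.ensemble import GradientBoostingRegressor

    @dataclass
    class Job:
        name: str
        features: List[float]  # hypothetical: [num_ops, total_flops, num_params, batch_size]

    # Offline step: fit a regressor on (graph features -> measured GPU utilization %)
    # pairs. The three samples below are stand-ins for real profiling data.
    X_train = [[120, 3.1e9, 5.2e6, 32],
               [310, 9.8e9, 2.5e7, 64],
               [80, 1.2e9, 1.1e6, 16]]
    y_train = [38.0, 71.0, 22.0]  # measured GPU utilization (%)
    predictor = GradientBoostingRegressor().fit(X_train, y_train)

    def predict_utilization(job: Job) -> float:
        """Predict a job's GPU utilization (%) from computation-graph features,
        with no online profiling of the submitted job itself."""
        return float(predictor.predict([job.features])[0])

    def place(job: Job, gpu_loads: List[float], capacity: float = 100.0) -> int:
        """Interference-aware placement: choose the GPU where the job fits most
        tightly (best-fit packing); return -1 to queue the job if no GPU fits."""
        u = predict_utilization(job)
        feasible = [(load + u, i) for i, load in enumerate(gpu_loads) if load + u <= capacity]
        if not feasible:
            return -1  # co-locating anywhere would oversubscribe a GPU
        _, best_gpu = max(feasible)  # highest combined load that still fits
        return best_gpu

For example, with gpu_loads = [60.0, 20.0] and a job whose predicted utilization is 35 percent, place returns GPU 0 (95 percent combined), packing the job onto the tightest feasible GPU; a return of -1 corresponds to queueing the job to avoid interference.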



Information

Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 33, Issue 1
Jan. 2022
136 pages

Publisher

IEEE Press

Publication History

Published: 01 January 2022

Qualifiers

  • Research-article



Cited By

  • (2024) Parallel Task Scheduling in Autonomous Robotic Systems: An Event-Driven Multimodal Prediction Approach. Proceedings of the 53rd International Conference on Parallel Processing, 742–751. DOI: 10.1145/3673038.3673147. Online: 12-Aug-2024
  • (2024) An Analysis of Collocation on GPUs for Deep Learning Training. Proceedings of the 4th Workshop on Machine Learning and Systems, 81–90. DOI: 10.1145/3642970.3655827. Online: 22-Apr-2024
  • (2024) ETS: Deep Learning Training Iteration Time Prediction based on Execution Trace Sliding Window. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 56–68. DOI: 10.1145/3625549.3658658. Online: 3-Jun-2024
  • (2024) DProbe: Profiling and Predicting Multi-tenant Deep Learning Workloads for GPU Resource Scaling. Euro-Par 2024: Parallel Processing, 239–253. DOI: 10.1007/978-3-031-69577-3_17. Online: 26-Aug-2024
  • (2023) Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU. Proceedings of the 21st ACM Conference on Embedded Networked Sensor Systems, 97–110. DOI: 10.1145/3625687.3625789. Online: 12-Nov-2023
  • (2023) Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 457–472. DOI: 10.1145/3575693.3575705. Online: 27-Jan-2023
  • (2023) DOPpler: Parallel Measurement Infrastructure for Auto-Tuning Deep Learning Tensor Programs. IEEE Transactions on Parallel and Distributed Systems 34(7), 2208–2220. DOI: 10.1109/TPDS.2023.3279233. Online: 1-Jul-2023
  • (2023) BisSiam: Bispectrum Siamese Network Based Contrastive Learning for UAV Anomaly Detection. IEEE Transactions on Knowledge and Data Engineering 35(12), 12109–12124. DOI: 10.1109/TKDE.2021.3118727. Online: 1-Dec-2023
  • (2023) TBDB: Token Bucket-Based Dynamic Batching for Resource Scheduling Supporting Neural Network Inference in Intelligent Consumer Electronics. IEEE Transactions on Consumer Electronics 70(1), 1134–1144. DOI: 10.1109/TCE.2023.3339633. Online: 5-Dec-2023
  • (2023) Prediction-based scheduling techniques for cloud data center’s workload: a systematic review. Cluster Computing 26(5), 3209–3235. DOI: 10.1007/s10586-023-04024-8. Online: 18-May-2023
