
Horus: Interference-Aware and Prediction-Based Scheduling in Deep Learning Systems

Published: 01 January 2022

Abstract

To accelerate the training of Deep Learning (DL) models, clusters of machines equipped with hardware accelerators such as GPUs are leveraged to reduce execution time. State-of-the-art resource managers are needed to increase GPU utilization and maximize throughput. While co-locating DL jobs on the same GPU has been shown to be effective, co-location can incur interference that slows jobs down. In this article, we propose Horus: an interference-aware and prediction-based resource manager for DL systems. Horus proactively predicts the GPU utilization of heterogeneous DL jobs from features of the DL model’s computation graph, removing the need for online profiling and isolated reserved GPUs. Through micro-benchmarks and job co-location combinations across heterogeneous GPU hardware, we identify GPU utilization as a general proxy metric for making good placement decisions, in contrast to current approaches that reserve isolated GPUs to profile each submitted job online and measure its GPU utilization directly. Our approach promotes high resource utilization and makespan reduction; through real-world experimentation and large-scale trace-driven simulation, we demonstrate that Horus outperforms other DL resource managers by up to 61.5 percent in GPU resource utilization, 23.7–30.7 percent in makespan reduction, and 68.3 percent in job wait-time reduction.
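As a brief illustration of the mechanism the abstract describes, the Python sketch below shows the general shape of prediction-based, interference-aware placement: a regressor trained offline maps computation-graph features to predicted GPU utilization, and a job is co-located only on a GPU where the combined predicted utilization stays within capacity. This is a minimal sketch under stated assumptions, not the paper's implementation: the feature set, the gradient-boosting regressor, the toy training data, and the 100 percent capacity threshold are all illustrative.

    # Sketch only: feature names, model choice, and thresholds are assumptions,
    # not the Horus implementation.
    from dataclasses import dataclass
    from typing import List
    from sklearn.ensemble import GradientBoostingRegressor

    @dataclass
    class Job:
        name: str
        features: List[float]  # hypothetical: [num_ops, total_flops, num_params, batch_size]

    # Offline step: fit a regressor on (graph features -> measured GPU utilization %)
    # pairs. The three samples below are stand-ins for real profiling data.
    X_train = [[120, 3.1e9, 5.2e6, 32],
               [310, 9.8e9, 2.5e7, 64],
               [80, 1.2e9, 1.1e6, 16]]
    y_train = [38.0, 71.0, 22.0]  # measured GPU utilization (%)
    predictor = GradientBoostingRegressor().fit(X_train, y_train)

    def predict_utilization(job: Job) -> float:
        """Predict a job's GPU utilization (%) from computation-graph features,
        with no online profiling of the submitted job itself."""
        return float(predictor.predict([job.features])[0])

    def place(job: Job, gpu_loads: List[float], capacity: float = 100.0) -> int:
        """Interference-aware placement: choose the GPU where the job fits most
        tightly (best-fit packing); return -1 to queue the job if no GPU fits."""
        u = predict_utilization(job)
        feasible = [(load + u, i) for i, load in enumerate(gpu_loads) if load + u <= capacity]
        if not feasible:
            return -1  # co-locating anywhere would oversubscribe a GPU
        _, best_gpu = max(feasible)  # highest combined load that still fits
        return best_gpu

For example, with gpu_loads = [60.0, 20.0] and a job whose predicted utilization is 35 percent, place returns GPU 0 (95 percent combined), packing the job onto the tightest feasible GPU; a return of -1 corresponds to queueing the job to avoid interference.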



Information

Published In

IEEE Transactions on Parallel and Distributed Systems, Volume 33, Issue 1
Jan. 2022
136 pages

Publisher

IEEE Press

Publication History

Published: 01 January 2022

Qualifiers

  • Research-article



Cited By

  • (2024) Parallel Task Scheduling in Autonomous Robotic Systems: An Event-Driven Multimodal Prediction Approach. Proceedings of the 53rd International Conference on Parallel Processing, 742–751. DOI: 10.1145/3673038.3673147. Online: 12-Aug-2024
  • (2024) An Analysis of Collocation on GPUs for Deep Learning Training. Proceedings of the 4th Workshop on Machine Learning and Systems, 81–90. DOI: 10.1145/3642970.3655827. Online: 22-Apr-2024
  • (2024) ETS: Deep Learning Training Iteration Time Prediction based on Execution Trace Sliding Window. Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 56–68. DOI: 10.1145/3625549.3658658. Online: 3-Jun-2024
  • (2024) DProbe: Profiling and Predicting Multi-tenant Deep Learning Workloads for GPU Resource Scaling. Euro-Par 2024: Parallel Processing, 239–253. DOI: 10.1007/978-3-031-69577-3_17. Online: 26-Aug-2024
  • (2023) Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on Edge GPU. Proceedings of the 21st ACM Conference on Embedded Networked Sensor Systems, 97–110. DOI: 10.1145/3625687.3625789. Online: 12-Nov-2023
  • (2023) Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 457–472. DOI: 10.1145/3575693.3575705. Online: 27-Jan-2023
  • (2023) DOPpler: Parallel Measurement Infrastructure for Auto-Tuning Deep Learning Tensor Programs. IEEE Transactions on Parallel and Distributed Systems 34(7), 2208–2220. DOI: 10.1109/TPDS.2023.3279233. Online: 1-Jul-2023
  • (2023) BisSiam: Bispectrum Siamese Network Based Contrastive Learning for UAV Anomaly Detection. IEEE Transactions on Knowledge and Data Engineering 35(12), 12109–12124. DOI: 10.1109/TKDE.2021.3118727. Online: 1-Dec-2023
  • (2023) TBDB: Token Bucket-Based Dynamic Batching for Resource Scheduling Supporting Neural Network Inference in Intelligent Consumer Electronics. IEEE Transactions on Consumer Electronics 70(1), 1134–1144. DOI: 10.1109/TCE.2023.3339633. Online: 5-Dec-2023
  • (2023) Prediction-based scheduling techniques for cloud data center’s workload: a systematic review. Cluster Computing 26(5), 3209–3235. DOI: 10.1007/s10586-023-04024-8. Online: 18-May-2023
