DOI: 10.1145/3651890.3672239
Research article

Crux: GPU-Efficient Communication Scheduling for Deep Learning Training

Published: 04 August 2024

Abstract

Deep learning training (DLT), e.g., large language model (LLM) training, has become one of the most important services in multitenant cloud computing. By deeply studying in-production DLT jobs, we observed that communication contention among different DLT jobs severely degrades overall GPU computation utilization, lowering the efficiency of the training cluster. In this paper, we present Crux, a communication scheduler that aims to maximize GPU computation utilization by mitigating communication contention among DLT jobs. Maximizing GPU computation utilization for DLT, however, is NP-complete; we therefore formulate and prove a novel theorem that approaches this goal via GPU intensity-aware communication scheduling. Building on it, we propose an approach that prioritizes DLT flows with high GPU computation intensity, reducing potential communication contention. Our 96-GPU testbed experiments show that Crux improves GPU computation utilization by 8.3% to 14.8%. Large-scale simulations driven by production traces further show that Crux increases GPU computation utilization by up to 23% compared with alternatives including Sincronia, TACCL, and CASSINI.
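The core idea above can be illustrated with a minimal sketch: rank jobs by a GPU-intensity metric and map higher-intensity jobs to higher network priority classes. The intensity formula (GPUs occupied per second of per-iteration communication), the class names, and the fixed eight priority classes are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of GPU intensity-aware flow prioritization. The metric and the
# mapping to priority classes are simplified assumptions for illustration.
from dataclasses import dataclass

@dataclass
class DLTJob:
    name: str
    gpus: int              # number of GPUs the job occupies
    comm_seconds: float    # per-iteration time spent communicating

def gpu_intensity(job: DLTJob) -> float:
    """GPU computation put at risk per second of blocked communication."""
    return job.gpus / job.comm_seconds

def assign_priorities(jobs, num_classes=8):
    """Map jobs onto a fixed number of network priority classes:
    higher GPU intensity -> smaller class index (0 is highest priority)."""
    ranked = sorted(jobs, key=gpu_intensity, reverse=True)
    return {
        job.name: min(i * num_classes // max(len(ranked), 1), num_classes - 1)
        for i, job in enumerate(ranked)
    }

jobs = [
    DLTJob("llm-pretrain", gpus=512, comm_seconds=2.0),  # intensity 256
    DLTJob("rec-model",    gpus=64,  comm_seconds=4.0),  # intensity 16
    DLTJob("small-ft",     gpus=8,   comm_seconds=1.0),  # intensity 8
]
prio = assign_priorities(jobs)
```

Under this sketch, the 512-GPU pretraining job lands in the highest priority class, so its flows are served first when they contend with the smaller jobs' traffic, stalling fewer GPUs overall.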

References

[1]
2022. AMD uProf. https://www.amd.com/en/developer/uprof.html.
[2]
2022. Equal-cost multi-path routing (ECMP). https://en.wikipedia.org/wiki/Equal-cost_multi-path_routing.
[3]
2022. Intel Performance Counter Monitor. https://github.com/intel/pcm.
[4]
2022. NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl.
[5]
2022. PyTorch. https://pytorch.org/.
[6]
2022. RDMA over Converged Ethernet. https://en.wikipedia.org/wiki/RDMA_over_Converged_Ethernet.
[7]
2022. X-DeepLearning. https://github.com/alibaba/x-deeplearning.
[8]
2023. Adobe Firefly. https://www.adobe.com/sensei/generative-ai/firefly.html.
[9]
2023. Alibaba GPU Cluster Trace 2023. https://github.com/alibaba/alibaba-lingjun-dataset-2023.
[10]
2023. Breadth-first search. https://en.wikipedia.org/wiki/Breadth-first_search.
[11]
2023. Github Copilot. https://github.com/features/copilot.
[12]
2023. Microsoft365. https://www.microsoft.com/en-us/microsoft-365.
[13]
2024. Megatron GPT-3 model example. https://github.com/NVIDIA/Megatron-LM/tree/main/examples/gpt3.
[14]
2024. Multi-commodity flow problem. https://en.wikipedia.org/wiki/Multi-commodity_flow_problem.
[15]
Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In OSDI.
[16]
Saksham Agarwal, Shijin Rajakrishnan, Akshay Narayan, Rachit Agarwal, David B. Shmoys, and Amin Vahdat. 2018. Sincronia: near-optimal network design for coflows. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, SIGCOMM 2018, Budapest, Hungary, August 20--25, 2018. ACM, 16--29.
[17]
Wei Bai, Kai Chen, Hao Wang, Li Chen, Dongsu Han, and Chen Tian. 2015. Information-Agnostic Flow Scheduling for Commodity Data Centers. In 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI 15, Oakland, CA, USA, May 4--6, 2015. USENIX Association, 455--468. https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/bai
[18]
Ran Ben Basat, Sivaramakrishnan Ramanathan, Yuliang Li, Gianni Antichi, Minlan Yu, and Michael Mitzenmacher. 2020. PINT: Probabilistic In-band Network Telemetry. In SIGCOMM '20: Proceedings of the 2020 Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication, Virtual Event, USA, August 10--14, 2020. ACM, 662--680.
[19]
Li Chen, Kai Chen, Wei Bai, and Mohammad Alizadeh. 2016. Scheduling Mix-flows in Commodity Datacenters with Karuna. In Proceedings of the ACM SIGCOMM 2016 Conference, Florianopolis, Brazil, August 22--26, 2016. ACM, 174--187.
[20]
Quan Chen, Hailong Yang, Jason Mars, and Lingjia Tang. 2016. Baymax: QoS Awareness and Increased Utilization for Non-Preemptive Accelerators in Warehouse Scale Computers. In ASPLOS. 681--696.
[21]
Yangrui Chen, Yanghua Peng, Yixin Bao, Chuan Wu, Yibo Zhu, and Chuanxiong Guo. 2020. Elastic parameter server load distribution in deep learning clusters. In SoCC '20: ACM Symposium on Cloud Computing, Virtual Event, USA, October 19--21, 2020. ACM, 507--521.
[22]
Mosharaf Chowdhury and Ion Stoica. 2012. Coflow: a networking abstraction for cluster applications. In 11th ACM Workshop on Hot Topics in Networks, HotNets-XI, Redmond, WA, USA - October 29 - 30, 2012. ACM, 31--36.
[23]
Mosharaf Chowdhury and Ion Stoica. 2015. Efficient Coflow Scheduling Without Prior Knowledge. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, SIGCOMM 2015, London, United Kingdom, August 17--21, 2015. ACM, 393--406.
[24]
Mosharaf Chowdhury, Matei Zaharia, Justin Ma, Michael I. Jordan, and Ion Stoica. 2011. Managing data transfers in computer clusters with orchestra. In Proceedings of the ACM SIGCOMM 2011 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Toronto, ON, Canada, August 15--19, 2011. ACM, 98--109.
[25]
Mosharaf Chowdhury, Yuan Zhong, and Ion Stoica. 2014. Efficient coflow scheduling with Varys. In ACM SIGCOMM 2014 Conference, SIGCOMM'14, Chicago, IL, USA, August 17--22, 2014. ACM, 443--454.
[26]
Fahad R. Dogar, Thomas Karagiannis, Hitesh Ballani, and Antony I. T. Rowstron. 2014. Decentralized task-aware scheduling for data center networks. In ACM SIGCOMM 2014 Conference, SIGCOMM'14, Chicago, IL, USA, August 17--22, 2014. ACM, 431--442.
[27]
Jianbo Dong, Zheng Cao, Tao Zhang, Jianxi Ye, Shaochuang Wang, Fei Feng, Li Zhao, Xiaoyong Liu, Liuyihan Song, Liwei Peng, Yiqun Guo, Xiaowei Jiang, Lingbo Tang, Yin Du, Yingya Zhang, Pan Pan, and Yuan Xie. 2020. EFLOPS: Algorithm and System Co-Design for a High Performance Distributed Training Platform. In IEEE International Symposium on High Performance Computer Architecture, HPCA 2020, San Diego, CA, USA, February 22--26, 2020. IEEE, 610--622.
[28]
Jianbo Dong, Shaochuang Wang, Fei Feng, Zheng Cao, Heng Pan, Lingbo Tang, Pengcheng Li, Hao Li, Qianyuan Ran, Yiqun Guo, Shanyuan Gao, Xin Long, Jie Zhang, Yong Li, Zhisheng Xia, Liuyihan Song, Yingya Zhang, Pan Pan, Guohui Wang, and Xiaowei Jiang. 2021. ACCL: Architecting Highly Scalable Distributed Training Systems With Highly Efficient Collective Communication Library. IEEE Micro 41, 5 (2021), 85--92.
[29]
Adam Dunkels, Richard Gold, Sergio Angel Marti, Arnold Pears, and Mats Uddenfeldt. 2005. Janus: An Architecture for Flexible Access to Sensor Networks. In Proceedings of the 1st ACM Workshop on Dynamic Interconnection of Networks (Cologne, Germany) (DIN '05). Association for Computing Machinery, New York, NY, USA, 48--52.
[30]
Robert Grandl, Ganesh Ananthanarayanan, Srikanth Kandula, Sriram Rao, and Aditya Akella. 2014. Multi-resource packing for cluster schedulers. In SIGCOMM. 455--466.
[31]
Robert Grandl, Srikanth Kandula, Sriram Rao, Aditya Akella, and Janardhan Kulkarni. 2016. GRAPHENE: Packing and Dependency-Aware Scheduling for Data-Parallel Clusters. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2--4, 2016. USENIX Association, 81--97. https://www.usenix.org/conference/osdi16/technical-sessions/presentation/grandl_graphene
[32]
Albert G. Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A. Maltz, Parveen Patel, and Sudipta Sengupta. 2009. VL2: a scalable and flexible data center network. In Proceedings of the ACM SIGCOMM 2009 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Barcelona, Spain, August 16--21, 2009. ACM, 51--62.
[33]
Juncheng Gu, Mosharaf Chowdhury, Kang G. Shin, Yibo Zhu, Myeongjae Jeon, Junjie Qian, Hongqiang Harry Liu, and Chuanxiong Guo. 2019. Tiresias: A GPU Cluster Manager for Distributed Deep Learning. In 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019, Boston, MA, February 26--28, 2019. USENIX Association, 485--500. https://www.usenix.org/conference/nsdi19/presentation/gu
[34]
Roger W. Hockney. 1994. The Communication Challenge for MPP: Intel Paragon and Meiko CS-2. Parallel Comput. 20, 3 (1994), 389--398.
[35]
Xin Sunny Huang, Yiting Xia, and T. S. Eugene Ng. 2020. Weaver: Efficient Coflow Scheduling in Heterogeneous Parallel Networks. In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), New Orleans, LA, USA, May 18--22, 2020. IEEE, 1071--1081.
[36]
Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, and Fan Yang. 2019. Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads. In 2019 USENIX Annual Technical Conference, USENIX ATC 2019, Renton, WA, USA, July 10--12, 2019. USENIX Association, 947--960. https://www.usenix.org/conference/atc19/presentation/jeon
[37]
Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, and Chuanxiong Guo. 2020. A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters. In 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2020, Virtual Event, November 4--6, 2020. USENIX Association, 463--479. https://www.usenix.org/conference/osdi20/presentation/jiang
[38]
Sangeetha Abdu Jyothi, Sayed Hadi Hashemi, Roy H. Campbell, and Brighten Godfrey. 2020. Towards An Application Objective-Aware Network Interface. In 12th USENIX Workshop on Hot Topics in Cloud Computing, HotCloud 2020, July 13--14, 2020. USENIX Association. https://www.usenix.org/conference/hotcloud20/presentation/jyothi
[39]
Changhoon Kim, Anirudh Sivaraman, Naga Katta, Antonin Bas, Advait Dixit, and Lawrence J Wobker. 2015. In-band network telemetry via programmable dataplanes. In SIGCOMM.
[40]
Taesup Kim, Inchul Song, and Yoshua Bengio. 2017. Dynamic Layer Normalization for Adaptive Neural Acoustic Modeling in Speech Recognition. In Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20--24, 2017. ISCA, 2411--2415.
[41]
ChonLam Lao, Yanfang Le, Kshiteej Mahajan, Yixi Chen, Wenfei Wu, Aditya Akella, and Michael M. Swift. 2021. ATP: In-network Aggregation for Multitenant Learning. In 18th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2021, April 12--14, 2021. USENIX Association, 741--761. https://www.usenix.org/conference/nsdi21/presentation/lao
[42]
Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. 2023. Accelerating Distributed MoE Training and Inference with Lina. In 2023 USENIX Annual Technical Conference, USENIX ATC 2023, Boston, MA, USA, July 10--12, 2023. USENIX Association, 945--959. https://www.usenix.org/conference/atc23/presentation/li-jiamin
[43]
Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng, Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Alizadeh, and Minlan Yu. 2019. HPCC: high precision congestion control. In Proceedings of the ACM Special Interest Group on Data Communication, SIGCOMM 2019, Beijing, China, August 19--23, 2019. ACM, 44--58.
[44]
Kefei Liu, Zhuo Jiang, Jiao Zhang, Haoran Wei, Xiaolong Zhong, Lizhuang Tan, Tian Pan, and Tao Huang. 2023. Hostping: Diagnosing Intra-host Network Bottlenecks in RDMA Servers. In 20th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2023, Boston, MA, April 17--19, 2023. USENIX Association, 15--29. https://www.usenix.org/conference/nsdi23/presentation/liu-kefei
[45]
Tianfeng Liu, Yangrui Chen, Dan Li, Chuan Wu, Yibo Zhu, Jun He, Yanghua Peng, Hongzheng Chen, Hongzhi Chen, and Chuanxiong Guo. 2023. BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing. In 20th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2023, Boston, MA, April 17--19, 2023. USENIX Association, 103--118. https://www.usenix.org/conference/nsdi23/presentation/liu-tianfeng
[46]
Kshiteej Mahajan, Arjun Balasubramanian, Arjun Singhvi, Shivaram Venkataraman, Aditya Akella, Amar Phanishayee, and Shuchi Chawla. 2020. Themis: Fair and Efficient GPU Cluster Scheduling. In 17th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2020, Santa Clara, CA, USA, February 25--27, 2020. USENIX Association, 289--304. https://www.usenix.org/conference/nsdi20/presentation/mahajan
[47]
Kshiteej Mahajan, Ching-Hsiang Chu, Srinivas Sridharan, and Aditya Akella. 2023. Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning using SYNDICATE. In 20th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2023, Boston, MA, April 17--19, 2023. USENIX Association, 809--824. https://www.usenix.org/conference/nsdi23/presentation/mahajan
[48]
Rui Pan, Yiming Lei, Jialong Li, Zhiqiang Xie, Binhang Yuan, and Yiting Xia. 2022. Efficient flow scheduling in distributed deep learning training with echelon formation. In Proceedings of the 21st ACM Workshop on Hot Topics in Networks, HotNets 2022, Austin, Texas, November 14--15, 2022. ACM, 93--100.
[49]
Pitch Patarasuk and Xin Yuan. 2009. Bandwidth optimal all-reduce algorithms for clusters of workstations. J. Parallel Distributed Comput. 69, 2 (2009), 117--124.
[50]
Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang Lan, Chuan Wu, and Chuanxiong Guo. 2019. A generic communication scheduler for distributed DNN training acceleration. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP 2019, Huntsville, ON, Canada, October 27--30, 2019. ACM, 16--29.
[51]
Sudarsanan Rajasekaran, Manya Ghobadi, and Aditya Akella. 2024. CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters. In NSDI. https://www.usenix.org/conference/nsdi24/presentation/rajasekaran
[52]
Sudarsanan Rajasekaran, Manya Ghobadi, Gautam Kumar, and Aditya Akella. 2022. Congestion control in machine learning clusters. In Proceedings of the 21st ACM Workshop on Hot Topics in Networks, HotNets 2022, Austin, Texas, November 14--15, 2022. ACM, 235--242.
[53]
Aashaka Shah, Vijay Chidambaram, Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, and Olli Saarikivi. 2023. TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches. In 20th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2023, Boston, MA, April 17--19, 2023. USENIX Association, 593--612. https://www.usenix.org/conference/nsdi23/presentation/shah
[54]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4--9, 2017, Long Beach, CA, USA. 5998--6008. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
[55]
Weitao Wang, Sushovan Das, Xinyu Crystal Wu, Zhuang Wang, Ang Chen, and T. S. Eugene Ng. 2021. MXDAG: A Hybrid Abstraction for Cluster Applications. CoRR abs/2107.07442 (2021). arXiv:2107.07442 https://arxiv.org/abs/2107.07442
[56]
Weitao Wang, Masoud Moshref, Yuliang Li, Gautam Kumar, T. S. Eugene Ng, Neal Cardwell, and Nandita Dukkipati. 2023. Poseidon: Efficient, Robust, and Practical Datacenter CC via Deployable INT. In 20th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2023, Boston, MA, April 17--19, 2023. USENIX Association, 255--274. https://www.usenix.org/conference/nsdi23/presentation/wang-weitao
[57]
Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, Fan Yang, and Lidong Zhou. 2018. Gandiva: Introspective Cluster Scheduling for Deep Learning. In 13th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2018, Carlsbad, CA, USA, October 8--10, 2018. USENIX Association, 595--610. https://www.usenix.org/conference/osdi18/presentation/xiao
[58]
Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, and Yangqing Jia. 2020. AntMan: Dynamic Scaling on GPU Clusters for Deep Learning. In 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2020, Virtual Event, November 4--6, 2020. USENIX Association, 533--548. https://www.usenix.org/conference/osdi20/presentation/xiao
[59]
F. Frances Yao. 1980. Efficient Dynamic Programming Using Quadrangle Inequalities. In Proceedings of the 12th Annual ACM Symposium on Theory of Computing, April 28--30, 1980, Los Angeles, California, USA. ACM, 429--435.
[60]
Hong Zhang, Kai Chen, Wei Bai, Dongsu Han, Chen Tian, Hao Wang, Haibing Guan, and Ming Zhang. 2015. Guaranteeing deadlines for inter-datacenter transfers. In Proceedings of the Tenth European Conference on Computer Systems, EuroSys 2015, Bordeaux, France, April 21--24, 2015. ACM, 20:1--20:14.
[61]
Hanyu Zhao, Zhenhua Han, Zhi Yang, Quanlu Zhang, Fan Yang, Lidong Zhou, Mao Yang, Francis C. M. Lau, Yuqi Wang, Yifan Xiong, and Bin Wang. 2020. HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees. In 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2020, Virtual Event, November 4--6, 2020. USENIX Association, 515--532. https://www.usenix.org/conference/osdi20/presentation/zhao-hanyu
[62]
Mark Zhao, Niket Agarwal, Aarti Basant, Bugra Gedik, Satadru Pan, Mustafa Ozdal, Rakesh Komuravelli, Jerry Pan, Tianshu Bao, Haowei Lu, Sundaram Narayanan, Jack Langman, Kevin Wilfong, Harsha Rastogi, Carole-Jean Wu, Christos Kozyrakis, and Parik Pol. 2022. Understanding data storage and ingestion for large-scale deep recommendation model training: industrial product. In ISCA '22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18 - 22, 2022. ACM, 1042--1057.
[63]
Yihao Zhao, Yuanqiang Liu, Yanghua Peng, Yibo Zhu, Xuanzhe Liu, and Xin Jin. 2022. Multi-resource interleaving for deep learning training. In SIGCOMM '22: ACM SIGCOMM 2022 Conference, Amsterdam, The Netherlands, August 22 - 26, 2022. ACM, 428--440.
[64]
Pengfei Zheng, Rui Pan, Tarannum Khan, Shivaram Venkataraman, and Aditya Akella. 2023. Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning. In 20th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2023, Boston, MA, April 17--19, 2023. USENIX Association, 703--723. https://www.usenix.org/conference/nsdi23/presentation/zheng

Published In

ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference
August 2024
1033 pages
ISBN:9798400706141
DOI:10.1145/3651890

Publisher

Association for Computing Machinery

New York, NY, United States


Badges

  • Honorable Mention

Author Tags

  1. communication scheduling
  2. data center network
  3. deep learning

Conference

ACM SIGCOMM '24: ACM SIGCOMM 2024 Conference
August 4--8, 2024
Sydney, NSW, Australia

Acceptance Rates

Overall acceptance rate: 462 of 3,389 submissions (14%)

