DOI: 10.5555/3635637.3662914

Analysing the Sample Complexity of Opponent Shaping

Published: 06 May 2024

Abstract

Learning in general-sum games often yields collectively sub-optimal results. Addressing this, opponent shaping (OS) methods actively guide the learning processes of other agents, empirically leading to improved individual and group performance. Early OS methods use higher-order derivatives to shape the learning of co-players, which makes them unsuitable for shaping multiple learning steps. Follow-up work, Model-Free Opponent Shaping (M-FOS), addresses the shortcomings of earlier OS methods by reframing the OS problem as a meta-game. In contrast to early OS methods, however, there is little theoretical understanding of the M-FOS framework. Providing theoretical guarantees for M-FOS is hard because (A) there is little literature on theoretical sample-complexity bounds for meta-reinforcement learning, and (B) M-FOS operates in continuous state and action spaces, which makes theoretical analysis challenging. In this work, we present R-FOS, a tabular version of M-FOS that is more amenable to theoretical analysis. R-FOS discretises the continuous meta-game MDP into a tabular MDP. Within this discretised MDP, we adapt the R-max algorithm, most prominently used to derive PAC bounds for MDPs, as the meta-learner in the R-FOS algorithm. We derive a sample-complexity bound that is exponential in the cardinality of the inner state and action spaces and in the number of agents. Our bound guarantees that, with high probability, the policy learned by an R-FOS agent is close to the optimal policy, up to a constant factor. Finally, we investigate how R-FOS's sample complexity scales with the size of the state-action space. Our theoretical scaling results are supported empirically in the Matching Pennies environment.
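To make the discretisation idea concrete, the sketch below shows one way an R-max-style meta-learner can operate over a discretised meta-game: the co-player's continuous policy parameters (the meta-state) are snapped onto an ε-grid, and a tabular R-max learner plans optimistically over the resulting finite meta-MDP. This is a minimal illustrative sketch in Python, not the authors' R-FOS implementation; the grid resolution eps, the finite meta-action set, and the known_threshold parameter are assumptions made for clarity.

import numpy as np

def discretise(theta, eps=0.1):
    # Snap continuous policy parameters in [0, 1]^d onto the nearest eps-grid point,
    # turning the continuous meta-state into a hashable tabular state (assumption).
    return tuple(np.round(np.asarray(theta, dtype=float) / eps).astype(int))

class RMaxMetaLearner:
    # Tabular R-max (Brafman & Tennenholtz) used as the meta-learner over the
    # discretised meta-MDP; a sketch, not the exact R-FOS procedure.
    def __init__(self, actions, r_max, known_threshold=10, gamma=0.95):
        self.actions = list(actions)   # finite set of meta-actions (assumed given)
        self.r_max = r_max             # optimistic reward for unknown state-action pairs
        self.m = known_threshold       # visits before a pair counts as "known"
        self.gamma = gamma
        self.counts = {}               # (s, a) -> visit count
        self.reward_sum = {}           # (s, a) -> accumulated meta-reward
        self.next_counts = {}          # (s, a) -> {s': count}
        self.q = {}                    # (s, a) -> optimistic Q-value

    def q_value(self, s, a):
        # Unknown pairs keep the optimistic value R_max / (1 - gamma).
        return self.q.get((s, a), self.r_max / (1.0 - self.gamma))

    def act(self, s):
        # Act greedily with respect to the optimistic Q-values.
        return max(self.actions, key=lambda a: self.q_value(s, a))

    def observe(self, s, a, r, s_next, sweeps=50):
        # Record the transition, then re-solve the optimistic empirical model by
        # value iteration restricted to "known" state-action pairs.
        key = (s, a)
        self.counts[key] = self.counts.get(key, 0) + 1
        self.reward_sum[key] = self.reward_sum.get(key, 0.0) + r
        self.next_counts.setdefault(key, {})
        self.next_counts[key][s_next] = self.next_counts[key].get(s_next, 0) + 1
        known = [k for k, c in self.counts.items() if c >= self.m]
        for _ in range(sweeps):
            for (s_k, a_k) in known:
                n = self.counts[(s_k, a_k)]
                r_hat = self.reward_sum[(s_k, a_k)] / n
                bootstrap = sum(
                    (c / n) * max(self.q_value(s2, a2) for a2 in self.actions)
                    for s2, c in self.next_counts[(s_k, a_k)].items()
                )
                self.q[(s_k, a_k)] = r_hat + self.gamma * bootstrap

In a Matching Pennies meta-game, for example, the meta-state could be the discretised co-player policy (a single probability of playing heads) and each meta-action could select the shaping agent's inner policy for the next episode; these environment details are illustrative assumptions rather than the paper's exact setup.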

Published In

AAMAS '24: Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems
May 2024
2898 pages
ISBN: 9798400704864

Publisher

International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC

Publication History

Published: 06 May 2024

Author Tags

  1. meta reinforcement learning
  2. multi-agent
  3. opponent shaping
  4. reinforcement learning
  5. sample complexity

Qualifiers

  • Research-article

Funding Sources

  • Cohere
  • D.H. Chen Foundation
  • Armasuisse

Conference

AAMAS '24

Acceptance Rates

Overall Acceptance Rate 1,155 of 5,036 submissions, 23%

