DOI: 10.5555/3635637.3662910

Potential-Based Reward Shaping for Intrinsic Motivation

Published: 06 May 2024

Abstract

Recently, there has been a proliferation of intrinsic motivation (IM) reward-shaping methods for learning in complex, sparse-reward environments. These methods can often inadvertently change the set of optimal policies in an environment, leading to suboptimal behavior. Previous work on mitigating the risks of reward shaping, particularly through potential-based reward shaping (PBRS), has not been applicable to many IM methods, as they are often complex, trainable functions themselves and therefore depend on a wider set of variables than the traditional reward functions for which PBRS was developed. We present an extension to PBRS that we prove preserves the set of optimal policies under a more general class of functions than has previously been proven. We also present Potential-Based Intrinsic Motivation (PBIM), a method for converting IM rewards into a potential-based form that is usable without altering the set of optimal policies. Testing in the MiniGrid DoorKey and Cliff Walking environments, we demonstrate that PBIM successfully prevents the agent from converging to a suboptimal policy and can speed up training.
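For context (this illustration is ours, not the paper's): in classical PBRS, due to Ng, Harada, and Russell (1999), the shaping term added to the environment reward is F(s, s') = gamma * Phi(s') - Phi(s) for a state-dependent potential Phi, and the telescoping of this term along any trajectory is what guarantees the set of optimal policies is preserved. The Python sketch below shows only that classical scheme; the toy potential and the names phi and shaped_reward are illustrative assumptions, and the paper's PBIM conversion for trainable intrinsic-reward functions goes beyond what is sketched here.

    # Minimal sketch of classical potential-based reward shaping (PBRS),
    # following Ng et al. (1999): F(s, s') = gamma * Phi(s') - Phi(s).
    # The potential below is a hand-picked toy, not the paper's PBIM method.

    GAMMA = 0.99

    def phi(state):
        """Toy state potential: negative Manhattan distance to a goal cell.

        In PBIM, the analogous quantity would be derived from the intrinsic
        reward stream rather than hand-specified like this.
        """
        goal = (7, 7)
        return -(abs(state[0] - goal[0]) + abs(state[1] - goal[1]))

    def shaped_reward(extrinsic_reward, state, next_state):
        """Return r + F(s, s'), where F = gamma * phi(s') - phi(s).

        Because F telescopes along any trajectory, adding it leaves the
        set of optimal policies unchanged (Ng et al., 1999).
        """
        return extrinsic_reward + GAMMA * phi(next_state) - phi(state)

    # Example: a sparse-reward transition where shaping supplies a gradient.
    r = shaped_reward(0.0, state=(2, 3), next_state=(3, 3))

The limitation that PBIM targets is visible in this sketch: the guarantee requires phi to be a fixed function of state alone, whereas IM rewards such as count-based bonuses or curiosity losses depend on histories and learned parameters, which is exactly the more general setting the paper's extension addresses.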


Published In

AAMAS '24: Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems
May 2024
2898 pages
ISBN: 9798400704864

Publisher

International Foundation for Autonomous Agents and Multiagent Systems

Richland, SC

Publication History

Published: 06 May 2024

Author Tags

  1. game-playing agents
  2. intrinsic motivation
  3. potential-based reward shaping
  4. reinforcement learning
  5. reward shaping

Qualifiers

  • Research-article

Conference

AAMAS '24

Acceptance Rates

Overall acceptance rate: 1,155 of 5,036 submissions (23%)
