DyLam : a dynamic reward weighting method for reinforcement learning policy gradient algorithms

MACHADO, Mateus Gonçalves

Please use this identifier to cite or link to this item: https://repositorio.ufpe.br/handle/123456789/46630

Title:	DyLam : a dynamic reward weighting method for reinforcement learning policy gradient algorithms
Authors:	MACHADO, Mateus Gonçalves
Keywords:	Computer engineering; Learning
Issue Date:	7-Jun-2022
Publisher:	Universidade Federal de Pernambuco
Citation:	MACHADO, Mateus Gonçalves. DyLam: a dynamic reward weighting method for reinforcement learning policy gradient algorithms. 2022. Dissertação (Mestrado em Ciência da Computação) – Universidade Federal de Pernambuco, Recife, 2022.
Abstract:	Reinforcement Learning (RL) is an emerging subfield of Machine Learning in which an agent interacts with an environment and leverages its experience to learn, by trial and error, which actions are most appropriate for each state. At each step, the agent receives a positive or negative reward signal, which is the main feedback used for learning. RL has applications in many areas, such as robotics, stock trading, and even cooling systems, and has achieved superhuman performance in board games (Chess and Go) and video games (Atari games, Dota 2, and StarCraft II). However, RL methods still struggle in environments with sparse rewards. For example, an agent in a soccer game may receive very few rewards for scoring goals, which makes it hard to associate rewards (goals) with the actions that led to them. To circumvent this problem, researchers frequently introduce multiple intermediate rewards to help learning. However, adequately combining multiple rewards into the single reward signal used by RL methods is often not an easy task. This work aims to solve this specific problem by introducing DyLam. It extends existing policy gradient methods by decomposing the reward function used in the environment and dynamically weighting each component as a function of the agent’s performance on the associated task. We prove the convergence of the proposed method and show empirically that it outperforms competing methods in the evaluated environments in terms of learning speed and, in some cases, final performance.
URI:	https://repositorio.ufpe.br/handle/123456789/46630
Appears in Collections:	Dissertações de Mestrado - Ciência da Computação
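
Illustrative note: the abstract describes decomposing the reward into components and weighting each one according to the agent's performance on the associated task. The sketch below shows one way such a scheme could look. It is not taken from the dissertation: the softmax rule, the function names dynamic_weights and combine_rewards, and the targets and temperature parameters are all assumptions made for illustration, and the actual DyLam update may differ.

```python
import math


def dynamic_weights(performance, targets, temperature=1.0):
    """Hypothetical weighting rule (an assumption, not DyLam's formula).

    performance: recent average return of each reward component.
    targets: positive value at which each component counts as "solved".
    Components that are further from their target receive more weight.
    """
    # Normalized gap in [0, 1]: 0 when the target is met, 1 when far from it.
    gaps = [max(0.0, min(1.0, (t - p) / t)) for p, t in zip(performance, targets)]
    # Softmax over the gaps so that the weights are positive and sum to 1.
    exps = [math.exp(g / temperature) for g in gaps]
    total = sum(exps)
    return [e / total for e in exps]


def combine_rewards(components, weights):
    """Collapse the reward components into the single scalar used by the learner."""
    return sum(w * r for w, r in zip(weights, components))


# Example: the first component lags behind its target, so it is weighted more.
weights = dynamic_weights(performance=[0.2, 0.9], targets=[1.0, 1.0])
scalar_reward = combine_rewards([0.1, 0.5], weights)
```

In a scheme like this, the weights would be recomputed periodically during training as the per-component performance estimates change, so the scalar reward shifts emphasis toward the tasks the agent has not yet mastered.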

Files in This Item:

File	Description	Size	Format
DISSERTAÇÃO Mateus Gonçalves Machado.pdf		7.09 MB	Adobe PDF

This item is protected by original copyright

This item is licensed under a Creative Commons License