DyLam : a dynamic reward weighting method for reinforcement learning policy gradient algorithms

MACHADO, Mateus Gonçalves

Use este identificador para citar ou linkar para este item: https://repositorio.ufpe.br/handle/123456789/46630

Compartilhe esta página

Título:	DyLam : a dynamic reward weighting method for reinforcement learning policy gradient algorithms
Autor(es):	MACHADO, Mateus Gonçalves
Palavras-chave:	Engenharia da computação; Aprendizagem
Data do documento:	7-Jun-2022
Editor:	Universidade Federal de Pernambuco
Citação:	MACHADO, Mateus Gonçalves. DyLam: a dynamic reward weighting method for reinforcement learning policy gradient algorithms. 2022. Dissertação (Mestrado em Ciência da Computação) – Universidade Federal de Pernambuco, Recife, 2022.
Abstract:	Reinforcement Learning (RL) is an emergent subfield of Machine Learning in which an agent interacts with an environment and leverages their experiences to learn, by trial and error, which actions are the most appropriate for each state. At each step the agent receives a positive or negative reward signal, which is the main feedback used for learning. RL finds applications in many areas, such as robotics, stock exchange, and even in cooling systems, presenting superhuman performance in learning to play board games (Chess and Go) and video games (Atari Games, Dota2, and StarCraft2). However, RL methods still struggle in environments with sparse rewards. For example, an agent may receive very few goal score rewards in a soccer game. Thus, it is hard to associate rewards (goals) with actions. Researchers frequently introduce multiple intermediary rewards to help learning and circumvent this problem. However, adequately combining multiple rewards to compose the unique reward signal used by the RL methods frequently is not an easy task. This work aims to solve this specific problem by introducing DyLam. It extends existing policy gradient methods by decomposing the reward function used in the environment and dynamically weighting each component as a function of the agent’s performance on the associated task. We prove the convergence of the proposed method and show empirically that it overcomes competitor methods in the environments evaluated in terms of learning speed and, in some cases, the final performance.
URI:	https://repositorio.ufpe.br/handle/123456789/46630
Aparece nas coleções:	Dissertações de Mestrado - Ciência da Computação

Arquivos associados a este item:

Arquivo	Descrição	Tamanho	Formato
DISSERTAÇÃO Mateus Gonçalves Machado.pdf		7,09 MB	Adobe PDF	Visualizar/Abrir

Este arquivo é protegido por direitos autorais

Ver licença

Mostrar registro completo do item Recomendar este item Visualizar estatísticas

Este item está licenciada sob uma Licença Creative Commons