Please use this identifier to cite or link to this item: https://repositorio.ufpe.br/handle/123456789/46630


Title: DyLam: a dynamic reward weighting method for reinforcement learning policy gradient algorithms
Author: MACHADO, Mateus Gonçalves
Keywords: Computer engineering; Learning
Publication date: 7-Jun-2022
Publisher: Universidade Federal de Pernambuco
Citation: MACHADO, Mateus Gonçalves. DyLam: a dynamic reward weighting method for reinforcement learning policy gradient algorithms. 2022. Dissertação (Mestrado em Ciência da Computação) – Universidade Federal de Pernambuco, Recife, 2022.
Abstract: Reinforcement Learning (RL) is an emergent subfield of Machine Learning in which an agent interacts with an environment and leverages its experience to learn, by trial and error, which actions are most appropriate for each state. At each step the agent receives a positive or negative reward signal, which is the main feedback used for learning. RL finds applications in many areas, such as robotics, stock trading, and even cooling systems, and has achieved superhuman performance in learning to play board games (Chess and Go) and video games (Atari games, Dota 2, and StarCraft II). However, RL methods still struggle in environments with sparse rewards. For example, an agent may receive very few goal-scoring rewards in a soccer game, making it hard to associate rewards (goals) with actions. To circumvent this problem, researchers frequently introduce multiple intermediate rewards to aid learning. However, adequately combining multiple rewards into the single reward signal used by RL methods is frequently not an easy task. This work aims to solve this specific problem by introducing DyLam. It extends existing policy gradient methods by decomposing the reward function used in the environment and dynamically weighting each component as a function of the agent's performance on the associated task. We prove the convergence of the proposed method and show empirically that it outperforms competing methods in the evaluated environments in terms of learning speed and, in some cases, final performance.
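The abstract describes a scalarization scheme in which each reward component's weight adapts to the agent's measured performance on the associated task. Below is a minimal sketch of such a scheme, not the dissertation's exact formulation: per-component episode returns are tracked with an exponential moving average, and weights are set inversely to each component's normalized progress, so less-solved tasks receive more emphasis. The class name, the EMA smoothing, the per-component return bounds, and the linear weighting rule are all illustrative assumptions.

import numpy as np

class DyLamWeighter:
    """Sketch of dynamic reward weighting in the spirit of DyLam (assumed form)."""

    def __init__(self, n_components, r_min, r_max, tau=0.995):
        self.r_min = np.asarray(r_min, dtype=float)  # worst expected return per component
        self.r_max = np.asarray(r_max, dtype=float)  # best expected return per component
        self.tau = tau                               # EMA smoothing factor (assumed)
        self.ema = self.r_min.copy()                 # running per-component return estimate

    def update(self, episode_returns):
        """Fold one episode's per-component returns into the performance estimate."""
        self.ema = self.tau * self.ema + (1.0 - self.tau) * np.asarray(episode_returns, dtype=float)

    def weights(self):
        """Weight each component inversely to how 'solved' it already is."""
        progress = np.clip((self.ema - self.r_min) / (self.r_max - self.r_min), 0.0, 1.0)
        w = 1.0 - progress             # unsolved tasks get more weight
        return w / max(w.sum(), 1e-8)  # normalize so the weights sum to 1

    def scalarize(self, reward_vector):
        """Combine a vector reward into the scalar used by the policy-gradient update."""
        return float(np.dot(self.weights(), reward_vector))

In a training loop, one would call update() at the end of each episode with the vector of per-component returns, and scalarize() on each step's reward vector before the policy-gradient update.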
URI: https://repositorio.ufpe.br/handle/123456789/46630
Appears in collections: Dissertações de Mestrado - Ciência da Computação

Files in this item:
File: DISSERTAÇÃO Mateus Gonçalves Machado.pdf (7,09 MB, Adobe PDF)


This item is protected by original copyright



This item is licensed under a Creative Commons License