k-Certainty Exploration Method: an action selector to identify the environment in reinforcement learning

Authors:

Highlights:

Abstract:

Reinforcement learning aims to adapt an agent to an unknown environment according to rewards. Two issues must be handled: delayed reward and uncertainty. Q-learning is a representative reinforcement learning method. It is used in many works since it can learn an optimum policy. However, Q-learning needs numerous trials to converge to an optimum policy. If the target environment can be described as a Markov decision process, we can identify it from statistics of sensor-action pairs. Once we have built the correct environment model, we can derive an optimum policy with the Policy Iteration Algorithm. Therefore, we can construct an optimum policy efficiently by identifying the environment. We separate the learning process into two phases: identifying the environment and determining an optimum policy. We propose the k-Certainty Exploration Method for identifying the environment; after that, an optimum policy is determined by the Policy Iteration Algorithm. We call a rule k-certain if and only if it has been selected k times or more. The k-Certainty Exploration Method excludes any loop consisting only of rules that have already attained k-certainty. We show its effectiveness by comparing it with Q-learning in two experiments: one uses Sutton's maze-like environment, and the other uses an original environment in which the optimum policy varies according to a parameter.
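The abstract itself gives no code, but the counting idea behind k-certainty is simple enough to illustrate. The following is a minimal sketch, not the authors' implementation: the class name `KCertaintySelector`, the method names, and the fallback to a uniformly random action are assumptions for illustration, and the paper's mechanism for excluding loops of already k-certain rules is not reproduced here; the sketch only shows counting rule selections, preferring rules that are not yet k-certain, and accumulating the transition statistics needed to build an environment model.

```python
import random
from collections import defaultdict

class KCertaintySelector:
    """Illustrative sketch of a k-certainty style action selector.

    A (state, action) rule is regarded as k-certain once it has been
    selected at least k times. During the identification phase the
    selector prefers rules that are not yet k-certain, so every rule
    is eventually tried k times and transition statistics for an
    environment model are collected. (Assumed design; the paper's
    loop-exclusion step is omitted.)
    """

    def __init__(self, actions, k):
        self.actions = list(actions)
        self.k = k
        self.counts = defaultdict(int)  # (state, action) -> times selected
        self.transitions = defaultdict(lambda: defaultdict(int))  # model statistics

    def is_k_certain(self, state, action):
        return self.counts[(state, action)] >= self.k

    def select(self, state):
        # Prefer actions whose rule (state, action) is not yet k-certain;
        # fall back to a random action once all rules in this state are k-certain.
        uncertain = [a for a in self.actions if not self.is_k_certain(state, a)]
        candidates = uncertain if uncertain else self.actions
        return random.choice(candidates)

    def observe(self, state, action, next_state):
        # Record the selection and the observed transition for the model.
        self.counts[(state, action)] += 1
        self.transitions[(state, action)][next_state] += 1

    def identified(self, states):
        # The environment is treated as identified once every rule is k-certain.
        return all(self.is_k_certain(s, a) for s in states for a in self.actions)
```

Once `identified()` holds, the estimated transition statistics would be handed to a standard Policy Iteration routine to compute the optimum policy, matching the two-phase scheme described in the abstract.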

Keywords: Reinforcement learning, Q-learning, Markov decision processes, Policy Iteration Algorithm, k-Certainty Exploration Method

Article history: Available online 19 May 1998.

Article URL: https://doi.org/10.1016/S0004-3702(96)00062-8