MDMD options discovery for accelerating exploration in sparse-reward domains

Authors:

Highlights:

Abstract

One of the main challenges in reinforcement learning is how an agent explores an environment with sparse rewards to learn the optimal policy. Although many option discovery methods have been proposed to improve exploration in sparse-reward domains, how to accelerate exploration in a near-optimal manner remains an open question. Recently, covering options was proposed to find a set of options that reduce the expected cover time of the environment, i.e., the expected number of steps required to visit every state in the environment. Specifically, covering options constructs options from the eigenvectors of the graph Laplacian matrix to minimize the environment's expected cover time. However, computing the full graph Laplacian matrix directly usually has high computational time complexity, especially for large sparse graphs, so this method does not adequately solve the problem of accelerating exploration in sparse-reward domains. In this paper, we propose a new option discovery method, Min Degree and Max Distance (MDMD) options, which accelerates exploration in sparse-reward domains by reducing the environment's expected cover time. Specifically, our algorithm heuristically selects two nonadjacent vertices of the state-transition graph with the minimum degree and the maximum distance to construct options. The generated options provably reduce the environment's expected cover time using the transition function learned by the agent. Because it requires neither the graph Laplacian matrix nor its eigenvectors, our method accelerates exploration in sparse-reward domains. Experimental results on six challenging sparse-reward environments show that our approach significantly accelerates exploration and thus obtains a higher total cumulative reward than other option discovery methods.
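The following is a minimal sketch, in Python with networkx, of the vertex-selection step the abstract describes (two nonadjacent vertices with minimum degree and maximum distance). The function name `select_mdmd_pair`, the candidate-set size, and the tie-breaking order are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the MDMD vertex-selection heuristic described in the
# abstract; names and the candidate-set cutoff are assumptions for illustration.
import networkx as nx


def select_mdmd_pair(graph: nx.Graph):
    """Pick two nonadjacent vertices, preferring low degree and then maximum
    shortest-path distance, to serve as the endpoints of a new option."""
    # Sort vertices by degree so poorly connected states are considered first.
    by_degree = sorted(graph.nodes, key=graph.degree)
    # Restrict attention to a small low-degree candidate set (an assumption
    # made here to keep the pairwise search cheap).
    candidates = by_degree[: max(2, len(by_degree) // 4)]

    # Precompute shortest-path lengths once for the distance comparison.
    lengths = dict(nx.all_pairs_shortest_path_length(graph))

    best_pair, best_dist = None, -1
    for i, u in enumerate(candidates):
        for v in candidates[i + 1:]:
            if graph.has_edge(u, v):
                continue  # only nonadjacent pairs are eligible
            dist = lengths[u].get(v)
            if dist is not None and dist > best_dist:
                best_pair, best_dist = (u, v), dist
    return best_pair


# Usage: build a state-transition graph from observed transitions, then connect
# the selected pair with an option. A grid graph stands in for a learned graph.
G = nx.grid_2d_graph(5, 5)
print(select_mdmd_pair(G))
```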

Keywords: Reinforcement learning, Option discovery, Exploration, Sparse reward

Article history: Received 18 April 2021, Revised 13 December 2021, Accepted 3 January 2022, Available online 10 January 2022, Version of Record 15 February 2022.

DOI: https://doi.org/10.1016/j.knosys.2022.108151