Maximum reward reinforcement learning: A non-cumulative reward criterion

作者:

Highlights:

摘要

Existing reinforcement learning paradigms proposed in the literature are guided by two performance criteria; namely: the expected cumulative-reward, and the average reward criteria. Both of these criteria assume an inherently present cumulative or additivity of the rewards. However, such inherent cumulative of the rewards is not a definite necessity in some contexts. Two possible scenarios are presented in this paper, and are summarized as follows. The first concerns with learning of an optimal policy that is further away in the existence of a sub-optimal policy that is nearer. The cumulative rewards paradigms suffer from slower convergence due to the influence of accumulating the lower rewards, and take time to fade away the effect of the sub-optimal policy. The second scenario concerns with approximating the supremum values of the payoffs of an optimal stopping problem. The payoffs are non-cumulative in nature, and thus the cumulative rewards paradigm is not applicable to resolve this. Hence, a non-cumulative reward reinforcement-learning paradigm is needed in these application contexts. A maximum reward criterion is proposed in this paper, and the resulting reinforcement-learning model with this learning criterion is termed the maximum reward reinforcement learning. The maximum reward reinforcement learning considers the learning of non-cumulative rewards problem, where the agent exhibits a maximum reward-oriented behavior towards the largest rewards in the state-space. Intermediate lower rewards that lead to sub-optimal policies are ignored in this learning paradigm. The maximum reward reinforcement learning is subsequently modeled with the FITSK-RL model. Finally, the model is applied to an optimal stopping problem with a nature of non-cumulative rewards, and its performance is encouraging when benchmarked against other model.

论文关键词:Maximum reward,Reinforcement learning,Non-cumulative reward,FITSK-RL,Optimal stopping problem,Financial derivative pricing,Two-cycle task,Maximum reward-oriented behaviour

论文评审过程:Available online 11 October 2005.

论文官网地址:https://doi.org/10.1016/j.eswa.2005.09.054