Partially observable Markov decision processes with imprecise parameters

摘要

This study extends the framework of partially observable Markov decision processes (POMDPs) to allow their parameters, i.e., the probability values in the state transition functions and the observation functions, to be imprecisely specified. It is shown that this extension can reduce the computational costs associated with the solution of these problems. First, the new framework, POMDPs with imprecise parameters (POMDPIPs), is formulated. We consider (1) the interval case, in which each parameter is imprecisely specified by an interval that indicates possible values of the parameter, and (2) the point-set case, in which each probability distribution is imprecisely specified by a set of possible distributions. Second, a new optimality criterion for POMDPIPs is introduced. As in POMDPs, the criterion is to regard a policy, i.e., an action-selection rule, as optimal if it maximizes the expected total reward. The expected total reward, however, cannot be calculated precisely in POMDPIPs, because of the parameter imprecision. Instead, we estimate the total reward by adopting arbitrary second-order beliefs, i.e., beliefs in the imprecisely specified state transition functions and observation functions. Although there are many possible choices for these second-order beliefs, we regard a policy as optimal as long as there is at least one of such choices with which the policy maximizes the total reward. Thus there can be multiple optimal policies for a POMDPIP. We regard these policies as equally optimal, and aim at obtaining one of them. By appropriately choosing which second-order beliefs to use in estimating the total reward, computational costs incurred in obtaining such an optimal policy can be reduced significantly. We provide an exact solution algorithm for POMDPIPs that does this efficiently. Third, the performance of such an optimal policy, as well as the computational complexity of the algorithm, are analyzed theoretically. Last, empirical studies show that our algorithm quickly obtains satisfactory policies to many POMDPIPs.