Slowly Changing Adversarial Bandit Algorithms are Provably Efficient for Discounted MDPs
Reinforcement learning (RL) generalizes bandit problems with the additional difficulties of a longer planning horizon and an unknown transition kernel. We show that, under some mild assumptions, any slowly changing adversarial bandit algorithm that enjoys near-optimal regret in adversarial bandits can also achieve near-optimal (expected) regret in non-episodic discounted MDPs. The slowly changing property required by our generalization is mild; see, e.g., (Even-Dar et al., 2009; Neu et al., 2010). We also show, for example, that the algorithm of Auer et al. (2002) is slowly changing and hence enjoys near-optimal regret in MDPs.
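To give an informal sense of the slowly changing property, the sketch below runs an Exp3-style exponential-weights update (in the spirit of Auer et al., 2002) and tracks how much the action distribution moves between consecutive rounds. This is only an illustration under assumed parameter choices and a stand-in reward sequence, not the paper's formal definition or construction.

```python
# Illustrative sketch (not from the paper): an Exp3-style update with a small
# learning rate, showing that its action distribution changes slowly between
# rounds. The environment, gamma, and eta below are hypothetical choices.
import numpy as np

rng = np.random.default_rng(0)
K = 5            # number of arms
T = 2000         # number of rounds
gamma = 0.1      # uniform-exploration mixing parameter (assumed)
eta = gamma / K  # learning rate tied to gamma, as in standard Exp3

weights = np.ones(K)
prev_p = None
max_shift = 0.0  # largest one-round change of the policy (L1 distance)

for t in range(T):
    # Mix the exponential-weights distribution with uniform exploration.
    p = (1 - gamma) * weights / weights.sum() + gamma / K
    if prev_p is not None:
        max_shift = max(max_shift, np.abs(p - prev_p).sum())
    prev_p = p

    arm = rng.choice(K, p=p)
    reward = rng.uniform(0.2, 0.8)   # stand-in for an adversarial reward
    est = np.zeros(K)
    est[arm] = reward / p[arm]       # importance-weighted reward estimate
    weights *= np.exp(eta * est)     # multiplicative weight update

print(f"largest one-round policy change (L1): {max_shift:.4f}")
```

Because the exploration term keeps each probability bounded below and the learning rate is small, the per-round shift of the policy stays small, which is the kind of behavior the slowly changing assumption is meant to capture.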