Slowly Changing Adversarial Bandit Algorithms are Provably Efficient for Discounted MDPs

05/18/2022
by Ian A. Kash, et al.

Reinforcement learning (RL) generalizes bandit problems with the additional difficulties of a longer planning horizon and an unknown transition kernel. We show that, under some mild assumptions, any slowly changing adversarial bandit algorithm that achieves near-optimal regret in adversarial bandits can also achieve near-optimal (expected) regret in non-episodic discounted MDPs. The slowly changing property required by our generalization is mild and appears in prior work, see e.g. Even-Dar et al. (2009) and Neu et al. (2010). We also show, for example, that the Exp3 algorithm of Auer et al. (2002) is slowly changing and enjoys near-optimal regret in MDPs.
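As a rough illustration only (not code from the paper), the sketch below implements Exp3 (Auer et al. 2002) and checks that, with a small learning rate, the total-variation distance between consecutive action distributions stays O(gamma) per step, which is the flavor of "slowly changing" the abstract refers to. The class name, parameter values, and the total-variation check are our own choices for the demo, not notation from the paper.

```python
import numpy as np

class Exp3:
    """Minimal Exp3 sketch (exponential weights with uniform exploration)."""

    def __init__(self, n_actions, gamma=0.01, seed=0):
        self.K = n_actions
        self.gamma = gamma           # exploration / learning rate
        self.w = np.ones(n_actions)  # exponential weights
        self.rng = np.random.default_rng(seed)

    def probs(self):
        # Mix normalized weights with uniform exploration.
        return (1 - self.gamma) * self.w / self.w.sum() + self.gamma / self.K

    def act(self):
        return self.rng.choice(self.K, p=self.probs())

    def update(self, action, reward):
        # Importance-weighted reward estimate, then multiplicative update.
        x_hat = reward / self.probs()[action]
        self.w[action] *= np.exp(self.gamma * x_hat / self.K)

# Demo of the slowly changing property: per-step drift of the action
# distribution is O(gamma) (at most e * gamma / K for rewards in [0, 1]).
algo = Exp3(n_actions=4, gamma=0.01)
for t in range(100):
    p_before = algo.probs()
    a = algo.act()
    algo.update(a, reward=algo.rng.random())  # arbitrary rewards in [0, 1)
    tv = 0.5 * np.abs(algo.probs() - p_before).sum()
    assert tv <= algo.gamma
```

With rewards in [0, 1] and K >= 3 arms, a short calculation bounds the per-step total-variation change by roughly e * gamma / K, so the assertion above holds; bounded per-step drift of this kind is the mild condition the result asks of the bandit algorithm.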

