Sample-Efficient Reinforcement Learning with loglog(T) Switching Cost

02/13/2022
by Dan Qiao, et al.

We study the problem of reinforcement learning (RL) with low (policy) switching cost: a problem well motivated by real-life RL applications in which deploying a new policy is costly and the number of policy updates must be kept low. In this paper, we propose a new algorithm based on stage-wise exploration and adaptive policy elimination that achieves a regret of O(√(H^4S^2AT)) while requiring a switching cost of only O(HSA loglog T). This is an exponential improvement over the best-known switching cost of O(H^2SA log T) among existing methods with O(poly(H,S,A)√T) regret. Here, S and A denote the numbers of states and actions in an H-horizon episodic Markov Decision Process with unknown transitions, and T is the number of steps. We also prove an information-theoretic lower bound showing that a switching cost of Ω(HSA) is required for any no-regret algorithm. As a byproduct, our new algorithmic techniques yield a reward-free exploration algorithm with an optimal switching cost of O(HSA).
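The abstract does not spell out how the loglog T switching cost arises, but a standard mechanism (used, e.g., in batched bandits) is a doubly-exponential stage schedule: stage endpoints grow as t_i = 2^(2^i), so only O(loglog T) stages, and hence O(loglog T) policy updates per (h, s, a) triple, fit inside a horizon of T steps. The sketch below is a hypothetical illustration of that counting argument, not the paper's actual algorithm.

```python
def num_stages(T):
    """Count stages when endpoints grow doubly exponentially: t_i = 2^(2^i).

    Illustrative sketch (assumption, not the paper's algorithm): if the
    policy is updated only at stage boundaries, the number of updates is
    the number of stages, which is ceil(log2(log2(T))) = O(loglog T).
    """
    stages = 0
    t = 2  # first endpoint: 2^(2^0)
    while t < T:
        t = t * t  # squaring the endpoint doubles the exponent
        stages += 1
    return stages

# For T = 10^6, only 5 stages are needed (log2(log2(10^6)) ≈ 4.3):
print(num_stages(10**6))  # → 5
```

With this schedule, even T = 10^12 steps require just 6 stages, which is why stage-based updating can turn a log T switching cost into loglog T.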
