An Adiabatic Theorem for Policy Tracking with TD-learning

10/24/2020
by Neil Walton et al.

We evaluate the ability of temporal difference learning to track the reward function of a policy as it changes over time. Our results apply a new adiabatic theorem that bounds the mixing time of time-inhomogeneous Markov chains. We derive finite-time bounds for tabular temporal difference learning and Q-learning when the policy used for training changes in time. To achieve this, we develop bounds for stochastic approximation under asynchronous adiabatic updates.
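To make the setting concrete, here is a minimal sketch (not the paper's construction) of asynchronous tabular TD(0) tracking a slowly drifting policy. The MDP, the linear drift schedule between two endpoint policies, the step size, and all constants are illustrative assumptions chosen only to show the "adiabatic" regime, where the policy changes slowly relative to the chain's mixing time.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2

# Random transition kernel P[s, a] -> distribution over next states,
# and a fixed reward for each (s, a) pair (illustrative choices).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

gamma = 0.9   # discount factor
alpha = 0.05  # constant step size, so the estimate can keep tracking
T = 20_000    # number of asynchronous updates

# Two endpoint policies; the behaviour policy drifts slowly from one
# to the other, mimicking an adiabatic (slow) change.
pi_start = rng.dirichlet(np.ones(n_actions), size=n_states)
pi_end = rng.dirichlet(np.ones(n_actions), size=n_states)

def policy_at(t):
    """Linearly interpolated policy at time t (slow drift)."""
    w = t / T
    return (1.0 - w) * pi_start + w * pi_end

V = np.zeros(n_states)  # tabular value estimate
s = 0

for t in range(T):
    pi = policy_at(t)
    a = rng.choice(n_actions, p=pi[s])
    s_next = rng.choice(n_states, p=P[s, a])
    r = R[s, a]
    # Asynchronous TD(0) update: only the visited state's entry changes.
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    s = s_next

print("Tracked value estimate:", np.round(V, 3))
```

A constant step size is used rather than a decaying one: a decaying step size would freeze the estimate and prevent it from tracking the moving target, which is the regime the finite-time bounds above are concerned with.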
