Online Learning in Adversarial MDPs: Is the Communicating Case Harder than Ergodic?

11/03/2021
by   Gautam Chandrasekaran, et al.

We study online learning in adversarial communicating Markov decision processes (MDPs) with full information. When the transitions are deterministic, we give an algorithm that achieves O(√T) regret with respect to the best fixed deterministic policy in hindsight. We also prove a regret lower bound for this setting that is tight up to polynomial factors in the MDP parameters. Finally, we give an inefficient algorithm that achieves O(√T) regret in general communicating MDPs, under an additional mild restriction on the transition dynamics.
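The benchmark in the abstract — regret against the best fixed deterministic policy in hindsight — can be made concrete with a small sketch. The two-state MDP, the loss sequence, and the policy enumeration below are illustrative assumptions, not the paper's construction or algorithm; the sketch only defines the quantity the learner's cumulative loss is compared against.

```python
import itertools
import random

# Hypothetical two-state, two-action deterministic MDP (illustrative only):
# next_state[s][a] is the unique state reached by playing action a in state s.
next_state = [[0, 1], [1, 0]]
NUM_STATES, NUM_ACTIONS, T = 2, 2, 200

# Full information: the learner observes the whole loss function after each
# round. Here the "adversarial" losses are just random numbers in [0, 1].
random.seed(0)
losses = [[[random.random() for _ in range(NUM_ACTIONS)]
           for _ in range(NUM_STATES)] for _ in range(T)]

# Deterministic stationary policies map each state to one action.
policies = list(itertools.product(range(NUM_ACTIONS), repeat=NUM_STATES))

def run_policy(pi, start=0):
    """Cumulative loss of a fixed deterministic policy on the loss sequence."""
    s, total = start, 0.0
    for t in range(T):
        a = pi[s]
        total += losses[t][s][a]
        s = next_state[s][a]  # transition is deterministic
    return total

# Benchmark: best fixed deterministic policy in hindsight. The regret of any
# learner is its cumulative loss minus this quantity; the paper's algorithm
# guarantees the regret grows only as O(sqrt(T)).
best_in_hindsight = min(run_policy(pi) for pi in policies)
```

Note that a learner's state trajectory depends on its own past actions, so achieving O(√T) regret against this benchmark is substantially harder than in the experts setting; the sketch deliberately stops at defining the comparator.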
