Trusted Approximate Policy Iteration with Bisimulation Metrics
Bisimulation metrics define a distance measure between states of a Markov decision process (MDP) based on a comparison of reward sequences. Due to this property, they provide theoretical guarantees for value function approximation. In this work, we first prove that bisimulation metrics can be defined via any p-Wasserstein metric for p ≥ 1. We then describe an approximate policy iteration (API) procedure that uses ϵ-aggregation with π-bisimulation and prove performance bounds for continuous state spaces. We also bound the difference between π-bisimulation metrics in terms of the change in the policies themselves. Based on these theoretical results, we design an API(α) procedure that employs conservative policy updates and enjoys better performance bounds than the naive API approach. In addition, we propose a novel trust region approach which circumvents the requirement to explicitly solve a constrained optimization problem. Finally, we provide experimental evidence of improved stability compared to non-conservative alternatives in simulated continuous control.
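To make the central object concrete, the following is a minimal sketch (not the paper's implementation) of the π-bisimulation fixed-point iteration on a small finite MDP, with the 1-Wasserstein distance (the p = 1 case mentioned above) solved as a linear program. The names `R_pi`, `P_pi`, and the helper `wasserstein_1` are illustrative assumptions, not symbols from the paper.

```python
# Hedged sketch: pi-bisimulation metric via fixed-point iteration,
# d(s, s') <- |R_pi[s] - R_pi[s']| + gamma * W_1(P_pi[s], P_pi[s']; d),
# on a finite state space, using scipy's LP solver for the W_1 coupling.
import numpy as np
from scipy.optimize import linprog


def wasserstein_1(p, q, cost):
    """W_1 between discrete distributions p and q under ground-cost matrix `cost`."""
    n, m = len(p), len(q)
    c = cost.reshape(-1)                      # objective: sum_ij T_ij * cost_ij
    # Row-marginal constraints: sum_j T_ij = p_i
    A_rows = np.zeros((n, n * m))
    for i in range(n):
        A_rows[i, i * m:(i + 1) * m] = 1.0
    # Column-marginal constraints: sum_i T_ij = q_j
    A_cols = np.zeros((m, n * m))
    for j in range(m):
        A_cols[j, j::m] = 1.0
    res = linprog(c, A_eq=np.vstack([A_rows, A_cols]),
                  b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return res.fun


def pi_bisimulation_metric(R_pi, P_pi, gamma, n_iters=200, tol=1e-6):
    """Fixed-point iteration for the pi-bisimulation metric on a finite MDP.

    R_pi : (n,) expected reward under the policy, per state.
    P_pi : (n, n) state-transition matrix induced by the policy.
    """
    n = len(R_pi)
    d = np.zeros((n, n))
    for _ in range(n_iters):
        d_new = np.zeros_like(d)
        for s in range(n):
            for t in range(n):
                d_new[s, t] = abs(R_pi[s] - R_pi[t]) \
                    + gamma * wasserstein_1(P_pi[s], P_pi[t], d)
        if np.max(np.abs(d_new - d)) < tol:
            return d_new
        d = d_new
    return d
```

The iteration is a γ-contraction, so it converges from d = 0; the abstract's p-Wasserstein result would correspond to swapping the transport objective accordingly. A conservative update in the spirit of the API(α) scheme can be read as mixing the improved policy with the current one, π_{k+1} = (1 − α)π_k + α G(π_k); the exact form used in the paper is given in the full text.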