Fast Global Convergence of Policy Optimization for Constrained MDPs

10/31/2021
by Tao Liu, et al.

We address the issue of safety in reinforcement learning. We pose the problem in a discounted infinite-horizon constrained Markov decision process (CMDP) framework. Existing results have shown that gradient-based methods achieve an 𝒪(1/√(T)) global convergence rate for both the optimality gap and the constraint violation. We exhibit a natural-policy-gradient-based algorithm that achieves a faster 𝒪(log(T)/T) convergence rate for both the optimality gap and the constraint violation. When Slater's condition is satisfied and known a priori, zero constraint violation can further be guaranteed for sufficiently large T while maintaining the same convergence rate.
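To make the setting concrete, below is a minimal sketch of a primal-dual natural policy gradient (NPG) iteration on a small tabular CMDP, in the spirit of the abstract but not the paper's exact algorithm. The MDP sizes, step sizes `eta` and `eta_dual`, and the constraint threshold `b` are illustrative assumptions; with a softmax policy, the tabular NPG step on the Lagrangian reward reduces to a multiplicative-weights update.

```python
# Sketch only: primal-dual NPG on a random tabular CMDP (assumed setup,
# not the paper's algorithm). Constraint: V_g(rho) >= b.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a, s']
r = rng.uniform(size=(S, A))                 # reward r(s, a)
g = rng.uniform(size=(S, A))                 # utility g(s, a)
b = 0.3 / (1 - gamma)                        # illustrative constraint threshold
rho = np.ones(S) / S                         # initial state distribution

def q_values(pi, f):
    """Exact Q-function of payoff f(s, a) under policy pi (small-MDP solve)."""
    P_pi = np.einsum("sap,sa->sp", P, pi)    # state-to-state kernel under pi
    f_pi = (pi * f).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, f_pi)
    return f + gamma * P @ V

def value(pi, f):
    """V^pi(rho) for payoff f."""
    P_pi = np.einsum("sap,sa->sp", P, pi)
    f_pi = (pi * f).sum(axis=1)
    return rho @ np.linalg.solve(np.eye(S) - gamma * P_pi, f_pi)

pi = np.full((S, A), 1.0 / A)                # start from the uniform policy
lam, eta, eta_dual = 0.0, 0.5, 0.5           # dual variable and step sizes (assumed)

for t in range(200):
    # Primal step: tabular NPG on the Lagrangian reward r + lam * g
    # is a multiplicative-weights update under softmax parameterization.
    Q_L = q_values(pi, r + lam * g)
    pi = pi * np.exp(eta * Q_L)
    pi /= pi.sum(axis=1, keepdims=True)
    # Dual step: projected gradient descent on the constraint violation.
    lam = max(0.0, lam - eta_dual * (value(pi, g) - b))

print("reward value:", value(pi, r), "utility value:", value(pi, g), "threshold:", b)
```

Because the Q-functions are computed exactly here, the sketch isolates the optimization dynamics (primal NPG ascent plus dual descent) from estimation error; the paper's rates concern this optimization behavior.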
