Finite-Sample Analysis of Off-Policy TD-Learning via Generalized Bellman Operators
In temporal difference (TD) learning, off-policy sampling is known to be more practical than on-policy sampling, and by decoupling learning from data collection, it enables data reuse. It is known that policy evaluation (including multi-step off-policy importance sampling) can be interpreted as solving a generalized Bellman equation. In this paper, we derive finite-sample bounds for any general off-policy TD-like stochastic approximation algorithm that solves for the fixed point of this generalized Bellman operator. Our key step is to show that the generalized Bellman operator is simultaneously a contraction mapping with respect to a weighted ℓ_p-norm for each p in [1,∞), with a common contraction factor. Off-policy TD-learning is known to suffer from high variance due to the product of importance sampling ratios. A number of algorithms (e.g., Q^π(λ), Tree-Backup(λ), Retrace(λ), and Q-trace) have been proposed in the literature to address this issue. Our results immediately imply finite-sample bounds for these algorithms. In particular, we provide the first known finite-sample guarantees for Q^π(λ), Tree-Backup(λ), and Retrace(λ), and improve the best known bounds for Q-trace in [19]. Moreover, we show the bias-variance trade-offs in each of these algorithms.
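To illustrate why the product of importance sampling ratios is a source of variance, and how the algorithms named above modify it, the following is a minimal sketch rather than the paper's method. It assumes the per-step trace coefficients as they are commonly defined in the literature: the raw ratio π/μ for importance sampling, λ for Q^π(λ), λ·π(a|s) for Tree-Backup(λ), and λ·min(1, π/μ) for Retrace(λ). Q-trace is omitted. The function names `trace_coefficients` and `cumulative_product` and the probabilities in the usage lines are hypothetical, chosen only for illustration.

```python
def trace_coefficients(pi_probs, mu_probs, lam, kind):
    """Per-step trace coefficients c_t for one sampled trajectory.

    pi_probs[t] and mu_probs[t] are the probabilities that the target
    policy and the behavior policy assign to the action taken at step t.
    The coefficient definitions are assumptions taken from the commonly
    cited forms of each algorithm, not from the abstract above.
    """
    coeffs = []
    for pi, mu in zip(pi_probs, mu_probs):
        rho = pi / mu  # importance sampling ratio at this step
        if kind == "importance_sampling":
            c = rho
        elif kind == "q_pi":
            c = lam
        elif kind == "tree_backup":
            c = lam * pi
        elif kind == "retrace":
            c = lam * min(1.0, rho)
        else:
            raise ValueError(f"unknown kind: {kind}")
        coeffs.append(c)
    return coeffs


def cumulative_product(coeffs):
    """Running product c_1 * ... * c_t, which weights the t-th TD error."""
    out, prod = [], 1.0
    for c in coeffs:
        prod *= c
        out.append(prod)
    return out


if __name__ == "__main__":
    # Hypothetical trajectory: the target policy favors the taken actions
    # three times more strongly than the behavior policy does.
    pi_probs = [0.9, 0.9, 0.9]
    mu_probs = [0.3, 0.3, 0.3]
    for kind in ("importance_sampling", "q_pi", "tree_backup", "retrace"):
        coeffs = trace_coefficients(pi_probs, mu_probs, 1.0, kind)
        print(kind, cumulative_product(coeffs))
```

In this sketch the untruncated product grows geometrically with the trajectory length whenever the ratios exceed one, while the Tree-Backup(λ) and Retrace(λ) coefficients never exceed λ, so their products stay bounded. That boundedness is bought with a change in the operator being solved, which is the bias-variance trade-off the abstract refers to.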