Emphatic TD Bellman Operator is a Contraction

08/14/2015
by Assaf Hallak, et al.

Recently, Sutton et al. (2015) introduced the emphatic temporal differences (ETD) algorithm for off-policy evaluation in Markov decision processes. In this short note, we show that the projected fixed-point equation underlying ETD involves a contraction operator with contraction modulus √γ, where γ is the discount factor. This allows us to bound the approximation error of ETD. To our knowledge, these are the first error bounds for an off-policy evaluation algorithm under general target and behavior policies.
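The √γ modulus can be checked numerically on a small example. The sketch below is illustrative only and is not taken from the paper: it assumes the simplest setting (no eligibility traces, unit interest in every state), in which the emphatic weights satisfy f^T = d_μ^T (I − γ P_π)^{-1}, and it uses a randomly generated target-policy transition matrix `P_pi` and a random positive vector `d_mu` as a stand-in for the behavior policy's stationary distribution. All function and variable names are hypothetical. Since two applications of the Bellman operator differ only through the linear part γ P_π, it is enough to compare the f-weighted norms of γ P_π v and v.

```python
import numpy as np


def emphatic_weights(P_pi, d_mu, gamma):
    """Solve f^T = d_mu^T (I - gamma * P_pi)^{-1} for f (assumed setting)."""
    n = P_pi.shape[0]
    # Transposing gives the linear system (I - gamma * P_pi)^T f = d_mu.
    return np.linalg.solve((np.eye(n) - gamma * P_pi).T, d_mu)


def weighted_norm(v, f):
    """Weighted Euclidean norm sqrt(sum_s f_s * v_s^2)."""
    return np.sqrt(np.sum(f * v ** 2))


def max_contraction_ratio(P_pi, f, gamma, rng, trials=1000):
    """Largest observed ||gamma * P_pi v||_f / ||v||_f over random v."""
    n = P_pi.shape[0]
    worst = 0.0
    for _ in range(trials):
        v = rng.standard_normal(n)
        ratio = weighted_norm(gamma * (P_pi @ v), f) / weighted_norm(v, f)
        worst = max(worst, ratio)
    return worst


rng = np.random.default_rng(0)
n, gamma = 6, 0.9

# Random row-stochastic matrix standing in for the target policy's dynamics.
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)

# Random positive distribution standing in for the behavior policy's
# stationary distribution (an assumption made for this sketch).
d_mu = rng.random(n)
d_mu /= d_mu.sum()

f = emphatic_weights(P_pi, d_mu, gamma)
worst = max_contraction_ratio(P_pi, f, gamma, rng)
print("largest observed ratio:", worst)
print("sqrt(gamma):           ", np.sqrt(gamma))
```

In this setting the observed ratio should never exceed √γ. Projection onto a linear feature space is non-expansive in the same f-weighted norm, so composing it with the Bellman operator should not increase the modulus.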
