Addressing Value Estimation Errors in Reinforcement Learning with a State-Action Return Distribution Function
In current reinforcement learning (RL) methods, function approximation errors are known to lead to overestimated or underestimated state-action values Q, which in turn lead to suboptimal policies. We show that learning a state-action return distribution function can improve the estimation accuracy of the Q-value. We combine the distributional return function with the maximum entropy RL framework to develop what we call the Distributional Soft Actor-Critic (DSAC) algorithm, an off-policy method for continuous control settings. Unlike traditional distributional Q algorithms, which typically learn only a discrete return distribution, DSAC directly learns a continuous return distribution by truncating the difference between the target and current return distributions to prevent gradient explosion. Additionally, we propose a new Parallel Asynchronous Buffer-Actor-Learner architecture (PABAL) to improve learning efficiency. We evaluate our method on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.
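The sketch below illustrates the truncation idea described in the abstract: the target return is clipped to lie within a fixed bound of the critic's current estimate before computing the distributional loss, so that outlier targets cannot produce exploding gradients. It is a minimal illustration only, assuming a Gaussian-parameterized return distribution; the function and parameter names (dsac_critic_loss, bound) are hypothetical and not taken from the paper.

```python
import torch
from torch.distributions import Normal

def dsac_critic_loss(mean, std, target_return, bound=10.0):
    """Illustrative distributional critic loss (hypothetical names).

    The critic outputs a Gaussian return distribution N(mean, std).
    The sampled target return is truncated so that its difference from
    the current mean stays within [-bound, bound], keeping the gradient
    of the log-likelihood bounded.
    """
    # Truncate the difference between target and current return estimate.
    clipped_target = mean.detach() + torch.clamp(
        target_return - mean.detach(), -bound, bound
    )
    # Maximize the log-likelihood of the (clipped) target return,
    # i.e. minimize its negative log-probability under N(mean, std).
    dist = Normal(mean, std)
    return -dist.log_prob(clipped_target).mean()

# Toy usage with random tensors standing in for a minibatch.
mean = torch.randn(32, requires_grad=True)
log_std = torch.zeros(32, requires_grad=True)
target = torch.randn(32) * 50.0  # deliberately large targets
loss = dsac_critic_loss(mean, log_std.exp(), target)
loss.backward()
```

Without the clipping step, a single extreme target sample could dominate the Gaussian log-likelihood gradient; truncating the residual bounds the per-sample contribution while leaving ordinary targets unchanged.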