Per-Step Reward: A New Perspective for Risk-Averse Reinforcement Learning
We present a new per-step reward perspective for risk-averse control in a discounted infinite-horizon MDP. Unlike previous work, which uses the variance of the episodic return random variable for risk-averse control, we design a new random variable representing the per-step reward and consider its variance for risk-averse control. The expectation of the per-step reward matches the expectation of the episodic return up to a constant multiplier, and the variance of the per-step reward bounds the variance of the episodic return from above. Furthermore, we derive the mean-variance policy iteration framework under this per-step reward perspective, where all existing policy evaluation methods and risk-neutral control methods can be dropped in off the shelf for risk-averse control, in both on-policy and off-policy settings. We propose risk-averse PPO as an instance of mean-variance policy iteration, and it outperforms PPO in many MuJoCo domains. By contrast, previous risk-averse control methods cannot be easily combined with advanced policy optimization techniques like PPO due to their reliance on the squared episodic return, and all those that we test suffer from poor performance in MuJoCo domains with neural network function approximation.
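To make the construction concrete, here is a minimal sketch under standard assumptions (a discounted MDP with reward function $r$, discount factor $\gamma \in [0,1)$, and a policy $\pi$); the abstract does not give the formal definitions, so the notation is illustrative. Let $d_\pi$ denote the normalized discounted state-action occupancy measure and define the per-step reward random variable $\hat{R}$ by sampling a state-action pair from $d_\pi$:

\[
d_\pi(s, a) \doteq (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t} \Pr(S_t = s, A_t = a \mid \pi),
\qquad
\hat{R} \doteq r(S, A), \quad (S, A) \sim d_\pi .
\]

A direct calculation then gives

\[
\mathbb{E}\big[\hat{R}\big] = (1 - \gamma)\, \mathbb{E}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \right],
\]

i.e., the expectation of the per-step reward equals the expectation of the episodic return up to the constant multiplier $1 - \gamma$; the accompanying variance bound is the relation stated in the abstract.

The "drop-in" claim can likewise be illustrated with a small sketch. The snippet below assumes a Fenchel-dual style decomposition of the mean-variance objective over the per-step reward, under which risk-averse control reduces to running any risk-neutral learner on a modified reward; the class name, the running-mean estimator, and the coefficient risk_coef are illustrative assumptions, not necessarily the paper's exact construction.

class RiskAverseWrapper:
    """Turns a risk-neutral learner into a risk-averse one by modifying rewards.

    Assumes the mean-variance objective E[r] - risk_coef * Var(r) over the
    per-step reward, rewritten via Fenchel duality so that, for a fixed dual
    variable y ~ E[r], the inner problem is risk-neutral control on a
    transformed reward (an assumption for illustration).
    """

    def __init__(self, risk_coef):
        self.risk_coef = risk_coef   # weight on the per-step reward variance
        self.mean_reward = 0.0       # running estimate y of E[r] (dual variable)
        self.step_count = 0

    def modified_reward(self, r):
        # r - lambda * r^2 + 2 * lambda * y * r: for fixed y, maximizing its
        # expectation corresponds to the inner problem of the dual objective.
        return r - self.risk_coef * r ** 2 + 2.0 * self.risk_coef * self.mean_reward * r

    def update_mean(self, r):
        # Incremental estimate of the average per-step reward (the dual update).
        self.step_count += 1
        self.mean_reward += (r - self.mean_reward) / self.step_count

An off-the-shelf risk-neutral learner (for example, a standard PPO implementation) would then be trained on modified_reward(r) in place of r, with update_mean(r) called on each observed reward; the rest of the training pipeline is left unchanged.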