Learning from Suboptimal Demonstration via Self-Supervised Reward Regression
Learning from Demonstration (LfD) seeks to democratize robotics by enabling non-roboticist end-users to teach robots to perform a task by providing a human demonstration. However, modern LfD techniques, such as inverse reinforcement learning (IRL), assume users provide at least stochastically optimal demonstrations. This assumption fails to hold in all but the most isolated, controlled scenarios, undermining the goal of empowering real end-users. Recent attempts to learn from suboptimal demonstration leverage pairwise rankings through Preference-based Reinforcement Learning (PbRL) to infer a policy superior to the demonstration. However, we show that these approaches rest on incorrect assumptions and, consequently, suffer from brittle, degraded performance. In this paper, we overcome the limitations of prior work by developing a novel computational technique that bootstraps suboptimal demonstrations to synthesize optimality-parameterized training data and uses this data to infer an idealized reward function. We empirically validate that we can learn an idealized reward function with ∼0.95 correlation with the ground-truth reward, versus only ∼0.75 for prior work. We can then train policies achieving ∼200% improvement over the suboptimal demonstration and ∼90% improvement over prior work. Finally, we present a real-world implementation for teaching a robot to hit a topspin shot in table tennis better than the user's demonstration.
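The central idea described above is to bootstrap the suboptimal demonstrations into optimality-parameterized training data and then regress a reward function against that data. The sketch below illustrates one way such a pipeline could look on a toy one-dimensional task: a suboptimal demonstrator policy is degraded with increasing action noise, each noise level is mapped to an average return, and a small network is trained so that its summed per-state rewards match those targets. The environment, `demo_policy`, the rollout routine, and the per-noise-level targets are all illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of noise-injected, self-supervised reward regression.
# All names (ToyEnv, demo_policy, reward_net) are hypothetical stand-ins.
import numpy as np
import torch
import torch.nn as nn

class ToyEnv:
    """Toy 1-D reach task: state is a position; the (hidden) goal is x = 0."""
    def reset(self):
        self.x = np.random.uniform(-1.0, 1.0)
        return np.array([self.x], dtype=np.float32)

    def step(self, action):
        self.x = float(np.clip(self.x + 0.1 * action, -2.0, 2.0))
        true_reward = -abs(self.x)  # hidden ground-truth reward, used only for targets
        return np.array([self.x], dtype=np.float32), true_reward

def demo_policy(state):
    """Imperfect demonstrator: moves toward the goal with a suboptimal gain."""
    return -0.7 * state[0]

def rollout(env, noise_std, horizon=30):
    """Roll out the demonstrator with injected Gaussian action noise."""
    s, states, ret = env.reset(), [], 0.0
    for _ in range(horizon):
        a = demo_policy(s) + np.random.normal(0.0, noise_std)
        s, r = env.step(a)
        states.append(s)
        ret += r
    return np.array(states), ret

# 1) Synthesize optimality-parameterized data: more noise => lower expected return.
noise_levels = np.linspace(0.0, 2.0, 11)
trajs = [(rollout(ToyEnv(), eps), eps) for eps in noise_levels for _ in range(20)]

# 2) Map each noise level to an average return (a crude stand-in for a fitted
#    noise-performance curve), giving every trajectory a return target.
returns_by_eps = {eps: np.mean([ret for (sts, ret), e in trajs if e == eps])
                  for eps in noise_levels}

# 3) Regress a per-state reward so summed predicted rewards match the targets.
reward_net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)
for epoch in range(100):
    total = 0.0
    for (states, _), eps in trajs:
        pred_return = reward_net(torch.as_tensor(states)).sum()
        target = torch.tensor(returns_by_eps[eps], dtype=torch.float32)
        loss = (pred_return - target) ** 2
        opt.zero_grad(); loss.backward(); opt.step()
        total += loss.item()
print("final regression loss:", total / len(trajs))
```

In this toy setting the learned reward tends to correlate with the hidden ground-truth reward even though no trajectory is optimal, which is the spirit of the abstract's claim; the paper itself should be consulted for the actual noise-performance model and training procedure.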