Reinforcement learning from human feedback (RLHF) is a technique for tra...
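As a concrete illustration of the training signal most RLHF pipelines rely on, the sketch below fits a reward model to pairwise human preference labels with a Bradley-Terry likelihood. It is a minimal sketch, not any particular paper's implementation; the toy feature vectors, dimensions, and learning rate are all made up for illustration.

```python
# Minimal sketch of the pairwise-preference reward-modeling step used in many
# RLHF pipelines (Bradley-Terry likelihood). The feature vectors stand in for
# learned response embeddings and are purely hypothetical.
import numpy as np

rng = np.random.default_rng(0)

dim = 8
# Toy data: each row pairs the features of a human-preferred response with
# the features of the rejected response from the same comparison.
preferred = rng.normal(loc=0.5, size=(256, dim))
rejected = rng.normal(loc=0.0, size=(256, dim))

w = np.zeros(dim)  # linear reward model r(x) = w @ x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Maximize log P(preferred > rejected) = log sigmoid(r(preferred) - r(rejected))
lr = 0.1
for _ in range(500):
    margin = (preferred - rejected) @ w  # r(preferred) - r(rejected)
    grad = ((sigmoid(margin) - 1.0)[:, None] * (preferred - rejected)).mean(axis=0)
    w -= lr * grad  # gradient step on the negative log-likelihood

accuracy = ((preferred - rejected) @ w > 0).mean()
print(f"fraction of training pairs ranked correctly: {accuracy:.2f}")
```

The learned reward model would then be used as the optimization target for an RL step (e.g. a policy-gradient method), which is the part of the pipeline this sketch leaves out.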
Research in Fairness, Accountability, Transparency, and Ethics (FATE) ha...
We provide the first formal definition of reward hacking, a phenomenon w...
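The definition itself is not reproduced here, but one common way to make "reward hacking" precise is in terms of a proxy/true reward pair: the proxy is hackable when improving proxy return can strictly reduce true return. The display below is a sketch of such a formalization, not necessarily the exact one given in the cited work; $J_R(\pi)$ denotes the expected return of policy $\pi$ under reward function $R$, and $\Pi$ is a fixed set of policies.

```latex
% Sketch: R_proxy is hackable with respect to R_true over a policy set \Pi
% if some policy change that increases proxy return decreases true return.
\[
\exists\, \pi, \pi' \in \Pi:\quad
J_{R_{\text{proxy}}}(\pi') > J_{R_{\text{proxy}}}(\pi)
\;\;\text{and}\;\;
J_{R_{\text{true}}}(\pi') < J_{R_{\text{true}}}(\pi).
\]
```

Under a sketch like this, "unhackable" proxies are those for which no such pair of policies exists, i.e. proxy improvements never come at the expense of true return.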
Given two sources of evidence about a latent variable, one can combine t...
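A minimal numerical sketch of one standard way to do this combination: treat the two sources as conditionally independent given the latent variable, multiply their likelihoods (equivalently, add log-likelihoods), and renormalize. The discrete latent values, prior, and likelihood tables below are hypothetical.

```python
# Minimal sketch: combining two independent sources of evidence about a
# discrete latent variable by multiplying likelihoods and renormalizing.
# The latent values, prior, and likelihood tables are hypothetical.
import numpy as np

latent_values = np.array(["A", "B", "C"])
prior = np.array([1 / 3, 1 / 3, 1 / 3])

# P(evidence_i | latent) for each source, one entry per latent value.
likelihood_source_1 = np.array([0.7, 0.2, 0.1])
likelihood_source_2 = np.array([0.6, 0.3, 0.1])

# Assuming the sources are conditionally independent given the latent,
# the joint likelihood is the product of the individual likelihoods.
posterior = prior * likelihood_source_1 * likelihood_source_2
posterior /= posterior.sum()

for value, p in zip(latent_values, posterior):
    print(f"P(latent={value} | both sources) = {p:.3f}")
```

The conditional-independence assumption is what licenses the simple product; when the two sources are correlated, multiplying likelihoods double-counts the shared information.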
Reinforcement learning (RL) agents optimize only the features specified ...
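As a toy illustration of that indifference, the snippet below scores two trajectories with a linear reward over hand-picked features; because the reward designer assigned no weight to the "vase_broken" feature, the reward cannot distinguish the trajectory that breaks the vase from the one that does not. All feature names and numbers are invented for illustration.

```python
# Toy illustration: a linear reward over specified features is indifferent to
# any feature it leaves out. Feature names and values are hypothetical.
import numpy as np

feature_names = ["distance_to_goal_reduced", "time_penalty", "vase_broken"]

# The designer specified goal progress and a time cost, but said nothing
# about the vase, so its weight is implicitly zero.
reward_weights = np.array([1.0, -0.1, 0.0])

careful_trajectory = np.array([1.0, 5.0, 0.0])   # reaches the goal, avoids the vase
careless_trajectory = np.array([1.0, 5.0, 1.0])  # reaches the goal, breaks the vase

for name, phi in [("careful", careful_trajectory), ("careless", careless_trajectory)]:
    print(f"{name}: reward = {reward_weights @ phi:+.2f}")
# Both trajectories receive identical reward: the unspecified feature has no
# effect on the objective, so the agent has no reason to avoid breaking the vase.
```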