Counterfactual harm
To act safely and ethically in the real world, agents must be able to reason about harm and avoid harmful actions. In this paper we develop the first statistical definition of harm and a framework for incorporating harm into algorithmic decisions. We argue that harm is fundamentally a counterfactual quantity, and show that standard machine learning algorithms that cannot perform counterfactual reasoning are guaranteed to pursue harmful policies in certain environments. To resolve this, we derive a family of counterfactual objective functions that robustly mitigate harm. We demonstrate our approach with a statistical model for identifying optimal drug doses. While standard algorithms that select doses using causal treatment effects result in significant harm, our counterfactual algorithm identifies doses that are substantially less harmful without sacrificing efficacy. Our results show that counterfactual reasoning is a key ingredient for safe and ethical AI.
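The idea of a harm-penalized counterfactual objective can be illustrated with a small sketch. Everything below is a hypothetical toy model, not the paper's actual model or data: we assume a simple structural causal model in which a latent patient sensitivity `U` determines each patient's response to a dose, so the outcome under the chosen dose and under a default (no-treatment) action can be compared counterfactually on the same `U`. The harm weight `lam` and the utility function are also assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical SCM: latent patient sensitivity U ~ Uniform(0, 1).
# The same U value generates both the factual and counterfactual outcomes,
# which is what makes a counterfactual (patient-by-patient) comparison possible.
U = rng.uniform(size=100_000)

def utility(dose, u):
    # Toy utility: benefit grows with dose, but overdosing a sensitive
    # patient (dose exceeding their tolerance 1 - u) incurs a penalty.
    return dose * (1.0 - u) - 2.0 * np.maximum(0.0, dose - (1.0 - u))

default_dose = 0.0  # default action: no treatment

def expected_utility(d):
    return utility(d, U).mean()

def counterfactual_harm(d):
    # Harm = expected shortfall of the chosen dose relative to the default,
    # paired on the same latent U (a counterfactual, not an average, contrast).
    return np.maximum(0.0, utility(default_dose, U) - utility(d, U)).mean()

lam = 2.0  # harm-aversion weight (assumed)
doses = np.linspace(0.0, 1.0, 21)

# Standard objective: maximize expected utility alone.
best_standard = max(doses, key=expected_utility)

# Harm-penalized counterfactual objective.
best_counterfactual = max(
    doses, key=lambda d: expected_utility(d) - lam * counterfactual_harm(d)
)
```

In this toy setting the harm-penalized objective selects a lower dose than the plain expected-utility objective, because averaging over patients hides the subpopulation that a higher dose makes worse off than no treatment, while the counterfactual pairing exposes it.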