Regret Distribution in Stochastic Bandits: Optimal Trade-off between Expectation and Tail Risk

04/10/2023
by David Simchi-Levi, et al.

We study the trade-off between expectation and tail risk for regret distributions in the stochastic multi-armed bandit problem. We fully characterize the interplay among three desired properties of policy design: worst-case optimality, instance-dependent consistency, and light-tailed risk. We show how the order of the expected regret exactly determines the decay rate of the regret tail probability in both the worst-case and the instance-dependent scenario. A novel policy is proposed that characterizes the optimal regret tail probability for any regret threshold. Concretely, for any given α∈[1/2, 1) and β∈[0, α], our policy achieves a worst-case expected regret of Õ(T^α) (we call it α-optimal) and an instance-dependent expected regret of Õ(T^β) (we call it β-consistent), while enjoying a probability of incurring an Õ(T^δ) regret (δ≥α in the worst-case scenario and δ≥β in the instance-dependent scenario) that decays exponentially in a polynomial of T. This decay rate is proved to be the best achievable. Moreover, we discover an intrinsic gap in the optimal tail rate under the instance-dependent scenario between the cases where the time horizon T is known a priori and where it is not; interestingly, in the worst-case scenario this gap disappears. Finally, we extend our policy design to (1) a stochastic multi-armed bandit setting with non-stationary baseline rewards and (2) a stochastic linear bandit setting. Our results reveal insights into the trade-off between regret expectation and regret tail risk in both the worst-case and the instance-dependent scenario, indicating that more sub-optimality and inconsistency leave room for lighter-tailed risk of incurring a large regret, and that knowing the planning horizon in advance can make a difference in alleviating tail risk.
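As a schematic restatement of the claimed tail guarantee (a sketch based only on the abstract above; the constant c > 0 and the polynomial exponent q > 0 are illustrative placeholders, not quantities taken from the paper), the light-tailed risk property can be written as

\[
  \Pr\!\left[ \mathrm{Regret}(T) \ge \tilde{O}(T^{\delta}) \right]
  \;\le\; \exp\!\left( -c\, T^{q} \right),
\]

where \(\delta \ge \alpha\) in the worst-case scenario and \(\delta \ge \beta\) in the instance-dependent scenario. Read this way, choosing a larger α (more worst-case sub-optimality) or a larger β (more inconsistency) enlarges the range of thresholds δ for which the exponential tail bound applies, which is the sense in which sub-optimality and inconsistency "leave space" for lighter-tailed risk.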
