Nonstochastic Multiarmed Bandits with Unrestricted Delays

06/03/2019
by   Tobias Sommer Thune, et al.
0

We investigate multiarmed bandits with delayed feedback, where the delays need neither be identical nor bounded. We first prove that the "delayed" Exp3 achieves the O(√((KT + D) K)) regret bound conjectured by Cesa-Bianchi et al. [2016], in the case of variable, but bounded delays. Here, K is the number of actions and D is the total delay over T rounds. We then introduce a new algorithm that lifts the requirement of bounded delays by using a wrapper that skips rounds with excessively large delays. The new algorithm maintains the same regret bound, but similar to its predecessor requires prior knowledge of D and T. For this algorithm we then construct a novel doubling scheme that forgoes this requirement under the assumption that the delays are available at action time (rather than at loss observation time). This assumption is satisfied in a broad range of applications, including interaction with servers and service providers. The resulting oracle regret bound is of order _β (|S_β|+β K + (KT + D_β)/β), where |S_β| is the number of observations with delay exceeding β, and D_β is the total delay of observations with delay below β. The bound relaxes to O(√((KT + D) K)), but we also provide examples where D_β≪ D and the oracle bound has a polynomially better dependence on the problem parameters.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset