On Submodular Contextual Bandits

12/03/2021
βˆ™
by   Dean P. Foster, et al.
βˆ™
5
βˆ™

We consider the problem of contextual bandits where actions are subsets of a ground set and mean rewards are modeled by an unknown monotone submodular function that belongs to a class β„±. We allow time-varying matroid constraints to be placed on the feasible sets. Assuming access to an online regression oracle with regret 𝖱𝖾𝗀(β„±), our algorithm efficiently randomizes around local optima of estimated functions according to the Inverse Gap Weighting strategy. We show that cumulative regret of this procedure with time horizon n scales as O(√(n 𝖱𝖾𝗀(β„±))) against a benchmark with a multiplicative factor 1/2. On the other hand, using the techniques of (Filmus and Ward 2014), we show that an Ο΅-Greedy procedure with local randomization attains regret of O(n^2/3𝖱𝖾𝗀(β„±)^1/3) against a stronger (1-e^-1) benchmark.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset