On Submodular Contextual Bandits
We consider the problem of contextual bandits where actions are subsets of a ground set and mean rewards are modeled by an unknown monotone submodular function that belongs to a class ℱ. We allow time-varying matroid constraints to be placed on the feasible sets. Assuming access to an online regression oracle with regret Reg(ℱ), our algorithm efficiently randomizes around local optima of estimated functions according to the Inverse Gap Weighting strategy. We show that the cumulative regret of this procedure with time horizon n scales as O(√(n·Reg(ℱ))) against a benchmark with a multiplicative factor 1/2. On the other hand, using the techniques of (Filmus and Ward 2014), we show that an ϵ-Greedy procedure with local randomization attains regret of O(n^{2/3}·Reg(ℱ)^{1/3}) against a stronger (1 − e^{−1}) benchmark.
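The Inverse Gap Weighting strategy mentioned above can be sketched for a finite set of feasible actions: each suboptimal action is selected with probability inversely proportional to its estimated reward gap from the empirical best action. This is a minimal illustration, not the paper's algorithm; the function name and the exploration parameter gamma are assumptions for exposition.

```python
import numpy as np

def igw_distribution(estimated_rewards, gamma):
    """Inverse Gap Weighting over a finite action set (illustrative sketch).

    Each suboptimal action a gets probability 1 / (K + gamma * gap(a)),
    where gap(a) is its estimated reward gap from the best action;
    the remaining mass goes to the empirically best action.
    """
    r = np.asarray(estimated_rewards, dtype=float)
    K = len(r)
    best = int(np.argmax(r))
    p = np.zeros(K)
    for a in range(K):
        if a != best:
            p[a] = 1.0 / (K + gamma * (r[best] - r[a]))
    # Best action receives the leftover probability mass.
    p[best] = 1.0 - p.sum()
    return p
```

Larger gamma concentrates more mass on the estimated best action (exploitation), while gamma = 0 recovers the uniform distribution; tuning gamma against the oracle's regret drives the O(√(n·Reg(ℱ))) bound in this style of analysis.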