MNL-Bandit with Knapsacks
We consider a dynamic assortment selection problem where a seller has a fixed inventory of N substitutable products and faces an unknown demand that arrives sequentially over T periods. In each period, the seller needs to decide on the assortment of products (of cardinality at most K) to offer to the customers. The customer's response follows an unknown multinomial logit model (MNL) with parameters v. The goal of the seller is to maximize the total expected revenue given the fixed initial inventory of N products. We give a policy that achieves a regret of Õ(K √(K N T)(1 + √(v_max)/q_minOPT) ) under a mild assumption on the model parameters. In particular, our policy achieves a near-optimal Õ(√(T)) regret in the large inventory setting. Our policy builds upon the UCB-based approach for MNL-bandit without inventory constraints in [1] and addresses the inventory constraints through an exponentially sized LP for which we present a tractable approximation while keeping the Õ(√(T)) regret bound.
READ FULL TEXT