Near-optimal Representation Learning for Linear Bandits and Linear RL
This paper studies representation learning for multi-task linear bandits and multi-task episodic RL with linear value function approximation. We first consider the setting where we play M linear bandits with dimension d concurrently, and these bandits share a common k-dimensional linear representation so that k≪ d and k ≪ M. We propose a sample-efficient algorithm, MTLR-OFUL, which leverages the shared representation to achieve Õ(M√(dkT) + d√(kMT) ) regret, with T being the number of total steps. Our regret significantly improves upon the baseline Õ(Md√(T)) achieved by solving each task independently. We further develop a lower bound that shows our regret is near-optimal when d > M. Furthermore, we extend the algorithm and analysis to multi-task episodic RL with linear value function approximation under low inherent Bellman error <cit.>. To the best of our knowledge, this is the first theoretical result that characterizes the benefits of multi-task representation learning for exploration in RL with function approximation.
READ FULL TEXT 
  
  
     share
 share