Task-Relevant Object Discovery and Categorization for Playing First-person Shooter Games
We consider the problem of learning to play first-person shooter (FPS) video games using raw screen images as observations and keyboard inputs as actions. The high dimensionality of the observations in this type of application leads to prohibitive training-data requirements for model-free methods such as the deep Q-network (DQN) and its recurrent variant, DRQN. Recent work has therefore focused on learning low-dimensional representations that may reduce the need for data. This paper presents a new and efficient method for learning such representations. Salient segments of consecutive frames are detected from their optical flow and clustered based on their feature descriptors. The clusters typically correspond to different discovered categories of objects. Segments detected in new frames are then classified according to their nearest clusters. Because only a few categories are relevant to a given task, the importance of a category is defined as the correlation between its occurrence and the agent's performance. The result is encoded as a vector indicating which objects are in the frame and where they are located, and is used as a side input to DRQN. Experiments on the game Doom provide good evidence for the benefit of this approach.
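The later stages of the pipeline described above (nearest-cluster classification, correlation-based importance, and the side-input vector) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that segment descriptors and cluster centroids are already available (the optical-flow segmentation and clustering steps are omitted), that "correlation" means the Pearson coefficient between a per-episode occurrence indicator and the episode score, and that locations are encoded on a coarse grid. The function names, the grid layout, and the toy data are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def nearest_cluster(descriptor: Sequence[float],
                    centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest (Euclidean) to the descriptor."""
    return min(range(len(centroids)),
               key=lambda k: math.dist(descriptor, centroids[k]))


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def category_importance(occurrence: Sequence[Sequence[int]],
                        performance: Sequence[float]) -> List[float]:
    """occurrence[e][k] is 1 if category k appeared in episode e (assumed format)."""
    n_categories = len(occurrence[0])
    return [pearson([row[k] for row in occurrence], performance)
            for k in range(n_categories)]


def encode_frame(segments: Sequence[Tuple[Sequence[float], Point]],
                 centroids: Sequence[Sequence[float]],
                 relevant: Sequence[int],
                 grid: Tuple[int, int] = (2, 2)) -> List[float]:
    """Build a binary vector with one grid of cells per relevant category.

    Each segment is (descriptor, (x, y)) with x and y normalized to [0, 1].
    """
    rows, cols = grid
    cells = rows * cols
    vec = [0.0] * (len(relevant) * cells)
    for descriptor, (x, y) in segments:
        k = nearest_cluster(descriptor, centroids)
        if k in relevant:
            slot = list(relevant).index(k)
            cell = min(int(y * rows), rows - 1) * cols + min(int(x * cols), cols - 1)
            vec[slot * cells + cell] = 1.0
    return vec


# Toy data: two categories, four episodes with their scores.
centroids = [(0.0, 0.0), (10.0, 10.0)]
occurrence = [[1, 0], [1, 1], [0, 1], [0, 0]]
performance = [10.0, 12.0, 2.0, 1.0]
importance = category_importance(occurrence, performance)

# Keep only the category whose occurrence correlates most with performance.
relevant = [max(range(len(importance)), key=lambda k: importance[k])]
segments = [((1.0, 1.0), (0.1, 0.1)), ((9.0, 9.0), (0.9, 0.9))]
side_input = encode_frame(segments, centroids, relevant)
```

In this toy setting the resulting `side_input` would be concatenated with the image features fed to the recurrent Q-network. How the side input is actually combined with DRQN is not specified in the abstract.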