Consistency driven Sequential Transformers Attention Model for Partially Observable Scenes

04/01/2022
by   Samrudhdhi B Rangrej, et al.

Most hard attention models initially observe a complete scene to locate and sense informative glimpses, and predict the class label of the scene based on those glimpses. However, in many applications (e.g., aerial imaging), observing an entire scene is not always feasible due to the limited time and resources available for acquisition. In this paper, we develop a Sequential Transformers Attention Model (STAM) that only partially observes a complete image and predicts informative glimpse locations solely based on past glimpses. We design our agent using DeiT-distilled and train it with a one-step actor-critic algorithm. Furthermore, to improve classification performance, we introduce a novel training objective that enforces consistency between the class distribution predicted by a teacher model from a complete image and the class distribution predicted by our agent using glimpses. When the agent senses only 4% of the total image area, our training objective yields 3% higher accuracy on the evaluated datasets. Moreover, our agent outperforms the previous state-of-the-art while observing nearly 27% fewer pixels on ImageNet and fMoW.
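The consistency objective described above can be read as a distillation-style loss: the glimpse-based (student) class distribution is pulled toward the full-image (teacher) class distribution. A minimal sketch, assuming the consistency term is a KL divergence between the two softmax distributions (the function names and the exact divergence form are illustrative, not taken from the paper):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def consistency_loss(teacher_logits, student_logits):
    """KL(teacher || student): penalizes the glimpse-based (student)
    prediction for diverging from the teacher's full-image distribution.
    Hypothetical helper; the paper's exact objective may differ."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
```

In training, this term would be added to the usual classification loss, so the agent is rewarded both for predicting the correct label and for matching the teacher's softer class distribution even when it has seen only a few glimpses.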
