An Efficient Spatio-Temporal Pyramid Transformer for Action Detection

07/21/2022
by Yuetian Weng, et al.

The task of action detection aims at deducing both the action category and the localization of the start and end moments of each action instance in a long, untrimmed video. While vision Transformers have driven recent advances in video understanding, it is non-trivial to design an efficient architecture for action detection due to the prohibitively expensive self-attention over a long sequence of video clips. To this end, we present an efficient hierarchical Spatio-Temporal Pyramid Transformer (STPT) for action detection, building upon the observation that the early self-attention layers in Transformers still focus on local patterns. Specifically, we propose to use local window attention to encode rich local spatio-temporal representations in the early stages, while applying global attention modules to capture long-term space-time dependencies in the later stages. In this way, our STPT can encode both locality and dependency with largely reduced redundancy, delivering a promising trade-off between accuracy and efficiency. For example, with only RGB input, the proposed STPT achieves 53.6% mAP, performing favorably against the state-of-the-art AFSD that uses additional flow features while requiring 31% fewer GFLOPs, serving as an effective and efficient end-to-end Transformer-based framework for action detection.
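The key architectural idea above, local window attention in early stages versus global attention in later stages, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the window partitioning here is a simplified 1D version (non-overlapping windows over a flattened token sequence), intended only to show why restricting attention to windows cuts the quadratic cost:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over (n, d) token matrices.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def global_attention(x):
    # Late-stage style: every token attends to every token, O(n^2) in length.
    return attention(x, x, x)

def local_window_attention(x, window):
    # Early-stage style: tokens attend only within non-overlapping windows,
    # O(n * window) instead of O(n^2). A hypothetical simplification of STPT's
    # spatio-temporal windows.
    n, d = x.shape
    assert n % window == 0, "sequence length must be divisible by window size"
    out = np.empty_like(x)
    for start in range(0, n, window):
        blk = x[start:start + window]
        out[start:start + window] = attention(blk, blk, blk)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))  # 16 toy spatio-temporal tokens, dim 8
y_local = local_window_attention(x, window=4)  # early stages: local patterns
y_global = global_attention(x)                 # later stages: long-range dependency
```

Note that a window spanning the whole sequence degenerates to global attention, which is a handy sanity check for such a sketch; the efficiency gain comes from keeping the window small in the early, high-resolution stages.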
