JCDNet: Joint of Common and Definite phases Network for Weakly Supervised Temporal Action Localization
Weakly-supervised temporal action localization aims to localize action instances in untrimmed videos with only video-level supervision. We witness that different actions record common phases, e.g., the run-up in the HighJump and LongJump. These different actions are defined as conjoint actions, whose rest parts are definite phases, e.g., leaping over the bar in a HighJump. Compared with the common phases, the definite phases are more easily localized in existing researches. Most of them formulate this task as a Multiple Instance Learning paradigm, in which the common phases are tended to be confused with the background, and affect the localization completeness of the conjoint actions. To tackle this challenge, we propose a Joint of Common and Definite phases Network (JCDNet) by improving feature discriminability of the conjoint actions. Specifically, we design a Class-Aware Discriminative module to enhance the contribution of the common phases in classification by the guidance of the coarse definite-phase features. Besides, we introduce a temporal attention module to learn robust action-ness scores via modeling temporal dependencies, distinguishing the common phases from the background. Extensive experiments on three datasets (THUMOS14, ActivityNetv1.2, and a conjoint-action subset) demonstrate that JCDNet achieves competitive performance against the state-of-the-art methods. Keywords: weakly-supervised learning, temporal action localization, conjoint action
READ FULL TEXT