Video-language pre-training is crucial for learning powerful multi-modal...
This paper revisits visual representation in knowledge-based visual ques...
Visual grounding, i.e., localizing objects in images according to natura...
Recent works have shown that the computational efficiency of video
recog...
Self-supervised learning has been successfully applied to pre-train vide...