ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System

04/27/2023

∙

Existing deep video models are limited by specific tasks, fixed input-output spaces, and poor generalization capabilities, making it difficult to deploy them in real-world scenarios. In this paper, we present our vision for multimodal and versatile video understanding and propose a prototype system, . Our system is built upon a tracklet-centric paradigm, which treats tracklets as the basic video unit and employs various Video Foundation Models (ViFMs) to annotate their properties e.g., appearance, motion, . All the detected tracklets are stored in a database and interact with the user through a database manager. We have conducted extensive case studies on different types of in-the-wild videos, which demonstrates the effectiveness of our method in answering various video-related problems. Our project is available at https://www.wangjunke.info/ChatVideo/

READ FULL TEXT

ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System

Sign in with Google

Consider DeepAI Pro