Stories in the Eye: Contextual Visual Interactions for Efficient Video to Language Translation
Integrating higher level visual and linguistic interpretations is at the heart of human intelligence. As automatic visual category recognition in images is approaching human performance, the high level understanding in the dynamic spatiotemporal domain of videos and its translation into natural language is still far from being solved. While most works on vision-to-text translations use pre-learned or pre-established computational linguistic models, in this paper we present an approach that uses vision alone to efficiently learn how to translate into language the video content. We discover, in simple form, the story played by main actors, while using only visual cues for representing objects and their interactions. Our method learns in a hierarchical manner higher level representations for recognizing subjects, actions and objects involved, their relevant contextual background and their interaction to one another over time. We have a three stage approach: first we take in consideration features of the individual entities at the local level of appearance, then we consider the relationship between these objects and actions and their video background, and third, we consider their spatiotemporal relations as inputs to classifiers at the highest level of interpretation. Thus, our approach finds a coherent linguistic description of videos in the form of a subject, verb and object based on their role played in the overall visual story learned directly from training data, without using a known language model. We test the efficiency of our approach on a large scale dataset containing YouTube clips taken in the wild and demonstrate state-of-the-art performance, often superior to current approaches that use more complex, pre-learned linguistic knowledge.
READ FULL TEXT