Exploiting Semantic Contextualization for Interpretation of Human Activity in Videos
We use large-scale commonsense knowledge bases, e.g. ConceptNet, to provide context cues that establish semantic relationships among entities directly hypothesized from the video signal, such as putative object and action labels, and to infer a deeper interpretation of events than what is directly sensed. One common approach is to learn semantic relationships between objects and actions from training annotations of videos; such methods depend largely on the vocabulary statistics of those annotations. The use of prior encoded commonsense knowledge sources alleviates this dependence on large annotated training datasets. We represent interpretations as a connected structure of basic detected (grounded) concepts, such as objects and actions, that are bound by semantics to other background concepts not directly observed, i.e. contextualization cues. We express this mathematically in the language of Grenander's pattern generator theory: concepts are the basic generators, and the bonds between them are defined by the semantic relationships among concepts. We formulate an inference engine based on energy minimization, using an efficient Markov chain Monte Carlo sampler that exploits ConceptNet in its move proposals to find these structures. On three publicly available datasets, Breakfast, CMU Kitchen and MSVD, whose space of possible interpretations spans more than 150,000 possible solutions over more than 5,000 videos, we show that the proposed model generates video interpretations whose quality is comparable to or better than those reported by discriminative approaches, hidden Markov models, context-free grammars, deep learning models, and prior pattern theory approaches, all of which rely on learning from domain-specific training data.
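To make the inference scheme concrete, the following is a minimal sketch (not the paper's implementation) of energy minimization over an interpretation via a Metropolis sampler. It assumes a hypothetical toy relatedness table `RELATEDNESS` and candidate pool `CONTEXT_CANDIDATES` as stand-ins for real ConceptNet queries; the energy simply rewards strong semantic bonds between the grounded generators (detected objects/actions) and the ungrounded context generators.

```python
import math
import random

# Hypothetical stand-in for ConceptNet: pairwise relatedness scores and a
# pool of candidate context concepts. In the paper these would come from
# ConceptNet queries, not a hand-made dictionary.
RELATEDNESS = {
    frozenset({"knife", "cut"}): 0.9,
    frozenset({"knife", "bread"}): 0.6,
    frozenset({"cut", "bread"}): 0.7,
    frozenset({"bread", "breakfast"}): 0.8,
    frozenset({"cut", "breakfast"}): 0.3,
    frozenset({"knife", "breakfast"}): 0.2,
    frozenset({"bread", "kitchen"}): 0.5,
    frozenset({"cut", "kitchen"}): 0.4,
    frozenset({"knife", "kitchen"}): 0.6,
}
CONTEXT_CANDIDATES = ["breakfast", "kitchen"]


def bond_strength(a: str, b: str) -> float:
    """Semantic bond between two generators (0.0 if unrelated)."""
    return RELATEDNESS.get(frozenset({a, b}), 0.0)


def energy(grounded: list, context: list) -> float:
    """Configuration energy: lower when grounded generators (detections)
    and ungrounded context generators are tied by strong semantic bonds."""
    concepts = grounded + context
    total = 0.0
    for i in range(len(concepts)):
        for j in range(i + 1, len(concepts)):
            total -= bond_strength(concepts[i], concepts[j])
    return total


def mcmc_interpret(grounded, n_context=1, steps=2000, temperature=0.5, seed=0):
    """Metropolis sampler over context-concept assignments. Move proposals
    draw replacement context concepts from the knowledge base's candidate
    pool, loosely echoing how ConceptNet informs the proposals."""
    rng = random.Random(seed)
    context = rng.sample(CONTEXT_CANDIDATES, n_context)
    e = energy(grounded, context)
    best, best_e = list(context), e
    for _ in range(steps):
        proposal = list(context)
        proposal[rng.randrange(n_context)] = rng.choice(CONTEXT_CANDIDATES)
        e_new = energy(grounded, proposal)
        # Metropolis acceptance: always take downhill moves, sometimes uphill.
        if e_new <= e or rng.random() < math.exp((e - e_new) / temperature):
            context, e = proposal, e_new
            if e < best_e:
                best, best_e = list(context), e
    return best, best_e


if __name__ == "__main__":
    detections = ["knife", "cut", "bread"]  # grounded generators from video
    ctx, e = mcmc_interpret(detections)
    print(f"inferred context: {ctx}  (energy {e:.2f})")
```

With the toy table above, the sampler settles on "breakfast" as the background concept binding "knife", "cut", and "bread", illustrating how unobserved context emerges from commonsense bonds rather than from annotated training data.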