While Large Language Models (LLMs) can solve many NLP tasks in zero-shot...
Current pre-trained vison-language models (PVLMs) achieve excellent
perf...
Characters are essential to the plot of any story. Establishing the
char...
The goal of this work is to understand the way actions are performed in
...
Large Language Models (LLMs) handle physical commonsense information
ina...
Coreference resolution aims at identifying words and phrases which refer...
Understanding the steps required to perform a task is an important skill...
We address the problem of data augmentation for video action recognition...
Scene graph generation (SGG) aims to capture a wide variety of interacti...
Movie trailers perform multiple functions: they introduce viewers to the...
Recent language models can generate interesting and grammatically correc...
Measuring event salience is essential in the understanding of stories. T...
Zero-shot action recognition is the task of classifying action categorie...
Zero-shot action recognition is the task of recognizing action classes
w...
We summarize full-length movies by creating shorter videos containing th...
Transformer-based pre-trained language models (PLMs) have dramatically
i...
Suspense is a crucial ingredient of narrative fiction, engaging readers ...
Most general-purpose extractive summarization models are trained on news...
According to screenwriting theory, turning points (e.g., change of plans...
Recently, there has been an increasing interest in unsupervised parsers ...
Recent work has shown that visual context improves cross-lingual sense
d...
Intuitively, human readers cope easily with errors in text; typos,
missp...
Dependency grammar induction is the task of learning dependency syntax
w...
Humans read by making a sequence of fixations and saccades. They often s...
Manually annotating object bounding boxes is central to building compute...
In this paper we propose a model to learn multimodal multilingual
repres...
A large amount of recent research has focused on tasks that combine lang...
When humans read text, they fixate some words and skip others. However, ...
We introduce a new task, visual sense disambiguation for verbs: given an...
Training object class detectors typically requires a large set of images...
Automatic description generation from natural images is a challenging pr...