What will the future be? We wonder! In this survey, we explore the gap b...
Neural rendering is fuelling a unification of learning, 3D geometry and ...
We propose and address a new generalisation problem: can a model trained...
We propose a novel multimodal video benchmark - the Perception Test - to...
This paper presents an investigation into long-tail video recognition. W...
We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations cap...
Current one-stage action detection methods, which simultaneously predict...
A key function of auditory cognition is the association of characteristi...
In this paper, we re-examine the task of cross-modal clip-sentence retri...
We introduce VISOR, a new dataset of pixel annotations and a benchmark s...
We propose a novel approach to multimodal sensor fusion for Ambient Assi...
In this paper, we evaluate state-of-the-art OCR methods on Egocentric da...
Early action prediction deals with inferring the ongoing action from par...
We introduce a segmentation-guided approach to synthesise images that in...
This paper proposes an interaction reasoning network for modelling spati...
We propose a Temporal Voting Network (TVNet) for action localization in ...
In egocentric videos, actions occur in quick succession. We capitalise o...
Given a gallery of uncaptioned video sequences, this paper considers the...
Current video retrieval efforts all found their evaluation on an instanc...
We propose a two-stream convolutional network for audio recognition, tha...
We propose a novel approach to few-shot action recognition, finding temp...
We propose a three-dimensional discrete and incremental scale to encode ...
Meta-learning approaches have addressed few-shot problems by finding ini...
Fine-grained action recognition datasets exhibit environmental bias, whe...
We present a method to learn a representation for adverbs from instructi...
Monitoring the progression of an action towards completion offers fine g...
We present the first fully automated Sit-to-Stand or Stand-to-Sit (StS) ...
We investigate video transforms that result in class-homogeneous label-t...
We focus on multi-modal fusion for egocentric action recognition, and pr...
We address the problem of cross-modal fine-grained action retrieval betw...
We benchmark contemporary action recognition models (TSN, TRN, and TSM) ...
This work introduces verb-only representations for both recognition and ...
Domain alignment in convolutional networks aims to learn the degree of l...
Recognising actions in videos relies on labelled supervision during trai...
We present a new model to determine relative skill from long videos, thr...
We propose a novel deep fusion architecture, CaloriNet, for the online e...
We present a deep person re-identification approach that combines semant...
The SPHERE project has developed a multi-modal sensor platform for healt...
We introduce completion moment detection for actions - the problem of lo...
This work introduces verb-only representations for actions and interacti...
First-person vision is gaining interest as it offers a unique viewpoint ...
Action completion detection is the problem of modelling the action's pro...
This paper presents a method for assessing skill of performance from vid...
Manual annotations of temporal bounds for object interactions (i.e. star...
This work deviates from easy-to-define class boundaries for object inter...
We present SEMBED, an approach for embedding an egocentric object intera...
We present a new framework for vision-based estimation of calorific expe...
Multiple human tracking (MHT) is a fundamental task in many computer vis...
This paper presents an unsupervised approach towards automatically extra...