Cordelia Schmid
Research director
Composed Image Retrieval (CoIR) has recently gained popularity as a task...
The regression of 3D Human Pose and Shape (HPS) from an image is becomin...
While large-scale image-text pretrained models such as CLIP have been us...
Object goal navigation aims to navigate an agent to locations of a given...
Learning visuomotor policies in simulation is much safer and cheaper tha...
We aim to investigate whether end-to-end learning of visual reasoning ca...
We propose a new task and model for dense video object captioning – dete...
Current state-of-the-art video models process a video clip as a long seq...
In this paper, we propose an autonomous information seeking visual quest...
The visual classification performance of vision-language models such as ...
Contrastive image-text models such as CLIP form the building blocks of m...
We present a framework that formulates visual question answering as modu...
The ability to specify robot commands by a non-expert user is critical f...
The most performant spatio-temporal action localisation models use exter...
Signed distance functions (SDFs) are an attractive framework that has rec...
Understanding verbs is crucial to modelling how people and objects inter...
Physics simulation is ubiquitous in robotics. Whether in model-based app...
Retrieval augmented models are becoming increasingly popular for compute...
Cross-modal retrieval methods are the preferred tool to search databases...
Due to the high cost of collecting labels in multi-label classific...
Audiovisual automatic speech recognition (AV-ASR) aims to improve the ro...
In this work, we introduce Vid2Seq, a multi-modal single-stage dense eve...
One of the major challenges of machine translation (MT) is ambiguity, wh...
In this paper, we propose an end-to-end Retrieval-Augmented Visual Langu...
Can we leverage the audiovisual information already present in video to ...
Pixel-level labels are particularly expensive to acquire. Hence, pretrai...
This paper presents WALDO (WArping Layer-Decomposed Objects), a novel ap...
In this report, we describe our submission to the Ego4D AudioVisual (AV)...
Localizing objects in 3D scenes based on natural language requires under...
Observing a human demonstrator manipulate objects provides a rich, scala...
We study class-incremental learning, a training setup in which new class...
Reinforcement learning (RL) and trajectory optimization (TO) present str...
In human environments, robots are expected to accomplish a variety of ma...
In vision-and-language navigation (VLN), an embodied agent is required t...
YouTube users looking for instructions for a specific task may spend a l...
Recent work achieved impressive progress towards joint reconstruction of...
Transfer learning is the predominant paradigm for training deep networks...
In the past few years, following the differentiable programming paradigm...
This report describes the approach behind our winning solution to the 20...
Video question answering (VideoQA) is a complex task that requires diver...
Audio-visual automatic speech recognition (AV-ASR) is an extension of AS...
Recent methods for visual question answering rely on large-scale annotat...
Visual grounding localizes regions (boxes or segments) in the image corr...
This paper addresses the problem of copying an unknown assembly of primi...
A major challenge in text-video and text-audio retrieval is the lack of ...
We consider the problem of localizing a spatio-temporal tube in a video ...
Optimal control (OC) algorithms such as Differential Dynamic Programming...
Both a good understanding of geometrical concepts and a broad familiarit...
Following language instructions to navigate in unseen environments is a ...
Recent advances in deep learning have relied on large, labelled datasets...