In an era where images and visual content dominate our digital landscape...
Attaining a high degree of user controllability in visual generation oft...
The ability to assist humans during a navigation task in a supportive ro...
The field of text-to-image (T2I) generation has garnered significant att...
Existing automatic evaluation on text-to-image synthesis can only provid...
Diffusion models, such as Stable Diffusion, have shown incredible perfor...
Embodied agents have achieved prominent performance in following human i...
Pre-trained vision and language models such as CLIP have witnessed remar...
Despite the success of Transformer models in vision and language tasks, ...
The ability to accurately locate and navigate to a specific object is a ...
Large-scale diffusion models have achieved state-of-the-art results on t...
Federated embodied agent learning protects the data privacy of individua...
Contrastive Language-Image Pretraining (CLIP) has demonstrated great zer...
Prompt tuning is a new few-shot transfer learning technique that only tu...
Recent advances in text-to-image synthesis make it possible to visualize...
Vision-Language Navigation requires the agent to follow natural language...
Building a conversational embodied agent to execute real-life tasks has ...
Benefiting from language flexibility and compositionality, humans natura...
Language planning aims to implement complex high-level goals by decompos...
The ability to converse with humans and follow commands in natural langu...
Human brains integrate linguistic and perceptual information simultaneou...
In computer vision, it has achieved great success in adapting large-scal...
Data privacy is a central problem for embodied agents that can perceive ...
Temporal grounding in videos aims to localize one target video segment t...
A long-term goal of AI research is to build intelligent agents that can ...
Grounded video description (GVD) encourages captioning models to attend ...
The aim of gaze redirection is to manipulate the gaze in an image to the...
Automatic evaluations for natural language generation (NLG) conventional...
Most existing video-and-language (VidL) research focuses on a single dat...
Despite having promising results, style transfer, which requires prepari...
Video editing tools are widely used nowadays for digital design. Althoug...
Vision-and-language navigation (VLN) is a multimodal task where an agent...
Recent advances in language and vision push forward the research of capt...
A major challenge in visually grounded language generation is to build r...
Vision-and-Language Navigation (VLN) is a natural language grounding tas...
Iterative Language-Based Image Editing (IL-BIE) tasks follow iterative i...