Composed image retrieval (CIR) is a new and flexible image retrieval par...
The integration of emotional support into various conversational scenari...
This paper aims to tackle a novel task - Temporal Sentence Grounding in ...
In the text-to-image generation field, recent remarkable progress in Sta...
Multimodal recommendation exploits the rich multimodal information assoc...
Emotion distribution learning has gained increasing attention with the t...
Training an effective video action recognition model poses significant c...
The existing deepfake detection methods have reached a bottleneck in gen...
Existing work on Multimodal Sentiment Analysis (MSA) utilizes multimodal...
Multimodal Sarcasm Explanation (MuSE) is a new yet challenging task, whi...
Existing deepfake detection methods fail to generalize well to unseen or...
Textual response generation is an essential task for multimodal task-ori...
The composed image retrieval (CIR) task aims to retrieve the desired tar...
Existing data-to-text generation efforts mainly focus on generating a co...
Dialogue-based language models mark a huge milestone in the field of art...
For natural image matting, context information plays a crucial role in e...
The last decade has witnessed the proliferation of micro-videos on vario...
Under the flourishing development in performance, current image-text ret...
Visual Commonsense Reasoning (VCR) remains a significant yet challenging...
The booming development and huge market of micro-videos bring new e-comm...
Knowledge Graph (KG), as side information, tends to be utilized to sup...
Fake news often involves multimedia information such as text and image t...
Pre-trained Language Models (PLMs) which are trained on large text corpu...
Recommendation systems make predictions chiefly based on users' historic...
Recently, Deepfake has drawn considerable public attention due to securi...
Several studies have recently pointed out that existing Visual Question Answ...
Existing studies on multimodal sentiment analysis heavily rely on textua...
Visual Question Answering (VQA) is fundamentally compositional in nature...
Text response generation for multimodal task-oriented dialog systems, wh...
Relying on the premise that the performance of a binary neural network c...
Knowledge-based Visual Question Answering (VQA) expects models to rely o...
Recommender systems usually face the issue of filter bubbles: overrecomm...
In the past few years, cross-modal image-text retrieval (ITR) has experi...
Scene Graph Generation, which generally follows a regular encoder-decode...
Many multimodal recommender systems have been proposed to exploit the ri...
Detecting forgery videos is highly desirable due to the abuse of deepfak...
Logical reasoning is of vital importance to natural language understandi...
Making each modality in multi-modal data contribute is of vital importan...
Visual Commonsense Reasoning (VCR), deemed a challenging extension ...
With the remarkable success of deep learning recently, efficient network...
The ubiquity of implicit feedback makes it indispensable for building re...
Temporal Moment Localization (TML) in untrimmed videos is a challenging ...
This paper focuses on tackling the problem of temporal language localiza...
Knowledge distillation has become one of the most important model compre...
Personalization lies at the core of boosting the product search system p...
Recommending cold-start items is a long-standing and fundamental challen...
Utilizing review information to enhance recommendation, the de facto rev...
Pre-trained Language Models (PLMs) have achieved great success on Machin...
A number of studies point out that current Visual Question Answering (VQ...
In recent years, conversational agents have provided a natural and conve...