Despite their success in real data synthesis, diffusion models (DMs) oft...
End-to-end training with global optimization has popularized graph neur...
This paper addresses temporal sentence grounding (TSG). Although exi...
Despite its success in image synthesis, we observe that diffusion probab...
Given an untrimmed video, temporal sentence grounding (TSG) aims to loca...
While the use of graph-structured data in various fields is becoming inc...
Temporal sentence localization in videos (TSLV) aims to retrieve the mos...
We propose to perform video question answering (VideoQA) in a Contrastiv...
Temporal sentence grounding (TSG) aims to localize the temporal segment ...
The rise of pre-trained unified foundation models breaks down the barrie...
Given an untrimmed video, temporal sentence localization (TSL) aims to l...
Temporal sentence grounding (TSG) aims to identify the temporal boundary...
Vision-Language Pre-Training (VLP) has shown promising capabilities to a...
Distantly-Supervised Named Entity Recognition (DS-NER) effectively allev...
Although increasingly training-expensive, most self-supervised learning ...
For long-tailed classification, most works pretrain a big model on...
As an increasingly popular task in multimedia information retrieval, vid...
This paper studies the multimedia problem of temporal sentence grounding...
Adaptive gradient algorithms borrow the moving average idea of heavy bal...
Crowd counting is a regression task that estimates the number of people ...
This paper proposes a Video Graph Transformer (VGT) model for Video Quet...
Spatial-Temporal Video Grounding (STVG) is a challenging task which aims...
For unsupervised pretraining, mask-reconstruction pretraining (MRP) appr...
Graph neural networks (GNNs) have achieved state-of-the-art performance ...
In self-supervised learning, multi-granular features are heavily desired...
The few-shot learning ability of vision transformers (ViTs) is rarely in...
End-to-end (E2E) speech recognition architectures assemble all component...
Temporal video grounding (TVG) aims to localize a target segment in a vi...
Temporal sentence grounding (TSG) is crucial and fundamental for video u...
While transformers have shown great potential on video recognition tasks...
Natural language video localization (NLVL) is an important task in the v...
Unsupervised domain adaptive person re-identification has received signi...
Transformers have shown great potential in computer vision tasks. A comm...
Unifying acoustic and linguistic representation learning has become incr...
A key solution to temporal sentence grounding (TSG) lies in how to lea...
We address the problem of temporal sentence localization in videos (TSLV...
For an image query, unsupervised contrastive learning labels crops of th...
Graph-level representations are critical in various real-world applicati...
The consistency of a response to a given post at semantic-level and emot...
Most existing named entity recognition (NER) approaches are based on seq...
Though deep neural network models exhibit outstanding performance for va...
Crowd counting has drawn much attention due to its importance in safety-...
Neural architecture search (NAS) has been successfully applied to tasks ...
This paper addresses the problem of temporal sentence grounding (TSG), w...
Deep learning techniques have achieved remarkable performance in wide-ra...
Human doctors with well-structured medical knowledge can diagnose a dise...
Low-resource automatic speech recognition (ASR) is challenging, as the l...
Although deep learning based methods have achieved great progress in uns...
Real data often appear in the form of multiple incomplete views. Incompl...
Due to the huge commercial interests behind online reviews, a tremendous...