In 3D human shape and pose estimation from a monocular video, models tra...
Learning a recommender system model from an item's raw modality features...
Recommender systems (RS) have achieved significant success by leveraging...
We conduct the first empirical study on using knowledge transfer to impr...
The emergence of foundation models, such as large language models (LLMs)...
Generative Pre-trained Transformer (GPT) models have exhibited exciting...
Policy optimization methods are powerful algorithms in Reinforcement Lea...
Foundation models, such as GPT-4 and DALL-E, have brought unprecedented AI...
Generalization to unseen tasks is an important ability for few-shot lear...
Text-based collaborative filtering (TCF) has become the mainstream appro...
Sensitivity to severe occlusion and large view angles limits the usages...
This paper addresses temporal sentence grounding (TSG). Although exi...
Fine-tuning large pre-trained language models on downstream tasks has be...
Sparse coding refers to modeling a signal as sparse linear combinations ...
Given an untrimmed video, temporal sentence localization (TSL) aims to l...
Temporal sentence grounding (TSG) aims to identify the temporal boundary...
In this article, we consider the problem of estimating fractional proces...
Multi-task learning (MTL) encapsulates multiple learned tasks in a singl...
Diffusion models (DMs) have shown great potential for high-quality image...
In multi-person 2D pose estimation, the bottom-up methods simultaneously...
Large-scale multi-modal contrastive pre-training has demonstrated great...
Crowd counting is a regression task that estimates the number of people ...
To enhance the scalability and performance of the traditional finite-dif...
Subword tokenization schemes are the dominant technique used in current ...
We consider the problem of planning with participation constraints intro...
Recent studies show that pre-trained language models (LMs) are vulnerabl...
Monocular 3D human pose estimation has made progress in recent years. Mo...
Vision transformers (ViTs) have gained increasing popularity as they are...
As a main use case of 5G and Beyond wireless networks, the ever-increasin...
A summation-by-parts simultaneous approximation term (SBP-SAT) finite-di...
Temporal video grounding (TVG) aims to localize a target segment in a vi...
Temporal sentence grounding (TSG) is crucial and fundamental for video u...
Large-scale pre-trained language models have achieved tremendous success...
Gigantic pre-trained models have become central to natural language proc...
We propose a provably stable FDTD subgridding method for accurate and e...
Large pretrained vision-language (VL) models can learn a new task with a...
Recent works have focused on compressing pre-trained language models (PL...
We explore the connection between outlier-robust high-dimensional statis...
Most existing video-and-language (VidL) research focuses on a single dat...
Vision transformers (ViTs) have recently received explosive popularity, ...
We study the problem of learning Bayesian networks where an ϵ-fraction o...
Large-scale transformer-based pre-training has recently revolutionized v...
We study the problem of automated mechanism design with partial verifica...
Vision-and-language pre-training has achieved impressive success in lear...
This work concerns video-language pre-training and representation learni...
The Lottery Ticket Hypothesis has drawn keen attention to identifying sparse tr...
Recent advances in computer vision take advantage of adversarial data au...
This paper addresses the problem of temporal sentence grounding (TSG), w...
Few-shot text classification is a fundamental NLP task in which a model ...
Training generative adversarial networks (GANs) with limited data genera...