Point-supervised Temporal Action Localization (PSTAL) is an emerging res...
Relational Language-Image Pre-training (RLIP) aims to align vision repre...
This paper introduces ModelScopeT2V, a text-to-video synthesis model tha...
Event forecasting has been a demanding and challenging task throughout t...
Multi-modal recommendation systems, which integrate diverse types of inf...
This paper identifies two kinds of redundancy in the current VideoQA par...
Cross-domain few-shot classification (CD-FSC) aims to identify novel tar...
This paper strives to solve complex video question answering (VideoQA) w...
Discovering causal structure from purely observational data (i.e., causa...
The pursuit of controllability as a higher standard of visual content cr...
With the greater emphasis on privacy and security in our society, the pr...
With the accelerated adoption of end-to-end encryption, there is an oppo...
Current state-of-the-art approaches for few-shot action recognition achi...
Under stringent model type and variable distribution assumptions, differ...
Learning from large-scale contrastive language-image pre-training like C...
Collaborative Filtering (CF) models, despite their great success, suffer...
Knowledge Graph (KG), as side information, tends to be utilized to sup...
Visual anomaly detection plays a crucial role in not only manufacturing ...
Deep Convolutional Neural Networks (DCNNs) have exhibited impressive per...
Face super-resolution is a domain-specific image super-resolution, which...
Out-of-distribution (OOD) generalization on graphs is drawing widespread...
Recent incremental learning for action recognition usually stores repres...
Mixup is a data augmentation technique that relies on training using ran...
V. Levenshtein first proposed the sequence reconstruction problem in 200...
In recent years, head-mounted near-eye display devices have become the k...
Collaborative filtering (CF) models easily suffer from popularity bias, ...
Recently, researchers observed that gradient descent for deep neural net...
Rice is one of the main staple foods in many areas of the world. The qual...
Monotonic linear interpolation (MLI) - on the line connecting a random i...
Video Question Answering (VideoQA) is the task of answering the natural ...
Standard approaches for video recognition usually operate on the full in...
For the past 25 years, we have witnessed an extensive application of Mac...
This technical report presents our first place winning solution for temp...
Leading graph contrastive learning (GCL) methods perform graph augmentat...
Video Question Answering (VideoQA) is the task of answering questions ab...
Bundle recommendation aims to recommend a bundle of related items to use...
Learning causal structure from observational data is a fundamental chall...
Unsupervised anomaly detection and localization, as one of the most prac...
A family of quadratic finite volume method (FVM) schemes is constructed...
One compelling application of artificial intelligence is to generate a v...
Generating synchronized and natural lip movement with speech is one of t...
Most recommender systems optimize the model on observed interaction data...
Explainability is crucial for probing graph neural networks (GNNs), answ...
Rumor detection has become an emerging and active research field in rece...
Multi-frame human pose estimation has long been a compelling and fundame...
Intrinsic interpretability of graph neural networks (GNNs) is to find a ...
Explainability of graph neural networks (GNNs) aims to answer “Why the G...
We present TFGM (Training Free Graph Matching), a framework to boost the...
Learning objectives of recommender models remain largely unexplored. Mos...
Learning powerful representations is one central theme of graph neural n...