Dialogue related Machine Reading Comprehension requires language models ...
We introduce OpenIllumination, a real-world dataset containing over 108K...
Large-scale vision-language models (VLMs), e.g., CLIP, learn broad visua...
Temporal knowledge graphs, representing the dynamic relationships and
in...
Deep learning based Computer Aided Diagnosis (CAD) systems have been
dev...
Scene text image super-resolution (STISR) is an important pre-processing...
Improving the feature representation ability is the foundation of many w...
Hand trajectory forecasting from egocentric views is crucial for enablin...
Despite the impressive performance obtained by recent single-image hand
...
In this paper, we propose an approach to obtain a personalized generativ...
Model pre-training on large text corpora has been demonstrated effective...
Researchers usually come up with new ideas only after thoroughly
compreh...
Direct optimization of interpolated features on multi-resolution voxel g...
Trajectory prediction is a crucial undertaking in understanding entity
m...
Recent years have seen the ever-increasing importance of pre-trained mod...
Capturing interaction of hands with objects is important to autonomously...
Neural Radiance Fields (NeRF) have led to breakthroughs in the novel vie...
Multi-instance learning (MIL) is an effective paradigm for whole-slide
p...
Contrastive loss has been increasingly used in learning representations ...
Anomaly detection in videos is a significant yet challenging problem.
Pr...
Modeling spatial relationship in the data remains critical across many
d...
A high-quality 3D reconstruction of a scene from a collection of 2D imag...
Temporal knowledge graph, serving as an effective way to store and model...
Existing action recognition methods typically sample a few frames to
rep...
Visually exploring in a real-world 4D spatiotemporal space freely in VR ...
Synthetic datasets are often used to pretrain end-to-end optical flow
ne...
Improving fairness between privileged and less-privileged sensitive attr...
Graph-to-text (G2T) generation and text-to-graph (T2G) triple extraction...
Self-supervised monocular depth estimation has seen significant progress...
In this technical report, we represent our solution for the Human-centri...
Can we combine heterogenous graph structure with text to learn high-qual...
Multi-task learning (MTL) paradigm focuses on jointly learning two or mo...
Recent research has shown that large language models pretrained using
un...
In this paper, we present a learning-based approach for multi-view stere...
Learning self-supervised image representations has been broadly studied ...
We present a robust visual-inertial SLAM system that combines the benefi...
In this paper, we deal with the problem of monocular depth estimation fo...
We present a robust and accurate depth refinement system, named GeoRefin...
The 3D Lookup Table (3D LUT) is a highly-efficient tool for real-time im...
Recent works have empirically shown the effectiveness of data augmentati...
High-quality articulatory speech synthesis has many potential applicatio...
When applying multi-instance learning (MIL) to make predictions for bags...
Much progress has been made in the deep neural network (DNN) based diagn...
We present a novel framework named PlaneMVS for 3D plane reconstruction ...
Video instance segmentation (VIS) task requires classifying, segmenting,...
Pedestrian trajectory prediction is an essential component in a wide ran...
Aligning signals from different modalities is an important step in
visio...
Vision-language representation learning largely benefits from image-text...
Hierarchical semantic structures naturally exist in an image dataset, in...
By leveraging data from a fully labeled source domain, unsupervised doma...