Autonomous driving systems generally employ separate models for differen...
Generating 3D faces from textual descriptions has a multitude of
applica...
Stable Diffusion, a generative model used in text-to-image synthesis,
fr...
Cross-modal pre-training has shown impressive performance on a wide rang...
Cross-modal garment synthesis and manipulation will significantly benefi...
Recently, large-scale diffusion models, e.g., Stable Diffusion and DallE...
Recently, semantic segmentation models trained with image-level text
sup...
Recently, polar-based representation has shown promising properties in
p...
Multi-modality fusion and multi-task learning are becoming increasingly popular in 3D
...
3D object detection from LiDAR point cloud is of critical importance for...
Given a natural language, a general robot has to comprehend the instruct...
Recent advances in text-to-image diffusion models have achieved remarkab...
In recent years, the field of computer vision has seen significant
advan...
Oriented object detection has been developed rapidly in the past few yea...
Large vision and language models, such as Contrastive Language-Image
Pre...
Medical artificial general intelligence (MAGI) enables one foundation mo...
This paper investigates policy resilience to training-environment poison...
This paper presents DetCLIPv2, an efficient and scalable training framew...
Masked Autoencoder (MAE) has demonstrated superior performance on variou...
Contrastive Language-Image Pre-training, benefiting from large-scale
unl...
Existing open-world universal segmentation approaches usually leverage C...
Benefiting from large-scale vision-language pre-training on image-text p...
Multi-task learning has emerged as a powerful paradigm to solve a range ...
Existing text-guided image manipulation methods aim to modify the appear...
Although DETR-based 3D detectors can simplify the detection pipeline and...
Large-scale cross-modal pre-training paradigms have recently shown ubiqu...
Text-guided 3D object generation aims to generate 3D objects described b...
Vision-language pre-training (VLP) has attracted increasing attention
re...
Open-world object detection, as a more general and challenging goal, aim...
Aiming towards a holistic understanding of multiple downstream tasks
sim...
Contrastive Language-Image Pre-training (CLIP) learns rich representatio...
Self-supervised depth learning from monocular images normally relies on ...
Lane detection is an important component of many real-world autonomous
s...
To bridge the gap between supervised semantic segmentation and real-worl...
Vision transformers (ViTs) have pushed the state-of-the-art for various
...
A self-driving perception model aims to extract 3D semantic representati...
Unsupervised contrastive learning for indoor-scene point clouds has achi...
Self-supervised learning (SSL), especially contrastive methods, has rais...
Nowadays, owing to the superior capacity of the large pre-trained langua...
Accurate and reliable 3D detection is vital for many applications includ...
Continual learning is a challenging real-world problem for constructing ...
We present ONCE-3DLanes, a real-world autonomous driving dataset with la...
Neural Architecture Search (NAS) aims to find efficient models for multi...
We present a simple and effective framework, named Point2Seq, for 3D obj...
We present Laneformer, a conceptually simple yet powerful transformer-ba...
Contemporary deep-learning object detection methods for autonomous drivi...
Vision-language navigation (VLN) is a challenging task due to its large
...
Recently, the over-smoothing phenomenon of Transformer-based models has been observ...
There is growing interest in dataset generation due to the
su...