The top-down and bottom-up methods are two mainstreams of referring segm...
Interactive segmentation enables users to segment as needed by providing...
Existing text-video retrieval solutions are, in essence, discriminant mo...
Unified visual grounding pursues a simple and generic technical route to...
The Position Embedding (PE) is critical for Vision Transformers (VTs) du...
Weakly supervised semantic segmentation is typically inspired by class a...
Recently, the ability of self-supervised Vision Transformer (ViT) to rep...
While the Vision Transformer (VT) architecture is becoming trendy in com...
In this paper, we show that the difference in Euclidean norm of samples ...