Vision Transformers (ViTs) are normally regarded as a stack of transform...
Public large-scale text-to-image diffusion models, such as Stable Diffus...
Localizing people and recognizing their actions from videos is a challen...
Temporal action proposal generation (TAPG) is a fundamental and challeng...
Video-based human pose estimation in crowded scenes is a challenging pro...
Detecting and recognizing human action in videos with crowded scenes is ...
This paper presents our solution to the ACM MM challenge: Large-scale Human-...