In-context learning, which offers substantial advantages over fine-tunin...
Open-domain conversation systems integrate multiple conversation skills ...
Foundation models have shown outstanding performance and generalization
...
Learning generic joint representations for video and text by a supervise...
Human-Object Interaction (HOI) detection is the task of identifying a se...
The VALUE (Video-And-Language Understanding Evaluation) benchmark is new...
Despite recent progress on computer vision and natural language processi...
Conventional sequential learning methods such as Recurrent Neural Networ...
Conventional sequential learning methods such as Recurrent Neural Networ...
Video understanding is emerging as a new paradigm for studying human-lik...
While conventional methods for sequential learning focus on interaction
...
Bilinear models provide rich representations compared with linear models...