CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers
Vision-language models have achieved tremendous progress in recent years. However, their computational costs and latency have also grown dramatically with this rapid development, making model acceleration exceedingly critical for researchers with limited resources and consumers with low-end devices. Although extensively studied for unimodal models, the acceleration of multimodal models, especially vision-language Transformers, is still relatively under-explored. Accordingly, this paper proposes Cross-Guided Ensemble of Tokens (CrossGET) as a universal vision-language Transformer acceleration framework, which adaptively reduces the number of tokens on the fly during inference via cross-modal guidance, leading to significant model acceleration while maintaining high performance. Specifically, the proposed CrossGET has two key designs: 1) Cross-Guided Matching and Ensemble. CrossGET incorporates cross-modal guided token matching and ensemble to merge tokens effectively, introducing only cross-modal tokens with negligible extra parameters. 2) Complete-Graph Soft Matching. In contrast to the previous bipartite soft matching approach, CrossGET introduces an efficient and effective complete-graph soft matching policy to achieve more reliable token-matching results. Extensive experiments on various vision-language tasks, datasets, and model architectures demonstrate the effectiveness and versatility of the proposed CrossGET framework. The code will be available at https://github.com/sdc17/CrossGET.
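To make the contrast between bipartite and complete-graph soft matching concrete, the following is a minimal PyTorch sketch of greedy token merging over a complete similarity graph. It is illustrative only, not the paper's implementation: the function name, the per-pair greedy loop, and the omission of the cross-modal guidance term are all simplifying assumptions.

```python
import torch
import torch.nn.functional as F


def complete_graph_soft_matching(tokens: torch.Tensor, r: int) -> torch.Tensor:
    """Illustrative sketch: merge the r most similar token pairs.

    tokens: (N, D) token embeddings for a single sample.
    r: number of tokens to remove by merging.

    Unlike bipartite soft matching, which splits tokens into two disjoint
    sets and only matches across the split, this considers all N*(N-1)/2
    candidate pairs. For simplicity, similarities are not recomputed after
    each merge, and cross-modal guidance is omitted.
    """
    n = tokens.size(0)
    x = F.normalize(tokens, dim=-1)
    sim = x @ x.T                             # (N, N) cosine similarities
    sim.fill_diagonal_(float("-inf"))         # ignore self-matches

    merged = tokens.clone()
    alive = torch.ones(n, dtype=torch.bool)   # which token slots remain
    weight = torch.ones(n)                    # tokens accumulated per slot

    for _ in range(r):
        # Mask out rows/columns of already-merged tokens, then pick the
        # most similar surviving pair (i, j).
        masked = sim.clone()
        masked[~alive] = float("-inf")
        masked[:, ~alive] = float("-inf")
        i, j = divmod(masked.argmax().item(), n)
        # Ensemble step: weighted average so each original token
        # contributes equally to the merged representation.
        merged[i] = (weight[i] * merged[i] + weight[j] * merged[j]) / (
            weight[i] + weight[j]
        )
        weight[i] += weight[j]
        alive[j] = False

    return merged[alive]                      # (N - r, D) merged tokens
```

As a usage example, `complete_graph_soft_matching(torch.randn(196, 768), r=98)` would halve the token count of a typical ViT layer's sequence. CrossGET's actual policy additionally weights candidate pairs by cross-modal importance via its learned cross-modal tokens, which this sketch leaves out.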