Visual Perception by Large Language Model's Weights


1University of Science and Technology of China 2WeChat, Tencent Inc.
3Show Lab, National University of Singapore 4Fudan University
5Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
NeurIPS 2024
We propose a novel parameter space alignment paradigm for MLLMs that addresses the inefficiency of the input space alignment paradigm in visual perception. Following this paradigm, we introduce VLoRA, which converts visual features into LoRA-like weights, achieving comparable performance on various benchmarks while significantly reducing the computational costs of training and inference.



Figure 1. Comparison between traditional input space alignment methods and our proposed VLoRA, a novel parameter space alignment approach.

Abstract

Existing Multimodal Large Language Models (MLLMs) follow the paradigm that perceives visual information by aligning visual features with the input space of Large Language Models (LLMs) and concatenating visual tokens with text tokens to form a unified input sequence for the LLM. These methods demonstrate promising results on various vision-language tasks but incur high computational cost because the visual tokens substantially lengthen the input sequence. In this paper, instead of input space alignment, we propose a novel parameter space alignment paradigm that represents visual information as model weights. For each input image, we use a vision encoder to extract visual features, convert the features into perceptual weights, and merge the perceptual weights with the LLM's weights. In this way, the input to the LLM requires no visual tokens, which shortens the input sequence and greatly improves efficiency. Following this paradigm, we propose VLoRA, equipped with a perceptual weights generator. The perceptual weights generator is designed to convert visual features into perceptual weights with a low-rank property, exhibiting a form similar to LoRA. Experimental results show that VLoRA achieves comparable performance on various MLLM benchmarks while significantly reducing the computational costs of both training and inference. The code and models will be made open-source.
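As a rough illustration of the paradigm, the sketch below applies a single LLM linear layer whose weight carries an image-conditioned low-rank update ∆W = B A, so only text tokens appear in the input sequence. The function name, tensor shapes, and per-image batching are illustrative assumptions, not the released implementation.

import torch

def perceptual_forward(x, W, A, B):
    """Apply one LLM linear layer perturbed by an image-conditioned
    low-rank update Delta W = B @ A (LoRA form). Hypothetical sketch.

    x: (batch, seq, d_in) hidden states of text tokens only -- no visual tokens.
    W: (d_out, d_in) frozen LLM weight.
    A: (batch, r, d_in), B: (batch, d_out, r) perceptual weights for each image.
    Computes x @ (W + B @ A)^T without materializing the full Delta W.
    """
    base = x @ W.t()                                # (batch, seq, d_out)
    delta = torch.einsum('bsd,brd->bsr', x, A)      # x @ A^T  -> (batch, seq, r)
    delta = torch.einsum('bsr,bor->bso', delta, B)  # (.) @ B^T -> (batch, seq, d_out)
    return base + delta

Because the update is low-rank, it can also be merged into the frozen weight once per image, after which inference proceeds exactly as in a plain LLM over text tokens.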

Method


Figure 2. Perceptual Weights Generator. Figure (a) illustrates the pipeline of our perceptual weights generator. We set k learnable perceptual queries, which interact with image features in N decoder blocks, and obtain k visual parameters. Then, a shared linear layer and k independent linear layers are used to convert these visual parameters to perceptual weights ∆W. Figure (b) demonstrates that our approach is formally consistent with LoRA.
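The sketch below is a minimal, hypothetical PyTorch rendering of Figure 2(a): k learnable perceptual queries cross-attend to image features through N decoder blocks, and a shared linear layer plus k independent linear layers produce the two low-rank factors of ∆W. The module name, dimensions, and the exact mapping from queries to factors are assumptions for illustration, not the paper's exact configuration.

import torch
import torch.nn as nn

class PerceptualWeightsGenerator(nn.Module):
    """Hypothetical sketch of the perceptual weights generator (Figure 2(a)).

    k learnable perceptual queries cross-attend to image features through N
    decoder blocks, yielding k visual parameters. A shared linear layer and k
    independent linear layers then map them to the two low-rank factors of
    Delta W = B @ A, which has the same form as LoRA (Figure 2(b)).
    """

    def __init__(self, d_vis=1024, d_llm=4096, k=64, num_blocks=6, n_heads=16):
        super().__init__()
        self.k = k
        # k learnable perceptual queries; k also bounds the rank of Delta W.
        self.queries = nn.Parameter(torch.randn(k, d_vis) * 0.02)
        block = nn.TransformerDecoderLayer(d_model=d_vis, nhead=n_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(block, num_layers=num_blocks)
        # Shared linear layer: maps each visual parameter to one row of A.
        self.shared = nn.Linear(d_vis, d_llm)
        # k independent linear layers: each maps its visual parameter to one column of B.
        self.independent = nn.ModuleList([nn.Linear(d_vis, d_llm) for _ in range(k)])

    def forward(self, image_features):
        """image_features: (batch, num_patches, d_vis) from the vision encoder.
        Returns (A, B) with A: (batch, k, d_llm) and B: (batch, d_llm, k),
        so that Delta W = B @ A can be merged into an LLM weight matrix."""
        b = image_features.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)      # (b, k, d_vis)
        visual_params = self.decoder(q, image_features)      # (b, k, d_vis)
        A = self.shared(visual_params)                        # (b, k, d_llm)
        B = torch.stack([lin(visual_params[:, i])
                         for i, lin in enumerate(self.independent)],
                        dim=2)                                # (b, d_llm, k)
        return A, B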

Experiments


Table 1. Comparisons on six MLLM benchmarks: MMBench, MME, ScienceQA, HallusionBench, MMMU, and CCBench. "vis. tok." denotes the number of visual tokens fed into the LLM. Bold numbers indicate the best results, and underlined numbers the second-best. GFLOPs denotes the computational cost of the LLM part when the number of input text tokens is 32.

BibTeX

@inproceedings{ma2024visual,
  author    = {Ma, Feipeng and Xue, Hongwei and Wang, Guangting and Zhou, Yizhou and Rao, Fengyun and Yan, Shilin and Zhang, Yueyi and Wu, Siying and Shou, Mike Zheng and Sun, Xiaoyan},
  title     = {Visual Perception by Large Language Model's Weights},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2024},
}