Detecting and Preventing Hallucinations in Large Vision Language Models
Instruction-tuned Large Vision Language Models (LVLMs) have made significant advances in generalizing across a diverse set of multimodal tasks, especially for Visual Question Answering (VQA). However, generating detailed responses that are visually grounded remains a challenge for these models. We find that even current state-of-the-art LVLMs such as InstructBLIP still produce a staggering 30 percent of hallucinatory text in the form of non-existent objects, unfaithful descriptions, and inaccurate relationships. To address this, we introduce M-HalDetect, a Multimodal Hallucination Detection Dataset that can be used to train and benchmark models for hallucination detection and prevention. M-HalDetect consists of 16k fine-grained labels on VQA examples, making it the first comprehensive multi-modal hallucination detection dataset for detailed image descriptions. Unlike previous work that considers only object hallucination, we additionally annotate unfaithful entity descriptions and relationships. To demonstrate the potential of this dataset for preference alignment, we propose fine-grained Direct Preference Optimization (DPO) and also train fine-grained multi-modal reward models, evaluating their effectiveness with best-of-n rejection sampling. We perform human evaluation on both DPO and rejection sampling, and find that they reduce hallucination rates by 41 percent, a significant improvement over the baseline.
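As context for the preference-alignment methods named above, the standard Direct Preference Optimization objective (Rafailov et al., 2023) that a fine-grained, segment-level variant would build on is

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],$$

where $y_w$ and $y_l$ are the preferred and dispreferred responses to prompt $x$, $\pi_{\mathrm{ref}}$ is a frozen reference model, and $\beta$ controls the strength of the implicit KL constraint; the paper's fine-grained formulation is not reproduced here.

Best-of-n rejection sampling with a reward model is sketched below. This is a minimal illustration, not the paper's implementation: `generate_candidates` and `reward_model` are hypothetical stand-ins for an LVLM sampler (e.g., InstructBLIP) and a reward model trained on M-HalDetect.

```python
from typing import Callable, List


def best_of_n(
    image: object,
    question: str,
    generate_candidates: Callable[[object, str, int], List[str]],
    reward_model: Callable[[object, str, str], float],
    n: int = 8,
) -> str:
    """Sample n candidate responses and return the one the reward model
    scores highest (i.e., judges least hallucinatory)."""
    candidates = generate_candidates(image, question, n)
    return max(candidates, key=lambda response: reward_model(image, question, response))


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end without any model weights.
    toy_generate = lambda img, q, n: [f"candidate response {i}" for i in range(n)]
    toy_reward = lambda img, q, resp: float(len(resp))  # placeholder scorer
    print(best_of_n(None, "Describe the image in detail.", toy_generate, toy_reward, n=4))
```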