Jiahao Nie *
Nanyang Technological University |
Gongjie Zhang *
Alibaba DAMO Academy |
Wenbin An
Xi'an Jiaotong University |
Yap-Peng Tan
Nanyang Technological University |
Alex C. Kot
Nanyang Technological University |
Shijian Lu
Nanyang Technological University |
Although Multi-modal Large Language Models (MLLMs) have recently achieved significant progress, they often struggle with inter-object relations, i.e., the interactions or associations among distinct objects. This limitation largely stems from insufficient training and evaluation data for relation understanding, which has greatly impeded MLLMs in various vision-language generation and reasoning tasks. We address this challenge by introducing Multi-Modal Relation Understanding (MMRel), a benchmark that features large-scale, high-quality, and diverse data on inter-object relations. MMRel has three distinctive attributes: (i) it contains over 22K question-answer pairs, spanning three distinct domains and covering three relation categories, ensuring both scale and diversity; (ii) it provides manually verified, high-quality labels to ensure exceptional annotation accuracy; (iii) it includes adversarial cases with highly unusual relations, offering a challenging setting for evaluating relation hallucination. These features make MMRel ideal both for evaluating MLLMs on relation understanding and for fine-tuning MLLMs to enhance their relation comprehension. Extensive experiments verify the effectiveness of MMRel in evaluating and enhancing MLLMs' relation understanding capabilities.
Although several benchmarks on inter-object relations have been created, they were not intended for assessing MLLMs' relation understanding. Specifically, most existing benchmarks suffer from clear limitations in data scale, relation categories, and data diversity. We address this issue by creating a comprehensive benchmark on inter-object relations, aiming to gauge and enhance MLLMs' relation understanding capability across various multimodal tasks.
We introduce a Semi-automatic Data Collection pipeline (SemiDC), which is capable of annotating existing images at scale and generating a substantial amount of high-quality synthetic images. As discussed in the paper, re-labeling existing images is essential since their original labels are incompatible with MLLMs. To this end, we design SemiDC to generate high-quality relation annotations via GPT-4V for the large-scale Visual Genome (VG) benchmark. This process is divided into three stages: (i) Pre-processing: we selectively exclude images featuring complex scenes that pose challenges for GPT-4V in generating accurate annotations; (ii) Re-labeling via GPT-4V: we employ the in-context learning paradigm, prompting GPT-4V with exemplars and a text prompt to generate relation annotations; (iii) Human verification: we manually assess and correct the annotations generated by GPT-4V, ensuring the quality of the collected inter-object relation data.
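The three SemiDC stages can be sketched as a simple pipeline. This is a minimal illustration, not the authors' implementation: the object-count threshold for "complex scenes", the `query_gpt4v` call, and the `reviewer` callback are all hypothetical placeholders standing in for the actual GPT-4V API and human verification step.

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    image_id: str
    num_objects: int
    relations: list = field(default_factory=list)
    verified: bool = False

def preprocess(samples, max_objects=10):
    # Stage (i): drop images with overly complex scenes.
    # The threshold is a placeholder; the paper does not specify one.
    return [s for s in samples if s.num_objects <= max_objects]

def relabel(samples, query_gpt4v):
    # Stage (ii): re-label via GPT-4V. `query_gpt4v` is a hypothetical
    # stand-in for a multimodal API call whose prompt carries the
    # in-context exemplars.
    for s in samples:
        s.relations = query_gpt4v(s.image_id)
    return samples

def verify(samples, reviewer):
    # Stage (iii): human verification, simulated here by a reviewer
    # callback that confirms or corrects each annotation.
    for s in samples:
        s.relations = reviewer(s.image_id, s.relations)
        s.verified = True
    return samples

def semidc(samples, query_gpt4v, reviewer, max_objects=10):
    return verify(relabel(preprocess(samples, max_objects), query_gpt4v), reviewer)
```

The staged design keeps the expensive GPT-4V call off images that are likely to be mislabeled anyway, and routes every machine-generated annotation through a human pass before it enters the benchmark.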
The table shows the statistics of MMRel. Specifically, MMRel comprises around 22,500 question-answer pairs (15K Yes/No and 7.5K open-ended) across 7 subsets, spanning 3 domains and 3 categories of relations. Thanks to the open-vocabulary capability of GPT-4V, MMRel covers a diverse range of objects and action relations.
We employ all 15K Yes/No question-answer pairs in MMRel to evaluate how MLLMs perform when handling multimodal data with rich inter-object relations. As the table shows, all nine MLLMs face various problems in relation understanding.
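Scoring the Yes/No subset amounts to mapping each free-form model response to a binary answer and computing accuracy. The sketch below assumes a response is judged by its leading word; MMRel's exact scoring protocol may differ, and `parse_yes_no` is an illustrative helper, not part of the benchmark code.

```python
import re

def parse_yes_no(response: str):
    """Map a free-form MLLM response to 'yes', 'no', or None (unparseable)."""
    text = response.strip().lower()
    if re.match(r"^\W*yes\b", text):
        return "yes"
    if re.match(r"^\W*no\b", text):
        return "no"
    return None

def yes_no_accuracy(responses, labels):
    """Fraction of Yes/No questions answered correctly.

    Unparseable responses count as wrong, which penalizes models that
    evade the question instead of committing to an answer.
    """
    correct = sum(parse_yes_no(r) == l for r, l in zip(responses, labels))
    return correct / len(labels)
```

Counting unparseable answers as errors matters for the adversarial subset, where models hallucinating a plausible-but-wrong relation often hedge rather than answer.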
As the table shows, fine-tuning with MMRel improves relation understanding significantly and consistently across all data domains and relation categories. In addition, fine-tuning improves relation understanding on the adversarial subset as well.
@article{nie2024mmrel,
title={MMRel: A Relation Understanding Benchmark in the MLLM Era},
author={Nie, Jiahao and Zhang, Gongjie and An, Wenbin and Tan, Yap-Peng and Kot, Alex C and Lu, Shijian},
journal={arXiv preprint arXiv:2406.09121},
year={2024}
}
This webpage integrates components from many websites, including RefNeRF, StyleRF, and Richard Zhang's template. We sincerely thank the authors for their great work and websites.