This repo contains evaluation code for the paper "VisNumBench: Evaluating Number Sense of Multimodal Large Language Models"
🌐 Homepage | 🤗 Dataset | 📑 Paper | 📖 arXiv
Can Multimodal Large Language Models (MLLMs) develop an intuitive number sense similar to humans? Targeting this problem, we introduce the Visual Number Benchmark (VisNumBench) to evaluate the number sense abilities of MLLMs across a wide range of visual numerical tasks. VisNumBench consists of about 1,900 multiple-choice question-answer pairs derived from both synthetic and real-world visual data, covering seven visual numerical attributes and four types of visual numerical estimation tasks. Our experiments on VisNumBench led to the following key findings: (i) The 17 MLLMs we tested—including open-source models such as Qwen2.5-VL and InternVL2.5, as well as proprietary models like GPT-4o and Gemini 2.0 Flash—perform significantly below human level on number sense-related tasks. (ii) Multimodal mathematical models and multimodal chain-of-thought (CoT) models did not exhibit significant improvements in number sense abilities. (iii) Stronger MLLMs with larger parameter sizes and broader general abilities demonstrate modest gains in number sense abilities. We believe VisNumBench will serve as a valuable resource for the research community, encouraging further advancements in enhancing MLLMs' number sense abilities.
VisNumBench aims to advance the development of multimodal large language models in visual numerical understanding by evaluating their number sense capabilities. The benchmark is dedicated to bridging the gap between abstract mathematical problem-solving and real-world applications in current multimodal models. Please refer to our Hugging Face 🤗 Dataset for more details.
from datasets import load_dataset
# Login using e.g. `huggingface-cli login` to access this dataset
ds = load_dataset("GML-FMGroup/VisNumBench")
Each sample is annotated with two fields: `Attributes` is one of the seven attributes ['Angle', 'Length', 'Scale', 'Depth', 'Quantity', 'Area', 'Volume'], and `task_class` is one of the four tasks ['Range Estimation', 'Value Comparison', 'Value Estimation', 'Multiplicative Estimation'].
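As a quick check, the minimal sketch below prints these two labels for a single sample. The split name `"test"` is an assumption here, and the field names follow the description above; see the dataset card for the exact schema.

```python
from datasets import load_dataset

# Inspect the numerical attribute and task type of one example.
# NOTE: the split name "test" is an assumption, and the field names follow the
# description above; check the dataset card for the actual splits and schema.
ds = load_dataset("GML-FMGroup/VisNumBench", split="test")
sample = ds[0]
print(sample["Attributes"])   # e.g. 'Angle'
print(sample["task_class"])   # e.g. 'Range Estimation'
```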
Please refer to our eval folder for more details.
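Scoring is plain multiple-choice accuracy, reported in % on the leaderboard below: a response counts as correct only when the option it selects matches the ground-truth choice. The following is a minimal sketch of that computation; the `extract_choice` helper and its regex are illustrative assumptions, not the exact parsing logic of the eval scripts.

```python
import re

def extract_choice(response: str) -> str | None:
    """Pull the first standalone option letter (A-D) from a model response.
    NOTE: this regex is an illustrative assumption, not the repo's exact parser."""
    match = re.search(r"\b([A-D])\b", response.strip())
    return match.group(1) if match else None

def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of responses whose extracted option letter matches the gold letter."""
    correct = sum(extract_choice(p) == a for p, a in zip(predictions, answers))
    return correct / len(answers) if answers else 0.0

# Toy example: two of the three predictions match the gold options.
print(accuracy(["The answer is B.", "(C)", "I would guess A."], ["B", "C", "D"]))  # ~0.667
```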
Model | VisNumBench-Synthetic (1,011) | VisNumBench-Real (902) | VisNumBench (overall) |
---|---|---|---|
🏅Human | 95.33 | 97.33 | 96.27 |
🥈Gemini 2.0 Flash | 57.57 | 56.54 | 57.08 |
🥉InternVL2.5-78B | 56.18 | 56.54 | 56.35 |
Qwen2.5-VL-72B | 58.46 | 53.33 | 56.04 |
InternVL2.5-38B | 55.59 | 52.11 | 53.95 |
LLaVA-Onevision-72B | 50.84 | 50.78 | 50.81 |
Qwen2-VL-72B | 54.20 | 46.56 | 50.60 |
LLaVA-v1.6-34B | 44.31 | 50.55 | 47.25 |
Gemini 1.5 Pro | 44.02 | 48.67 | 46.21 |
InternVL2-40B | 45.50 | 45.12 | 45.32 |
Qwen2.5-VL-7B | 46.19 | 41.02 | 43.75 |
Llama-VL-3_2-11B | 43.92 | 43.24 | 43.60 |
Qwen2.5-VL-3B | 42.43 | 42.57 | 42.50 |
Llama-3.2V-11B-cot | 45.50 | 38.36 | 42.13 |
Qwen2-VL-7B | 42.24 | 41.91 | 42.08 |
GPT-4o | 43.72 | 39.58 | 41.77 |
InternVL2-8B-MPO | 40.65 | 39.91 | 40.30 |
LLaVA-Onevision-7B | 39.96 | 40.58 | 40.25 |
InternVL2.5-8B | 39.66 | 40.13 | 39.88 |
InternVL2-8B | 39.56 | 39.58 | 39.57 |
R1-Onevision-7B | 38.87 | 38.25 | 38.58 |
Janus-Pro-7B | 37.69 | 34.26 | 36.07 |
LLaVA-v1.5-13B | 32.15 | 40.02 | 35.86 |
Phi-3.5-vision | 32.34 | 37.25 | 34.66 |
Math-LLaVA-13B | 35.81 | 33.15 | 34.56 |
Gemini 1.5 Flash | 33.33 | 33.70 | 33.50 |
LLaVA-v1.5-7B | 29.38 | 28.49 | 28.96 |
Qwen2-VL-2B | 31.85 | 24.94 | 28.59 |
👀 Random | 24.76 | 25.54 | 25.13 |
VisNumBench utilizes image data from multiple sources. We have made every effort to ensure that the images included in this work comply with applicable copyright laws and are properly credited. However, if you are the copyright holder of any image included in our work and believe that its use conflicts with your licensing agreement, please contact us directly. We are committed to promptly addressing any legitimate concerns.
- Tengjin Weng: [email protected]
- Wenhao Jiang: [email protected]
BibTeX:
@article{weng2025visnumbench,
  title={VisNumBench: Evaluating Number Sense of Multimodal Large Language Models},
  author={Tengjin Weng and Wenhao Jiang and Jingyi Wang and Zhong Ming},
  journal={arXiv preprint arXiv:2503.14939},
  year={2025}
}