Official Codebase for "Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following"
[🌐 Webpage] [📖 Paper] [🤗 Huggingface Dataset]
Multi-Crit is the first benchmark designed to evaluate whether multimodal judge models can follow diverse, fine-grained evaluation criteria and deliver reliable criterion-level judgment. It provides multi-criterion human preference annotations for each pair of candidate model responses, and introduces additional metrics to assess an LMM judge’s ability to adhere to pluralistic criteria and to handle criterion-level tradeoffs and conflicts. Multi-Crit offers a challenging suite for rigorously studying and improving the reliability and steerability of multimodal judges.
- Install the required packages (we use Python 3.10):

  ```bash
  pip install -r requirements.txt
  ```
- Set up your environment variables: create a `.env` file in the project root and add:

  ```bash
  export OPENAI_API_KEY="<your-openai-api-key-here>"
  export OPENAI_API_URL="<your-openai-api-url-here>"  # e.g. "https://api.openai.com/v1/chat/completions"
  ```
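  If you want to pick these variables up in your own Python code, a minimal sketch is shown below; it assumes the `python-dotenv` package, which is an illustrative choice rather than necessarily how this repo's scripts read the `.env` file.

  ```python
  # Minimal sketch (assumes python-dotenv is installed): load the .env file and
  # read the two variables defined above. The repo's own scripts may do this differently.
  import os
  from dotenv import load_dotenv

  load_dotenv()  # parses .env, including "export VAR=..." style lines

  api_key = os.environ["OPENAI_API_KEY"]
  api_url = os.environ["OPENAI_API_URL"]
  print("Using endpoint:", api_url)
  ```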
- Download the source images and jsonl files to `./datasets/`:

  ```bash
  python multi-crit/utils/download_raw_data.py
  ```

  After downloading, the dataset directory should look like:

  ```
  datasets/
  ├── images/
  ├── multi-crit-openEnded-flatten.jsonl   # data for the open-ended split
  └── multi-crit-reasoning-flatten.jsonl   # data for the reasoning split
  ```
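  Each jsonl file can be read line by line with the standard library; a minimal sketch is below. The field names (`question_id`, `criterion`, `preference`) are taken from the prediction-format example later in this README and are assumed to be present in both splits.

  ```python
  # Minimal sketch: load the open-ended split and peek at the first record.
  import json

  records = []
  with open("datasets/multi-crit-openEnded-flatten.jsonl") as f:
      for line in f:
          records.append(json.loads(line))

  print(len(records), "criterion-level instances")
  first = records[0]
  print(first["question_id"], first["criterion"], first["preference"])
  ```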
- We provide example scripts for running gpt-4o:

  - Open-ended split:

    ```bash
    bash multi-crit/scripts/run_gpt-4o_open-ended.sh
    ```

  - Reasoning split:

    ```bash
    bash multi-crit/scripts/run_gpt-4o_reasoning.sh
    ```

  The results will be saved to `./outputs/<split>/<model-name>/result.json`.
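  To take a quick look at a finished run, something like the sketch below should work; the split directory name and the keys inside `result.json` depend on the script you ran and are not documented here.

  ```python
  # Minimal sketch: load and preview a result file from an earlier run.
  # Adjust the <split> and <model-name> parts of the path to match your setup.
  import json

  with open("outputs/open-ended/gpt-4o/result.json") as f:
      result = json.load(f)

  print(json.dumps(result, indent=2)[:500])  # preview the first part of the report
  ```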
- Run with your own judge models:

  - Store your predictions in a JSONL file using the following format (your model’s judgment goes into the `critic` field):

    ```json
    {
        "question_id": "visualPuzzles__select-1__efficiency",
        "image_path": "visualPuzzles/images/image_656.png",
        "split": "reasoning",
        "criterion": "efficiency",
        "preference": "model_a",
        "prompt_id": "visualPuzzles__select-1",
        ...
        "critic": {
            "raw-pred": "<raw-prediction-from-your-judge>",
            "pred-model": "<your-model-name>",
            "pred": "model_a"  // Final prediction: 'model_a' or 'model_b'
        }
    }
    ```
If you find our benchmark useful, please kindly star this repository and cite our paper as follows:
```bibtex
@misc{xiong2025multicritbenchmarkingmultimodaljudges,
      title={Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following},
      author={Tianyi Xiong and Yi Ge and Ming Li and Zuolong Zhang and Pranav Kulkarni and Kaishen Wang and Qi He and Zeying Zhu and Chenxi Liu and Ruibo Chen and Tong Zheng and Yanshuo Chen and Xiyao Wang and Renrui Zhang and Wenhu Chen and Heng Huang},
      year={2025},
      eprint={2511.21662},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.21662},
}
```
- Multi-Crit covers two major scopes of multimodal judgment:

  - Open-ended content generation, where responses are free-form and traditional fixed metrics are limited:

    | Criteria | Brief description |
    | --- | --- |
    | Completeness and Coverage | Addresses the full scope of the task in the user's query, covering all major elements specified in the prompt as well as relevant visual aspects and contextual cues. |
    | Visual Grounding and Details | References observable elements in the image such as objects, spatial relationships, colors, or text, and bases its description or analysis on these details. |
    | Factuality / No Hallucination | Avoids visual or factual errors, ensuring all details and claims are presented in the image or reasonably supported by the prompt. |
    | Creativity and Expressiveness | Demonstrates imagination and originality when appropriate, or precise and knowledgeable articulation for analytical tasks, while remaining contextually appropriate. |
    | Clarity and Coherence | Communicates ideas clearly and logically, with fluent language, well-organized structure, and smooth flow of information. |
  - Verifiable reasoning tasks, where the judge model evaluates the quality of model-generated reasoning processes leading to objectively verifiable answers:

    | Criteria | Brief description |
    | --- | --- |
    | Visual Grounding | References important visual elements, such as objects, layout, or text, and integrates them meaningfully into the reasoning. |
    | Logic Coherence and Consistency | Follows a clear, step-by-step logic without contradictions or unjustified leaps, and ensures the answer aligns with the reasoning. |
    | Factuality / No Hallucination | Ensures the accuracy of all claims and supports them with the input, avoiding hallucinated visual details or factual errors. |
    | Reflection and Exploration | Demonstrates depth of reasoning through reflection and exploration of alternative interpretations, particularly in complex tasks. |
    | Conciseness and Efficiency | Remains concise and focused, matching the task complexity while avoiding redundancy or unnecessary over-analysis. |
- Diverse multimodal prompts, challenging response pairs, and high-quality human annotations, collected through a rigorous data-curation pipeline:

  - 425 multimodal prompts from 8 evaluation sources, paired with response pairs from 11 off-the-shelf LMMs.
  - 1,425 criterion-level human judgments: 1,000 for open-ended tasks and 425 for verifiable reasoning tasks.
  - 782 criterion-level conflict cases, where two criteria prefer different responses.
- Three additional metrics:

  | Metric | Abbrev. | Description |
  | --- | --- | --- |
  | Pluralistic Accuracy | $PAcc$ | Checks whether the judge gets all criteria correct for each evaluation instance. |
  | Trade-off Sensitivity | $TOS$ | Assesses whether the judge can detect at least one criterion-level trade-off between the two responses when humans disagree, i.e., predicts at least one criterion pair with opposite preferences. |
  | Conflict Matching Rate | $CMR$ | Measures whether the judge can correctly resolve each conflicting criterion pair, i.e., predicts both sides of every conflict in agreement with human labels. |
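
For reference, a minimal sketch of how these three metrics can be computed from criterion-level labels is shown below. It follows the textual definitions above; the data layout (one dict per instance, mapping each criterion to its human label and judge prediction) and the function names are illustrative assumptions, not the repo's official evaluation code.

```python
# Minimal sketch of PAcc, TOS, and CMR as described in the table above.
# Each instance is a dict: criterion -> (human_preference, judge_prediction).
from itertools import combinations

def pluralistic_accuracy(instances):
    """Fraction of instances where the judge matches the human label on every criterion."""
    correct = [all(human == pred for human, pred in inst.values()) for inst in instances]
    return sum(correct) / len(correct)

def tradeoff_sensitivity(instances):
    """Among instances whose human labels conflict across criteria, the fraction
    where the judge also predicts at least one conflicting criterion pair."""
    hits, total = 0, 0
    for inst in instances:
        pairs = list(combinations(inst.values(), 2))
        if not any(h1 != h2 for (h1, _), (h2, _) in pairs):
            continue  # humans agree across all criteria; not a trade-off case
        total += 1
        hits += any(p1 != p2 for (_, p1), (_, p2) in pairs)
    return hits / total if total else 0.0

def conflict_matching_rate(instances):
    """Fraction of human-conflicting criterion pairs where the judge predicts
    both sides in agreement with the human labels."""
    matched, total = 0, 0
    for inst in instances:
        for (h1, p1), (h2, p2) in combinations(inst.values(), 2):
            if h1 != h2:  # human labels conflict on this criterion pair
                total += 1
                matched += (p1 == h1 and p2 == h2)
    return matched / total if total else 0.0

# Toy usage: one instance with two criteria whose human labels conflict,
# and a judge that matches both sides of the conflict.
example = [{
    "factuality": ("model_a", "model_a"),   # (human preference, judge prediction)
    "efficiency": ("model_b", "model_b"),
}]
print(pluralistic_accuracy(example), tradeoff_sensitivity(example), conflict_matching_rate(example))
```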







