Official Codebase for "Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following"
[🌐 Webpage] [📖 Paper] [🤗 Huggingface Dataset]
Multi-Crit is the first benchmark designed to evaluate whether multimodal judge models can follow diverse, fine-grained evaluation criteria and deliver reliable criterion-level judgment. It provides multi-criterion human preference annotations for each pair of candidate model responses, and introduces additional metrics to assess an LMM judge’s ability to adhere to pluralistic criteria and to handle criterion-level tradeoffs and conflicts. Multi-Crit offers a challenging suite for rigorously studying and improving the reliability and steerability of multimodal judges.
- Install the required packages (we use Python 3.10):

  ```bash
  pip install -r requirements.txt
  ```
- Set up your environment variables: create a `.env` file in the project root and add:

  ```bash
  export OPENAI_API_KEY="<your-openai-api-key-here>"
  export OPENAI_API_URL="<your-openai-api-url-here>"  # e.g. "https://api.openai.com/v1/chat/completions"
  ```
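  If you want to pick these variables up in your own Python code, a minimal sketch is shown below; it assumes the `python-dotenv` package, which is an illustrative choice rather than necessarily how this repo's scripts read the `.env` file.

  ```python
  # Minimal sketch (assumes python-dotenv is installed): load the .env file and
  # read the two variables defined above. The repo's own scripts may do this differently.
  import os
  from dotenv import load_dotenv

  load_dotenv()  # parses .env, including "export VAR=..." style lines

  api_key = os.environ["OPENAI_API_KEY"]
  api_url = os.environ["OPENAI_API_URL"]
  print("Using endpoint:", api_url)
  ```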
- Download the source images and jsonl files to `./datasets/`:

  ```bash
  python multi-crit/utils/download_raw_data.py
  ```

  After downloading, the dataset directory should look like:

  ```
  datasets/
  ├── images/
  ├── multi-crit-openEnded-flatten.jsonl   # data for the open-ended split
  └── multi-crit-reasoning-flatten.jsonl   # data for the reasoning split
  ```
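  Each jsonl file can be read line by line with the standard library; a minimal sketch is below. The field names (`question_id`, `criterion`, `preference`) are taken from the prediction-format example later in this README and are assumed to be present in both splits.

  ```python
  # Minimal sketch: load the open-ended split and peek at the first record.
  import json

  records = []
  with open("datasets/multi-crit-openEnded-flatten.jsonl") as f:
      for line in f:
          records.append(json.loads(line))

  print(len(records), "criterion-level instances")
  first = records[0]
  print(first["question_id"], first["criterion"], first["preference"])
  ```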
- We provide example scripts for running gpt-4o:

  - Open-ended split:

    ```bash
    bash multi-crit/scripts/run_gpt-4o_open-ended.sh
    ```

  - Reasoning split:

    ```bash
    bash multi-crit/scripts/run_gpt-4o_reasoning.sh
    ```

  The results will be saved to `./outputs/<split>/<model-name>/result.json`.
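  To take a quick look at a finished run, something like the sketch below should work; the split directory name and the keys inside `result.json` depend on the script you ran and are not documented here.

  ```python
  # Minimal sketch: load and preview a result file from an earlier run.
  # Adjust the <split> and <model-name> parts of the path to match your setup.
  import json

  with open("outputs/open-ended/gpt-4o/result.json") as f:
      result = json.load(f)

  print(json.dumps(result, indent=2)[:500])  # preview the first part of the report
  ```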
- Run with your own judge models:

  - Store your predictions in a JSONL file using the following format (your model’s judgment goes into the `critic` field):

    ```json
    {
        "question_id": "visualPuzzles__select-1__efficiency",
        "image_path": "visualPuzzles/images/image_656.png",
        "split": "reasoning",
        "criterion": "efficiency",
        "preference": "model_a",
        "prompt_id": "visualPuzzles__select-1",
        ...
        "critic": {
            "raw-pred": "<raw-prediction-from-your-judge>",
            "pred-model": "<your-model-name>",
            "pred": "model_a"  // Final prediction: 'model_a' or 'model_b'
        }
    }
    ```
If you find our benchmark useful, please kindly star this repository and cite our paper as follows:
```bibtex
@misc{xiong2025multicritbenchmarkingmultimodaljudges,
      title={Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following},
      author={Tianyi Xiong and Yi Ge and Ming Li and Zuolong Zhang and Pranav Kulkarni and Kaishen Wang and Qi He and Zeying Zhu and Chenxi Liu and Ruibo Chen and Tong Zheng and Yanshuo Chen and Xiyao Wang and Renrui Zhang and Wenhu Chen and Heng Huang},
      year={2025},
      eprint={2511.21662},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.21662},
}
```
- Multi-Crit covers two major scopes of multimodal judgment:

  - Open-ended content generation, where responses are free-form and traditional fixed metrics are limited:

    | Criteria | Brief description |
    | --- | --- |
    | Completeness and Coverage | Addresses the full scope of the task in the user's query, covering all major elements specified in the prompt as well as relevant visual aspects and contextual cues. |
    | Visual Grounding and Details | References observable elements in the image such as objects, spatial relationships, colors, or text, and bases its description or analysis on these details. |
    | Factuality / No Hallucination | Avoids visual or factual errors, ensuring all details and claims are presented in the image or reasonably supported by the prompt. |
    | Creativity and Expressiveness | Demonstrates imagination and originality when appropriate, or precise and knowledgeable articulation for analytical tasks, while remaining contextually appropriate. |
    | Clarity and Coherence | Communicates ideas clearly and logically, with fluent language, well-organized structure, and smooth flow of information. |
  - Verifiable reasoning tasks, where the judge model evaluates the quality of model-generated reasoning processes leading to objectively verifiable answers:

    | Criteria | Brief description |
    | --- | --- |
    | Visual Grounding | References important visual elements, such as objects, layout, or text, and integrates them meaningfully into the reasoning. |
    | Logic Coherence and Consistency | Follows a clear, step-by-step logic without contradictions or unjustified leaps, and ensures the answer aligns with the reasoning. |
    | Factuality / No Hallucination | Ensures the accuracy of all claims and supports them with the input, avoiding hallucinated visual details or factual errors. |
    | Reflection and Exploration | Demonstrates depth of reasoning through reflection and exploration of alternative interpretations, particularly in complex tasks. |
    | Conciseness and Efficiency | Remains concise and focused, matching the task complexity while avoiding redundancy or unnecessary over-analysis. |
- Diverse multimodal prompts, challenging response pairs, and high-quality human annotations, collected through a rigorous data-curation pipeline:

  - 425 multimodal prompts from 8 evaluation sources, paired with response pairs from 11 off-the-shelf LMMs.
  - 1,425 criterion-level human judgments: 1,000 for open-ended tasks and 425 for verifiable reasoning tasks.
  - 782 criterion-level conflict cases, where two criteria prefer different responses.
- Three additional metrics:

  | Metric | Abbrev. | Description |
  | --- | --- | --- |
  | Pluralistic Accuracy | $PAcc$ | Checks whether the judge gets all criteria correct for each evaluation instance. |
  | Trade-off Sensitivity | $TOS$ | Assesses whether the judge can detect at least one criterion-level trade-off between the two responses when humans disagree, i.e., predicts at least one criterion pair with opposite preferences. |
  | Conflict Matching Rate | $CMR$ | Measures whether the judge can correctly resolve each conflicting criterion pair, i.e., predicts both sides of every conflict in agreement with human labels. |
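
For reference, a minimal sketch of how these three metrics can be computed from criterion-level labels is shown below. It follows the textual definitions above; the data layout (one dict per instance, mapping each criterion to its human label and judge prediction) and the function names are illustrative assumptions, not the repo's official evaluation code.

```python
# Minimal sketch of PAcc, TOS, and CMR as described in the table above.
# Each instance is a dict: criterion -> (human_preference, judge_prediction).
from itertools import combinations

def pluralistic_accuracy(instances):
    """Fraction of instances where the judge matches the human label on every criterion."""
    correct = [all(human == pred for human, pred in inst.values()) for inst in instances]
    return sum(correct) / len(correct)

def tradeoff_sensitivity(instances):
    """Among instances whose human labels conflict across criteria, the fraction
    where the judge also predicts at least one conflicting criterion pair."""
    hits, total = 0, 0
    for inst in instances:
        pairs = list(combinations(inst.values(), 2))
        if not any(h1 != h2 for (h1, _), (h2, _) in pairs):
            continue  # humans agree across all criteria; not a trade-off case
        total += 1
        hits += any(p1 != p2 for (_, p1), (_, p2) in pairs)
    return hits / total if total else 0.0

def conflict_matching_rate(instances):
    """Fraction of human-conflicting criterion pairs where the judge predicts
    both sides in agreement with the human labels."""
    matched, total = 0, 0
    for inst in instances:
        for (h1, p1), (h2, p2) in combinations(inst.values(), 2):
            if h1 != h2:  # human labels conflict on this criterion pair
                total += 1
                matched += (p1 == h1 and p2 == h2)
    return matched / total if total else 0.0

# Toy usage: one instance with two criteria whose human labels conflict,
# and a judge that matches both sides of the conflict.
example = [{
    "factuality": ("model_a", "model_a"),   # (human preference, judge prediction)
    "efficiency": ("model_b", "model_b"),
}]
print(pluralistic_accuracy(example), tradeoff_sensitivity(example), conflict_matching_rate(example))
```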







