Multi-Crit

Official Codebase for "Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following"

[🌐 Webpage] [📖 Paper] [🤗 Huggingface Dataset]

Multi-Crit is the first benchmark designed to evaluate whether multimodal judge models can follow diverse, fine-grained evaluation criteria and deliver reliable criterion-level judgment. It provides multi-criterion human preference annotations for each pair of candidate model responses, and introduces additional metrics to assess an LMM judge’s ability to adhere to pluralistic criteria and to handle criterion-level tradeoffs and conflicts. Multi-Crit offers a challenging suite for rigorously studying and improving the reliability and steerability of multimodal judges.

Quick Usage

Setup

  1. Install the required packages (we use Python 3.10):

    pip install -r requirements.txt
  2. Set up your environment variables: create a .env file in the project root and add the following (see the loading sketch after this list):

    export OPENAI_API_KEY="<your-openai-api-key-here>"
    export OPENAI_API_URL="<your-openai-api-url-here>" # e.g. "https://api.openai.com/v1/chat/completions"
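
The variables defined in .env can then be read inside Python. This is a minimal sketch assuming they are loaded with python-dotenv; the repo may instead read them directly from the shell environment:

    # Sketch: load the API credentials defined in .env (assumes python-dotenv;
    # "export "-prefixed lines are tolerated by its parser).
    import os
    from dotenv import load_dotenv

    load_dotenv()  # looks for a .env file starting from the current directory
    OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
    OPENAI_API_URL = os.environ.get("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions")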

Data Preparation

Download the source images and jsonl files to ./datasets/:

python multi-crit/utils/download_raw_data.py

After downloading, the dataset directory should look like:

datasets/
    ├── images/
    ├── multi-crit-openEnded-flatten.jsonl    # data for the open-ended split
    └── multi-crit-reasoning-flatten.jsonl    # data for the reasoning split
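
Each line in these flattened files is a standalone JSON record describing one (prompt, criterion) evaluation instance. A minimal reading sketch, with field names assumed from the prediction-format example later in this README:

    # Sketch: iterate over one Multi-Crit split (field names assumed from the
    # prediction-format example below; check the downloaded files for the full schema).
    import json

    def load_split(path):
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)

    for record in load_split("datasets/multi-crit-openEnded-flatten.jsonl"):
        print(record["question_id"], record["criterion"], record["preference"])
        break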

Run with LMM Judges

  • We provide example scripts for running gpt-4o:

    • Open-ended split:

      bash multi-crit/scripts/run_gpt-4o_open-ended.sh
    • Reasoning split:

      bash multi-crit/scripts/run_gpt-4o_reasoning.sh

    The results will be saved to ./outputs/<split>/<model-name>/result.json

  • Run with your own judge models:

    • Store your predictions in a JSONL file using the following format (your model’s judgment goes into the critic field):
    {
        "question_id": "visualPuzzles__select-1__efficiency", 
        "image_path": "visualPuzzles/images/image_656.png", 
        "split": "reasoning", 
        "criterion": "efficiency", 
        "preference": "model_a",    
        "prompt_id": "visualPuzzles__select-1", 
        ...
        "critic": {
            "raw-pred": "<raw-prediction-from-your-judge>", 
            "pred-model": "<your-model-name>", 
            "pred": "model_a" // Final prediction: 'model_a' or 'model_b'
        }
    }
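
A minimal sketch of producing such a file, where run_my_judge is a hypothetical placeholder for your own judge model and the output path is arbitrary:

    # Sketch: attach your judge's output to each dataset record in the expected format.
    import json

    def run_my_judge(record):
        # Placeholder: call your own judge model here and parse its verdict.
        raw_output = "..."  # full text returned by your judge
        return {"raw-pred": raw_output, "pred-model": "my-judge-v1", "pred": "model_a"}

    with open("datasets/multi-crit-reasoning-flatten.jsonl", encoding="utf-8") as fin, \
         open("my_judge_predictions.jsonl", "w", encoding="utf-8") as fout:
        for line in fin:
            record = json.loads(line)
            record["critic"] = run_my_judge(record)  # judgment goes into the critic field
            fout.write(json.dumps(record) + "\n")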

Citation

If you find our benchmark useful, please star this repository and cite our paper as follows:

@misc{xiong2025multicritbenchmarkingmultimodaljudges,
      title={Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following}, 
      author={Tianyi Xiong and Yi Ge and Ming Li and Zuolong Zhang and Pranav Kulkarni and Kaishen Wang and Qi He and Zeying Zhu and Chenxi Liu and Ruibo Chen and Tong Zheng and Yanshuo Chen and Xiyao Wang and Renrui Zhang and Wenhu Chen and Heng Huang},
      year={2025},
      eprint={2511.21662},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.21662}, 
}

💥 Benchmark Introduction

  • Multi-Crit covers two major scopes of multimodal judgment:

    1. Open-ended content generation, where responses are free-form and traditional fixed metrics are limited;

      Criteria and brief descriptions:
      • Completeness and Coverage: Addresses the full scope of the task in the user's query, covering all major elements specified in the prompt as well as relevant visual aspects and contextual cues.
      • Visual Grounding and Details: References observable elements in the image, such as objects, spatial relationships, colors, or text, and bases its description or analysis on these details.
      • Factuality / No Hallucination: Avoids visual or factual errors, ensuring all details and claims are present in the image or reasonably supported by the prompt.
      • Creativity and Expressiveness: Demonstrates imagination and originality when appropriate, or precise and knowledgeable articulation for analytical tasks, while remaining contextually appropriate.
      • Clarity and Coherence: Communicates ideas clearly and logically, with fluent language, well-organized structure, and smooth flow of information.
    2. Verifiable reasoning tasks, where the judge model evaluates the quality of model-generated reasoning processes leading to objectively verifiable answers.

      Criteria and brief descriptions:
      • Visual Grounding: References important visual elements, such as objects, layout, or text, and integrates them meaningfully into the reasoning.
      • Logic Coherence and Consistency: Follows a clear, step-by-step logic without contradictions or unjustified leaps, and ensures the answer aligns with the reasoning.
      • Factuality / No Hallucination: Ensures the accuracy of all claims and supports them with the input, avoiding hallucinated visual details or factual errors.
      • Reflection and Exploration: Demonstrates depth of reasoning through reflection and exploration of alternative interpretations, particularly in complex tasks.
      • Conciseness and Efficiency: Remains concise and focused, matching the task complexity while avoiding redundancy or unnecessary over-analysis.
  • Diverse multimodal prompts, challenging response pairs, and high-quality human annotations, collected through a rigorous data-curation pipeline.

    • 425 multimodal prompts from 8 evaluation sources, with response pairs drawn from 11 off-the-shelf LMMs.
    • 1,425 criterion-level human judgments
      • 1,000 for open-ended tasks, 425 for verifiable reasoning tasks.
      • 782 criterion-level conflict cases, where two criteria prefer different responses.

  • Three additional metrics

    • Pluralistic Accuracy ($PAcc$): Checks whether the judge gets all criteria correct for each evaluation instance.
    • Trade-off Sensitivity ($TOS$): Assesses whether the judge can detect at least one criterion-level trade-off between the two responses when humans disagree, i.e., predicts at least one criterion pair with opposite preferences.
    • Conflict Matching Rate ($CMR$): Measures whether the judge correctly resolves each conflicting criterion pair, i.e., predicts both sides of every conflict in agreement with human labels.
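
As a rough illustration of Pluralistic Accuracy: judgments are grouped by prompt, and an instance counts as correct only when every criterion is judged correctly. This sketch is for intuition only, not the repository's evaluation code, and it assumes the field names from the prediction format above:

    # Sketch: Pluralistic Accuracy (PAcc) over a predictions file (illustrative only).
    import json
    from collections import defaultdict

    hits_by_prompt = defaultdict(list)
    with open("my_judge_predictions.jsonl", encoding="utf-8") as f:
        for line in f:
            r = json.loads(line)
            hits_by_prompt[r["prompt_id"]].append(r["critic"]["pred"] == r["preference"])

    pacc = sum(all(hits) for hits in hits_by_prompt.values()) / len(hits_by_prompt)
    print(f"PAcc = {pacc:.3f}")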

Data Examples

Examples for Open-ended Judgment

Examples for Verifiable Reasoning Judgment
