Conversation

@johncalesp
Contributor

This PR serves as a proposal to evaluate the accuracy of the VLM.
Notes

  • Added min.query.count to the Task since, as of now, LoadGen sends a number of requests equal to the total number of records in the dataset, and we need a knob to control how many requests to send (see the sketch after this list).
  • Added the class Evaluator to run the evaluation (the method calculate_exact_match was kept in case we need it during development; it may be deleted later).
  • Added extra dependencies.
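
For context, here's a minimal sketch of how such a knob typically maps onto LoadGen's TestSettings, assuming the mlperf_loadgen Python bindings (the counts below are illustrative, not the PR's defaults):

```python
import mlperf_loadgen as lg

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.Offline
settings.mode = lg.TestMode.AccuracyOnly
# Cap how many queries LoadGen issues, independent of the dataset size.
settings.min_query_count = 1000
settings.max_query_count = 1000
```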

As of now, running 1k samples, I get:

╒═══════════════╤════════════╕
│ Fields        │   F1 Score │
╞═══════════════╪════════════╡
│ category      │   0.777359 │
├───────────────┼────────────┤
│ is_secondhand │   0.105263 │
╘═══════════════╧════════════╛
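
Each field's F1 here is ordinary classification F1 computed per field; a toy sketch with scikit-learn (illustrative only; the Evaluator class may compute it differently):

```python
from sklearn.metrics import f1_score

# Toy ground truth vs. predictions for the two fields.
category_true = ["Shoes", "Toys", "Shoes", "Books"]
category_pred = ["Shoes", "Toys", "Books", "Books"]
secondhand_true = [False, True, False, False]
secondhand_pred = [False, False, False, True]

# Multi-class field: micro-averaged F1 over the category labels.
print(f1_score(category_true, category_pred, average="micro"))
# Binary field: default binary F1, with True as the positive class.
print(f1_score(secondhand_true, secondhand_pred))
```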

@johncalesp requested a review from a team as a code owner on November 19, 2025 at 16:23
@github-actions
Contributor

github-actions bot commented Nov 19, 2025

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@wangshangsam
Contributor

As it stands, is_secondhand seems quite low. Did you get this score after including the product description or before? Which model did you use for this?

@johncalesp
Contributor Author

The results were after including the product description, and the model that I'm using is Qwen/Qwen2.5-VL-32B-Instruct.

@wangshangsam
Contributor

The results were after including the product description, and the model that I'm using is Qwen/Qwen2.5-VL-32B-Instruct.

We should try the actual Qwen/Qwen3-VL-235B-A22B-Instruct and Qwen/Qwen3-VL-235B-A22B-Thinking, but that can wait until after this PR is merged.

token=dataset_cli.token,
split="+".join(dataset_cli.split),
)
logger.info(f"LEN: {len(self.dataset)}")

Suggested change
logger.info(f"LEN: {len(self.dataset)}")
logger.debug("Loaded {} samples from the dataset splits {}.", len(self.dataset), dataset_cli.split)

https://docs.astral.sh/ruff/rules/logging-f-string/

The logging module provides a mechanism for passing additional values to be logged using the extra keyword argument. This is more consistent, more efficient, and less error-prone than formatting the string directly.
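
For illustration, the eager-vs-lazy distinction the rule is about, shown with the standard library logging module (loguru's {}-style call in the suggestion above defers formatting similarly):

```python
import logging

logger = logging.getLogger(__name__)
dataset = list(range(1000))  # stand-in for the real dataset

# Eager: the f-string is built even when DEBUG records are discarded.
logger.debug(f"Loaded {len(dataset)} samples")

# Lazy: formatting happens only if the record is actually emitted.
logger.debug("Loaded %d samples", len(dataset))
```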

Comment on lines +416 to +420
- is_secondhand
You must choose only one, which is the most appropriate/correct,
category out of the list of possible product categories.
Your response should only contain a valid JSON object and nothing more.
The JSON object should match the followng JSON schema:

Suggested change
- is_secondhand
You must choose only one, which is the most appropriate/correct,
category out of the list of possible product categories.
Your response should only contain a valid JSON object and nothing more.
The JSON object should match the followng JSON schema:

Comment on lines +439 to +449
```text
{sample['product_title']}
```
The description of the product is the following:
```text
{sample['product_description']}
```
The following are the possible product categories:
```json
{sample['potential_product_categories']}
```

Suggested change
```text
{sample['product_title']}
```
The description of the product is the following:
```text
{sample['product_description']}
```
The following are the possible product categories:
```json
{sample['potential_product_categories']}
```

constructor.
"""
return len(self.dataset)
return min(len(self.dataset), self.min_query_count)

Actually, I'm wondering: what was the exact problem that you encountered without this change?

I was just double-checking the QuerySampleLibrary interface definition (in loadgen/query_sample_library.h), and it seems that the first parameter to ConstructQSL is supposed to be the overall range of valid indices, which is not necessarily correlated with min_query_count, which defines how many actual queries to run in the test. Below is what Cursor tells me:

Understanding min_query_count in LoadGen

Exported on 11/23/2025 at 20:44:37 EST from Cursor (2.0.60)


User

@mlperf-inference In LoadGen, what would happen if min_query_count is lower than the first parameter to lg.ConstructQSL?


Cursor

Searching for ConstructQSL and min_query_count definitions and usage.

Reviewing validation logic for min_query_count and its relationship to the QSL's total sample count:

Checking how queries are issued and whether there's validation when min_query_count exceeds available samples:

Checking how sample indices are selected and whether samples wrap when queries exceed the total sample count:

Checking for warnings or errors when min_query_count exceeds available samples:

Answer

If min_query_count is lower than the first parameter to lg.ConstructQSL (which is total_sample_count), LoadGen will still work. It will reuse samples to meet the query count.

How it works:

  1. ConstructQSL takes total_sample_count as the first parameter — the total number of samples available in the QSL.

  2. min_query_count is the minimum number of queries to issue, not the number of unique samples.

  3. Sample reuse behavior:

    • Performance mode: Uses std::uniform_int_distribution<>(0, sample_count - 1) to randomly select samples, allowing reuse.
    • Accuracy mode: Uses a shuffled list that wraps via modulo (indices.at((i++) % indices.size())).
    • Equal issue mode: Also uses modulo wrapping to cycle through samples.

Example:

If you have:

  • total_sample_count = 1000 (first parameter to ConstructQSL)
  • min_query_count = 5000

LoadGen will:

  • Generate 5000 queries
  • Reuse samples (randomly in performance mode, or by cycling in accuracy mode)
  • Complete the test successfully
  • Not generate any errors or warnings

This is expected behavior: min_query_count controls the number of queries, not unique samples. LoadGen reuses samples when needed to meet the query count requirement.


User

No no, I meant that total_sample_count is bigger than min_query_count


Cursor

Checking what happens when total_sample_count (first parameter to ConstructQSL) is larger than min_query_count.

Checking how LoadGen handles this case:

Answer

If total_sample_count (first parameter to lg.ConstructQSL) is larger than min_query_count, LoadGen will work normally. It will only use a subset of the available samples.

How it works:

  1. Loadable set generation: GenerateLoadableSets creates loadable sets from all total_sample_count samples, partitioned by performance_sample_count:

    const size_t qsl_total_count = qsl->TotalSampleCount();
    std::vector<QuerySampleIndex> samples(qsl_total_count);
    for (size_t i = 0; i < qsl_total_count; i++) {
      samples[i] = static_cast<QuerySampleIndex>(i);
    }
  2. Query generation: LoadGen generates exactly min_query_count queries, selecting samples from the loadable sets.

  3. Sample selection:

    • In Performance mode: Uses the first loadable set (size = performance_sample_count)
    • In Accuracy mode: Uses all loadable sets sequentially
    • Samples are selected from these sets using the configured distribution

Example:

If you have:

  • total_sample_count = 5000 (first parameter to ConstructQSL)
  • min_query_count = 1000
  • performance_sample_count = 2000 (default or override)

LoadGen will:

  • Create loadable sets from all 5000 samples
  • Generate exactly 1000 queries
  • Use samples from the loadable sets (likely only from the first loadable set in performance mode)
  • Leave many samples unused — this is expected and fine

Important points:

  • No error or warning is generated
  • LoadGen doesn't require using all samples
  • total_sample_count indicates how many samples are available, not how many must be used
  • min_query_count controls how many queries to generate, independent of total sample count

This is a common scenario: you may have a large dataset but only need to run a smaller number of queries for testing or benchmarking.
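
To make the distinction concrete, here's a minimal sketch of how the two values are wired up, assuming the mlperf_loadgen Python bindings (the callback names and counts are placeholders):

```python
import mlperf_loadgen as lg

def load_query_samples(sample_indices):
    pass  # placeholder: pre-load these samples into memory

def unload_query_samples(sample_indices):
    pass  # placeholder: release these samples

total_sample_count = 5000        # range of valid indices the QSL can serve
performance_sample_count = 2000  # samples resident in memory at once

qsl = lg.ConstructQSL(
    total_sample_count,
    performance_sample_count,
    load_query_samples,
    unload_query_samples,
)

settings = lg.TestSettings()
settings.min_query_count = 1000  # queries to issue; independent of QSL size
```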

"""
image_file = BytesIO()
sample["image"].save(image_file, format="PNG")
sample["product_image"].save(image_file, format="PNG")

I believe the image format in the newer version of the dataset is JPEG instead of PNG, which brings up a good point that I missed previously: we should probably parameterize the image format in the request to be consistent with the original dataset:

...
        sample["product_image"].save(image_file, format=sample["product_image"].format)
...
                        "image_url": {
                            "url": f"data:image/{sample['product_image'].format};base64,{image_base64_string}",
                        },
...
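
A fuller sketch of that idea, with encode_image as a hypothetical helper (note that PIL's Image.format can be None for images created in memory, hence the fallback):

```python
import base64
from io import BytesIO

def encode_image(sample):
    # Keep the image's original on-disk format (JPEG, PNG, ...).
    image = sample["product_image"]
    fmt = image.format or "PNG"  # PIL reports None for in-memory images
    buffer = BytesIO()
    image.save(buffer, format=fmt)
    image_base64_string = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return f"data:image/{fmt.lower()};base64,{image_base64_string}"
```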


@field_validator("split", mode="before")
@classmethod
def normalize_and_validate_split(cls, v: str) -> list[str]:

I don't think you necessarily need a before field validator for parsing a list. A list of values can be passed in by repeating the same flag multiple times (this was originally a Click convention that Typer inherited):

(mlperf-inf) shangw@shangw-ldt:~/workspace/mlperf-inference/multimodal/vl2l$ mlperf-inf-mm-vl2l benchmark --dataset.split train --dataset.split test
...
2025-11-23 21:21:28.788 | INFO     | mlperf_inference_multimodal_vl2l.cli:benchmark:486 - Running VL2L benchmark with dataset: repo_id='Shopify/the-catalogue-public-beta' token=None split=['train', 'test']
...

vs.

(mlperf-inf) shangw@shangw-ldt:~/workspace/mlperf-inference/multimodal/vl2l$ mlperf-inf-mm-vl2l benchmark --dataset.split test --dataset.split train
...
2025-11-23 21:22:02.993 | INFO     | mlperf_inference_multimodal_vl2l.cli:benchmark:486 - Running VL2L benchmark with dataset: repo_id='Shopify/the-catalogue-public-beta' token=None split=['test', 'train']
...

Also, the splits aren't necessarily always called train and test (e.g., they could be called testing or validation or chinchilla :) ). If you want to validate the splits, the right approach would be (see the sketch below):

  1. Use an after field validator, and let Typer parse the list.
  2. Get the available splits through datasets.get_dataset_split_names() and check against that.
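
A minimal sketch of that approach; DatasetCli and its field names are assumptions about the PR's model, and get_dataset_split_names() makes a network call to the Hub at validation time:

```python
from datasets import get_dataset_split_names
from pydantic import BaseModel, ValidationInfo, field_validator

class DatasetCli(BaseModel):
    repo_id: str
    split: list[str] = ["train"]

    @field_validator("split", mode="after")
    @classmethod
    def check_splits_exist(cls, v: list[str], info: ValidationInfo) -> list[str]:
        # repo_id is declared before split, so it's available in info.data here.
        available = get_dataset_split_names(info.data["repo_id"])
        unknown = set(v) - set(available)
        if unknown:
            msg = f"Unknown splits {sorted(unknown)}; available: {available}"
            raise ValueError(msg)
        return v
```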

from .task import ShopifyGlobalCatalogue

app = Typer()
SplitType = Literal["train", "test"]

Remove this unused type definition.

"Allowed values: 'train', 'test'."
),
),
] = ["train"]

Suggested change
] = ["train"]
] = ["train", "test"]
