Conversation

@johncalesp
Contributor

This PR serves as a proposal to evaluate the accuracy of the VLM.
Notes

  • Added min.query.count to the Task since, as of now, LoadGen sends a number of requests equal to the total number of records in the dataset, and we need a knob to control how many requests to send (see the sketch after this list).
  • Added the class Evaluator to run the evaluation (the method calculate_exact_match was kept in case we need it during development; it may be deleted later).
  • Added extra dependencies.
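
For context, here's a minimal sketch of how such a knob typically maps onto LoadGen's TestSettings, assuming the mlperf_loadgen Python bindings (the counts below are illustrative, not the PR's defaults):

```python
import mlperf_loadgen as lg

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.Offline
settings.mode = lg.TestMode.AccuracyOnly
# Cap how many queries LoadGen issues, independent of the dataset size.
settings.min_query_count = 1000
settings.max_query_count = 1000
```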

As of now, running 1k samples, I get:

╒═══════════════╤════════════╕
│ Fields        │   F1 Score │
╞═══════════════╪════════════╡
│ category      │   0.777359 │
├───────────────┼────────────┤
│ is_secondhand │   0.105263 │
╘═══════════════╧════════════╛
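
Each field's F1 here is ordinary classification F1 computed per field; a toy sketch with scikit-learn (illustrative only; the Evaluator class may compute it differently):

```python
from sklearn.metrics import f1_score

# Toy ground truth vs. predictions for the two fields.
category_true = ["Shoes", "Toys", "Shoes", "Books"]
category_pred = ["Shoes", "Toys", "Books", "Books"]
secondhand_true = [False, True, False, False]
secondhand_pred = [False, False, False, True]

# Multi-class field: micro-averaged F1 over the category labels.
print(f1_score(category_true, category_pred, average="micro"))
# Binary field: default binary F1, with True as the positive class.
print(f1_score(secondhand_true, secondhand_pred))
```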

@johncalesp requested a review from a team as a code owner on November 19, 2025 at 16:23
@github-actions
Contributor

github-actions bot commented Nov 19, 2025

MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅

@wangshangsam
Contributor

As it stands, is_secondhand seems quite low. Did you get this score after including the product description or before? Which model did you use for this?

@johncalesp
Contributor Author

The results were after including the product description, and the model that I'm using is Qwen/Qwen2.5-VL-32B-Instruct.

@wangshangsam
Contributor

The results were after including the product description, and the model that I'm using is Qwen/Qwen2.5-VL-32B-Instruct.

We should try the actual Qwen/Qwen3-VL-235B-A22B-Instruct and Qwen/Qwen3-VL-235B-A22B-Thinking, but that can wait until after this PR is merged.

token=dataset_cli.token,
split="+".join(dataset_cli.split),
)
logger.info(f"LEN: {len(self.dataset)}")

Suggested change
logger.info(f"LEN: {len(self.dataset)}")
logger.debug("Loaded {} samples from the dataset splits {}.", len(self.dataset), dataset_cli.split)

https://docs.astral.sh/ruff/rules/logging-f-string/

The logging module provides a mechanism for passing additional values to be logged using the extra keyword argument. This is more consistent, more efficient, and less error-prone than formatting the string directly.
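
For illustration, the eager-vs-lazy distinction the rule is about, shown with the standard library logging module (loguru's {}-style call in the suggestion above defers formatting similarly):

```python
import logging

logger = logging.getLogger(__name__)
dataset = list(range(1000))  # stand-in for the real dataset

# Eager: the f-string is built even when DEBUG records are discarded.
logger.debug(f"Loaded {len(dataset)} samples")

# Lazy: formatting happens only if the record is actually emitted.
logger.debug("Loaded %d samples", len(dataset))
```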

Comment on lines +416 to +420
- is_secondhand
You must choose only one, which is the most appropriate/correct,
category out of the list of possible product categories.
Your response should only contain a valid JSON object and nothing more.
The JSON object should match the followng JSON schema:

Suggested change
- is_secondhand
You must choose only one, which is the most appropriate/correct,
category out of the list of possible product categories.
Your response should only contain a valid JSON object and nothing more.
The JSON object should match the followng JSON schema:

Comment on lines +439 to +449
```text
{sample['product_title']}
```
The description of the product is the following:
```text
{sample['product_description']}
```
The following are the possible product categories:
```json
{sample['potential_product_categories']}
```

Suggested change
```text
{sample['product_title']}
```
The description of the product is the following:
```text
{sample['product_description']}
```
The following are the possible product categories:
```json
{sample['potential_product_categories']}
```

constructor.
"""
return len(self.dataset)
return min(len(self.dataset), self.min_query_count)

Actually, I'm wondering: what was the exact problem that you encountered without this change?

I was just double-checking the QuerySampleLibrary interface definition (in loadgen/query_sample_library.h), and it seems that the first parameter to ConstructQSL is supposed to be the overall range of valid indices, which is not necessarily correlated with min_query_count, which defines how many actual queries to run in the test. Below is what Cursor tells me:

Understanding min_query_count in LoadGen

Exported on 11/23/2025 at 20:44:37 EST from Cursor (2.0.60)


User

@mlperf-inference In LoadGen, what would happen if min_query_count is lower than the first parameter to lg.ConstructQSL?


Cursor

Searching for ConstructQSL and min_query_count definitions and usage.

Reviewing validation logic for min_query_count and its relationship to the QSL's total sample count:

Checking how queries are issued and whether there's validation when min_query_count exceeds available samples:

Checking how sample indices are selected and whether samples wrap when queries exceed the total sample count:

Checking for warnings or errors when min_query_count exceeds available samples:

Answer

If min_query_count is lower than the first parameter to lg.ConstructQSL (which is total_sample_count), LoadGen will still work. It will reuse samples to meet the query count.

How it works:

  1. ConstructQSL takes total_sample_count as the first parameter — the total number of samples available in the QSL.

  2. min_query_count is the minimum number of queries to issue, not the number of unique samples.

  3. Sample reuse behavior:

    • Performance mode: Uses std::uniform_int_distribution<>(0, sample_count - 1) to randomly select samples, allowing reuse.
    • Accuracy mode: Uses a shuffled list that wraps via modulo (indices.at((i++) % indices.size())).
    • Equal issue mode: Also uses modulo wrapping to cycle through samples.

Example:

If you have:

  • total_sample_count = 1000 (first parameter to ConstructQSL)
  • min_query_count = 5000

LoadGen will:

  • Generate 5000 queries
  • Reuse samples (randomly in performance mode, or by cycling in accuracy mode)
  • Complete the test successfully
  • Not generate any errors or warnings

This is expected behavior: min_query_count controls the number of queries, not unique samples. LoadGen reuses samples when needed to meet the query count requirement.


User

No no, I meant that total_sample_count is bigger than min_query_count


Cursor

Checking what happens when total_sample_count (first parameter to ConstructQSL) is larger than min_query_count.

Checking how LoadGen handles this case:

Answer

If total_sample_count (first parameter to lg.ConstructQSL) is larger than min_query_count, LoadGen will work normally. It will only use a subset of the available samples.

How it works:

  1. Loadable set generation: GenerateLoadableSets creates loadable sets from all total_sample_count samples, partitioned by performance_sample_count:

    const size_t qsl_total_count = qsl->TotalSampleCount();
    std::vector<QuerySampleIndex> samples(qsl_total_count);
    for (size_t i = 0; i < qsl_total_count; i++) {
      samples[i] = static_cast<QuerySampleIndex>(i);
    }
  2. Query generation: LoadGen generates exactly min_query_count queries, selecting samples from the loadable sets.

  3. Sample selection:

    • In Performance mode: Uses the first loadable set (size = performance_sample_count)
    • In Accuracy mode: Uses all loadable sets sequentially
    • Samples are selected from these sets using the configured distribution

Example:

If you have:

  • total_sample_count = 5000 (first parameter to ConstructQSL)
  • min_query_count = 1000
  • performance_sample_count = 2000 (default or override)

LoadGen will:

  • Create loadable sets from all 5000 samples
  • Generate exactly 1000 queries
  • Use samples from the loadable sets (likely only from the first loadable set in performance mode)
  • Leave many samples unused — this is expected and fine

Important points:

  • No error or warning is generated
  • LoadGen doesn't require using all samples
  • total_sample_count indicates how many samples are available, not how many must be used
  • min_query_count controls how many queries to generate, independent of total sample count

This is a common scenario: you may have a large dataset but only need to run a smaller number of queries for testing or benchmarking.
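
To make the distinction concrete, here's a minimal sketch of how the two values are wired up, assuming the mlperf_loadgen Python bindings (the callback names and counts are placeholders):

```python
import mlperf_loadgen as lg

def load_query_samples(sample_indices):
    pass  # placeholder: pre-load these samples into memory

def unload_query_samples(sample_indices):
    pass  # placeholder: release these samples

total_sample_count = 5000        # range of valid indices the QSL can serve
performance_sample_count = 2000  # samples resident in memory at once

qsl = lg.ConstructQSL(
    total_sample_count,
    performance_sample_count,
    load_query_samples,
    unload_query_samples,
)

settings = lg.TestSettings()
settings.min_query_count = 1000  # queries to issue; independent of QSL size
```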

"""
image_file = BytesIO()
sample["image"].save(image_file, format="PNG")
sample["product_image"].save(image_file, format="PNG")

I believe the image format in the newer version of the dataset is JPEG instead of PNG, which brings up a good point that I missed previously: we should probably parameterize the image format in the request to be consistent with the original dataset:

...
        sample["product_image"].save(image_file, format=sample["product_image"].format)
...
                        "image_url": {
                            "url": f"data:image/{sample['product_image'].format};base64,{image_base64_string}",
                        },
...
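
A fuller sketch of that idea, with encode_image as a hypothetical helper (note that PIL's Image.format can be None for images created in memory, hence the fallback):

```python
import base64
from io import BytesIO

def encode_image(sample):
    # Keep the image's original on-disk format (JPEG, PNG, ...).
    image = sample["product_image"]
    fmt = image.format or "PNG"  # PIL reports None for in-memory images
    buffer = BytesIO()
    image.save(buffer, format=fmt)
    image_base64_string = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return f"data:image/{fmt.lower()};base64,{image_base64_string}"
```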


@field_validator("split", mode="before")
@classmethod
def normalize_and_validate_split(cls, v: str) -> list[str]:

I don't think you necessarily need a before field validator for parsing a list. A list of values can be passed in by repeating the same flag multiple times (this was originally a Click convention that Typer inherited):

(mlperf-inf) shangw@shangw-ldt:~/workspace/mlperf-inference/multimodal/vl2l$ mlperf-inf-mm-vl2l benchmark --dataset.split train --dataset.split test
...
2025-11-23 21:21:28.788 | INFO     | mlperf_inference_multimodal_vl2l.cli:benchmark:486 - Running VL2L benchmark with dataset: repo_id='Shopify/the-catalogue-public-beta' token=None split=['train', 'test']
...

vs.

(mlperf-inf) shangw@shangw-ldt:~/workspace/mlperf-inference/multimodal/vl2l$ mlperf-inf-mm-vl2l benchmark --dataset.split test --dataset.split train
...
2025-11-23 21:22:02.993 | INFO     | mlperf_inference_multimodal_vl2l.cli:benchmark:486 - Running VL2L benchmark with dataset: repo_id='Shopify/the-catalogue-public-beta' token=None split=['test', 'train']
...

Also, the splits aren't necessarily always called train and test (e.g., they could be called testing or validation or chinchilla :) ). If you want to validate the splits, the right approach would be (see the sketch below):

  1. Use an after field validator, and let Typer parse the list.
  2. Get the available splits through datasets.get_dataset_split_names() and check against that.
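
A minimal sketch of that approach; DatasetCli and its field names are assumptions about the PR's model, and get_dataset_split_names() makes a network call to the Hub at validation time:

```python
from datasets import get_dataset_split_names
from pydantic import BaseModel, ValidationInfo, field_validator

class DatasetCli(BaseModel):
    repo_id: str
    split: list[str] = ["train"]

    @field_validator("split", mode="after")
    @classmethod
    def check_splits_exist(cls, v: list[str], info: ValidationInfo) -> list[str]:
        # repo_id is declared before split, so it's available in info.data here.
        available = get_dataset_split_names(info.data["repo_id"])
        unknown = set(v) - set(available)
        if unknown:
            msg = f"Unknown splits {sorted(unknown)}; available: {available}"
            raise ValueError(msg)
        return v
```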

from .task import ShopifyGlobalCatalogue

app = Typer()
SplitType = Literal["train", "test"]

Remove this unused type definition.

"Allowed values: 'train', 'test'."
),
),
] = ["train"]

Suggested change
] = ["train"]
] = ["train", "test"]
