[VLM] Accuracy Evaluation #2393
base: master
Conversation
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
```python
    token=dataset_cli.token,
    split="+".join(dataset_cli.split),
)
logger.info(f"LEN: {len(self.dataset)}")
```
Suggested change:

```diff
-logger.info(f"LEN: {len(self.dataset)}")
+logger.debug("Loaded {} samples from the dataset splits {}.", len(self.dataset), dataset_cli.split)
```
https://docs.astral.sh/ruff/rules/logging-f-string/

The `logging` module provides a mechanism for passing additional values to be logged using the `extra` keyword argument. This is more consistent, more efficient, and less error-prone than formatting the string directly.
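For illustration, a minimal sketch of the difference, assuming the project logs through loguru (an assumption based on the suggested `logger.debug("Loaded {} samples ...", ...)` call style and the log output shown later in this thread); `num_samples` and `splits` are illustrative stand-ins:

```python
# Sketch contrasting eager vs. deferred log-message formatting (assumes loguru).
from loguru import logger

num_samples = 1000
splits = ["train", "test"]

# Eager: the f-string is evaluated even when DEBUG records end up being filtered out.
# This is what ruff's logging-f-string rule flags.
logger.debug(f"Loaded {num_samples} samples from the dataset splits {splits}.")

# Deferred: loguru applies str.format() with the extra arguments only if the
# message is actually going to be handled by a sink.
logger.debug("Loaded {} samples from the dataset splits {}.", num_samples, splits)

# The stdlib `logging` equivalent defers formatting with %-style placeholders:
# import logging
# logging.getLogger(__name__).debug("Loaded %d samples from splits %s.", num_samples, splits)
```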
```text
- is_secondhand
You must choose only one, which is the most appropriate/correct,
category out of the list of possible product categories.
Your response should only contain a valid JSON object and nothing more.
The JSON object should match the followng JSON schema:
```
````text
```text
{sample['product_title']}
```
The description of the product is the following:
```text
{sample['product_description']}
```
The following are the possible product categories:
```json
{sample['potential_product_categories']}
```
````
```diff
     constructor.
     """
-    return len(self.dataset)
+    return min(len(self.dataset), self.min_query_count)
```
Actually, I'm wondering: what was the exact problem that you encountered without this change?
I was just double-checking the `QuerySampleLibrary` interface definition (in `loadgen/query_sample_library.h`), and it seemed that the first parameter to `ConstructQSL` is supposed to be the overall range of valid indices, which is not necessarily correlated with `min_query_count`, which defines how many queries to actually run in the test. Below is what Cursor tells me:
Understanding min_query_count in LoadGen
Exported on 11/23/2025 at 20:44:37 EST from Cursor (2.0.60)
User
@mlperf-inference In LoadGen, what would happen if min_query_count is lower than the first parameter to lg.ConstructQSL?
Cursor
Searching for ConstructQSL and min_query_count definitions and usage.
Reviewing validation logic for min_query_count and its relationship to the QSL's total sample count:
Checking how queries are issued and whether there's validation when min_query_count exceeds available samples:
Checking how sample indices are selected and whether samples wrap when queries exceed the total sample count:
Checking for warnings or errors when min_query_count exceeds available samples:
Answer
If min_query_count is lower than the first parameter to lg.ConstructQSL (which is total_sample_count), LoadGen will still work. It will reuse samples to meet the query count.
How it works:
- `ConstructQSL` takes `total_sample_count` as the first parameter: the total number of samples available in the QSL.
- `min_query_count` is the minimum number of queries to issue, not the number of unique samples.
- Sample reuse behavior:
  - Performance mode: uses `std::uniform_int_distribution<>(0, sample_count - 1)` to randomly select samples, allowing reuse.
  - Accuracy mode: uses a shuffled list that wraps via modulo (`indices.at((i++) % indices.size())`).
  - Equal issue mode: also uses modulo wrapping to cycle through samples.
Example:
If you have:
- `total_sample_count = 1000` (first parameter to `ConstructQSL`)
- `min_query_count = 5000`
LoadGen will:
- Generate 5000 queries
- Reuse samples (randomly in performance mode, or by cycling in accuracy mode)
- Complete the test successfully
- Not generate any errors or warnings
This is expected behavior: min_query_count controls the number of queries, not unique samples. LoadGen reuses samples when needed to meet the query count requirement.
User
No no, I meant that total_sample_count is bigger than min_query_count
Cursor
Checking what happens when total_sample_count (first parameter to ConstructQSL) is larger than min_query_count.
Checking how LoadGen handles this case:
Answer
If total_sample_count (first parameter to lg.ConstructQSL) is larger than min_query_count, LoadGen will work normally. It will only use a subset of the available samples.
How it works:
- Loadable set generation: `GenerateLoadableSets` creates loadable sets from all `total_sample_count` samples, partitioned by `performance_sample_count`:

  ```cpp
  const size_t qsl_total_count = qsl->TotalSampleCount();
  std::vector<QuerySampleIndex> samples(qsl_total_count);
  for (size_t i = 0; i < qsl_total_count; i++) {
    samples[i] = static_cast<QuerySampleIndex>(i);
  }
  ```

- Query generation: LoadGen generates exactly `min_query_count` queries, selecting samples from the loadable sets.
- Sample selection:
  - In Performance mode: uses the first loadable set (size = `performance_sample_count`).
  - In Accuracy mode: uses all loadable sets sequentially.
  - Samples are selected from these sets using the configured distribution.
Example:
If you have:
- `total_sample_count = 5000` (first parameter to `ConstructQSL`)
- `min_query_count = 1000`
- `performance_sample_count = 2000` (default or override)
LoadGen will:
- Create loadable sets from all 5000 samples
- Generate exactly 1000 queries
- Use samples from the loadable sets (likely only from the first loadable set in performance mode)
- Leave many samples unused — this is expected and fine
Important points:
- No error or warning is generated
- LoadGen doesn't require using all samples
- `total_sample_count` indicates how many samples are available, not how many must be used
- `min_query_count` controls how many queries to generate, independent of the total sample count
This is a common scenario: you may have a large dataset but only need to run a smaller number of queries for testing or benchmarking.
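To make the relationship concrete, here is a minimal configuration sketch, assuming the standard `mlperf_loadgen` Python bindings; the sample counts and the callback bodies are illustrative placeholders, not the actual VL2L harness:

```python
import mlperf_loadgen as lg

TOTAL_SAMPLE_COUNT = 5000        # size of the dataset, i.e. the range of valid QSL indices
PERFORMANCE_SAMPLE_COUNT = 2000  # how many samples LoadGen may ask to keep loaded at once


def load_samples_to_ram(sample_indices):
    # Hypothetical placeholder: load the given dataset indices into memory.
    pass


def unload_samples_from_ram(sample_indices):
    # Hypothetical placeholder: release the given dataset indices.
    pass


qsl = lg.ConstructQSL(
    TOTAL_SAMPLE_COUNT,            # total valid indices, independent of how many queries run
    PERFORMANCE_SAMPLE_COUNT,
    load_samples_to_ram,
    unload_samples_from_ram,
)

settings = lg.TestSettings()
settings.scenario = lg.TestScenario.Offline
settings.mode = lg.TestMode.AccuracyOnly
settings.min_query_count = 1000    # fewer queries than samples is fine; extras go unused
```

With these settings, LoadGen would issue at least 1000 queries drawn from the 5000 valid indices, and the remaining samples simply go unused, matching the behavior described above.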
}| """ | ||
| image_file = BytesIO() | ||
| sample["image"].save(image_file, format="PNG") | ||
| sample["product_image"].save(image_file, format="PNG") |
I believe the image format in the newer version of the dataset is JPEG instead of PNG, which brings up a good point that I missed previously: we should probably parameterize the image format in the request to be consistent with the original dataset:

```python
...
sample["product_image"].save(image_file, format=sample["product_image"].format)
...
"image_url": {
    "url": f"data:image/{sample['product_image'].format};base64,{image_base64_string}",
},
...
```
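As a self-contained sketch of that suggestion (the helper name is hypothetical; only the Pillow and `base64` calls are standard APIs):

```python
import base64
from io import BytesIO

from PIL import Image


def image_to_data_url(image: Image.Image) -> str:
    """Hypothetical helper: encode a PIL image as a data URL in its original format."""
    image_format = image.format or "PNG"  # .format can be None for images created in memory
    image_file = BytesIO()
    image.save(image_file, format=image_format)
    image_base64_string = base64.b64encode(image_file.getvalue()).decode("utf-8")
    return f"data:image/{image_format.lower()};base64,{image_base64_string}"
```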
```python
@field_validator("split", mode="before")
@classmethod
def normalize_and_validate_split(cls, v: str) -> list[str]:
```
I don't think you necessarily need a `before` field validator for parsing a list. A list of values can be passed in by repeating the same flag multiple times (this was originally a Click convention that Typer then inherited):

```console
(mlperf-inf) shangw@shangw-ldt:~/workspace/mlperf-inference/multimodal/vl2l$ mlperf-inf-mm-vl2l benchmark --dataset.split train --dataset.split test
...
2025-11-23 21:21:28.788 | INFO | mlperf_inference_multimodal_vl2l.cli:benchmark:486 - Running VL2L benchmark with dataset: repo_id='Shopify/the-catalogue-public-beta' token=None split=['train', 'test']
...
```

vs.

```console
(mlperf-inf) shangw@shangw-ldt:~/workspace/mlperf-inference/multimodal/vl2l$ mlperf-inf-mm-vl2l benchmark --dataset.split test --dataset.split train
...
2025-11-23 21:22:02.993 | INFO | mlperf_inference_multimodal_vl2l.cli:benchmark:486 - Running VL2L benchmark with dataset: repo_id='Shopify/the-catalogue-public-beta' token=None split=['test', 'train']
...
```

Also, the splits aren't necessarily always called `train` and `test` (i.e., they could be called `testing` or `validation` or `chinchilla` :) ). If you want to validate that the requested splits are valid, the right approach would be to (see the sketch after this list):
- Use an `after` field validator, and let Typer parse the list.
- Get the available splits through `datasets.get_dataset_split_names()` and check against that.
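A rough sketch of that approach, assuming a Pydantic v2 model; the `DatasetCli` name, its fields, and passing `token=` to `datasets.get_dataset_split_names()` are illustrative assumptions, not the actual PR code:

```python
import datasets
from pydantic import BaseModel, ValidationInfo, field_validator


class DatasetCli(BaseModel):
    # Hypothetical model and field names for illustration.
    repo_id: str = "Shopify/the-catalogue-public-beta"
    token: str | None = None
    split: list[str] = ["train"]

    @field_validator("split", mode="after")
    @classmethod
    def validate_split(cls, v: list[str], info: ValidationInfo) -> list[str]:
        # Check each requested split against the splits actually published for the dataset.
        available = set(
            datasets.get_dataset_split_names(info.data["repo_id"], token=info.data.get("token"))
        )
        unknown = [s for s in v if s not in available]
        if unknown:
            raise ValueError(f"Unknown split(s) {unknown}; available splits: {sorted(available)}")
        return v
```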
```python
from .task import ShopifyGlobalCatalogue


app = Typer()
SplitType = Literal["train", "test"]
```
Remove the unused type definition.
| "Allowed values: 'train', 'test'." | ||
| ), | ||
| ), | ||
| ] = ["train"] |
| ] = ["train"] | |
| ] = ["train", "test"] |
This PR serves as a proposal to evaluate the accuracy of the VLM.
Notes
- `calculate_exact_match` was left in (in case we need it during development), but it may be deleted later.

As of now, running 1k samples, I get: