
Conversation

@danielafrimi (Contributor) commented Aug 26, 2025:

Adds a new multimodal model implementation: vllm/model_executor/models/nano_nemotron_vl.py

For online serving, do the following:

1. Start the server:

   ```bash
   vllm serve <hf-card> --runner generate --max-model-len 8192 --trust_remote_code
   ```

2. Send a text prompt for video/image (a fuller client-side sketch follows below):

   ```python
   messages = [{
       "role": "user",
       "content": [
           {
               "type": "text",
               # "text": "What's in this image?",  # when the <image> placeholder is in the prompt
               "text": "What's in this video and what's on the TV?",  # <video> placeholder is in the prompt
           },
           {
               "type": "video_url",
               "video_url": {"url": video_url},
               # For an image instead:
               # "type": "image_url",
               # "image_url": {"url": image_url},
           },
       ],
   }]
   ```
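As a usage illustration (not part of the original PR description), here is a minimal end-to-end sketch of step 2 against the OpenAI-compatible API started in step 1. The base URL, model name, and video URL are placeholder assumptions.

```python
# Minimal sketch: assumes the `vllm serve` instance from step 1 is listening on
# localhost:8000; the model name and video URL below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

video_url = "https://example.com/sample_video.mp4"  # placeholder

response = client.chat.completions.create(
    model="<hf-card>",  # the same model card passed to `vllm serve`
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What's in this video and what's on the TV?"},
            {"type": "video_url", "video_url": {"url": video_url}},
        ],
    }],
    max_tokens=128,
)
print(response.choices[0].message.content)
```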

@gemini-code-assist bot (Contributor) left a comment:

Code Review

This pull request introduces support for the NemotronH Nano VLM model, with the core logic implemented in the new vllm/model_executor/models/nano_nemotron_vl.py file. The implementation is comprehensive, but I've identified a critical issue in the model registry that would prevent the model from loading, along with a few high-severity issues including a debug print statement, a questionable hardcoded limit, and a stateful implementation detail that could be refactored for better robustness.

@DarkLight1337 (Member) left a comment:

Thanks for implementing this in vLLM! Some initial comments

@danielafrimi (Contributor, Author) commented:

@DarkLight1337 Thanks for reviewing!

I'll ping you back.

@DarkLight1337 (Member) left a comment:

Can you add this model to tests/models/multimodal/processing/test_common.py to validate the processing logic?

@DarkLight1337 (Member) commented Aug 28, 2025:

Also need to add it to the test registry tests/models/registry.py

@danielafrimi (Contributor, Author) commented:

Thanks!

@DarkLight1337 There isn't an HF card for this model yet; we will release it in the upcoming month.
So for adding the model to tests/models/registry.py, I don't know the exact model repo name yet (I can add it after the model release).
The same goes for tests/models/multimodal/processing/test_common.py, where it looks like I need to add the HF model_id.

What do you think?

@DarkLight1337 (Member) commented:

Inside tests/models/registry.py, you can add an entry with a dummy model name and set is_available_online=False; that should skip the test for the model (see the sketch below).
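For illustration only (not part of the original comment), such an entry might look roughly like the sketch below. It assumes the registry's existing _HfExamplesInfo helper and its multimodal example-model mapping; the architecture key and repo id are placeholders.

```python
# Hypothetical sketch of an entry in tests/models/registry.py.
# The architecture key and repo id are placeholders; _HfExamplesInfo and the
# mapping name are assumed to match the registry's existing layout.
_MULTIMODAL_EXAMPLE_MODELS = {
    # ... existing entries ...
    "NanoNemotronVLForConditionalGeneration": _HfExamplesInfo(
        "nvidia/NanoNemotron-VL-dummy",  # dummy repo id until the HF card is released
        is_available_online=False,       # skips tests that need the online model
        trust_remote_code=True,
    ),
}
```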

@danielafrimi (Contributor, Author) commented:

@DarkLight1337 Thanks for the review.
Added the model to the registry.

Review comment from a Member:

Let's inherit from InternVL processor to avoid duplicate code

Review comment from @danielafrimi (Contributor, Author):

@DarkLight1337 BaseNanoNemotronVLProcessor and BaseInternVLProcessor don't follow exactly the same logic.
For example:

- In NanoNemotron we don't use min/max/max_dynamic_patch, but we do persist the normalization attributes.
- Image/video processing is largely similar between the two, but NanoNemotron assumes the image/video placeholders are already in the prompt, which leads to slightly different handling (and slightly different processing itself).

Review comment from a Member:

Ok sure, thanks for the explanation

Review comment from a Member:

Is the vision model part not implemented in vLLM yet?

Review comment from @danielafrimi (Contributor, Author):

Yes, I have a draft PR that implements the RADIO model natively in vLLM.
After the current PR is merged, I'll rebase that one:

PR

Commit history (each commit Signed-off-by: Daniel Afrimi <[email protected]>):

- clean
- rename file
- CR
- refactor
- add video support
- add model to test regisry
- online serving works
- online serving works
- wip-remove some code
- online serving works after changing chat_template
Review thread on the following diff hunk:

```python
return (self.language_model.mamba_cache.
        get_seqlen_agnostic_capture_inputs(batch_size))

@classmethod
```
Review comment from @danielafrimi (Contributor, Author):

@DarkLight1337 I added both class methods, which are required to calculate the mamba page size. However, those implementations now exist both in the LLM model and in the VLM one (see vllm/model_executor/models/config.py).

What do you think about this workaround? I'm wondering whether there is a nicer implementation for this.

Review comment from a Member:

@tdoublep probably has a better understanding of this

Review comment from a Member:

The implementation here looks good to me. We need these class methods get_mamba_state_shape_from_config because in vllm/model_executor/models/config.py we do some manipulation of the cache config (e.g., modifying the attention block size) and we need to be able to do this without actually instantiating the model.
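As an aside, purely for illustration (this is not the PR's actual code), the sketch below shows the kind of classmethod being discussed: it derives the mamba state shapes from the config alone, so vllm/model_executor/models/config.py can size the cache without instantiating the model. All attribute names and shape formulas here are simplified placeholders.

```python
# Illustrative sketch only: attribute names and shape formulas are simplified
# placeholders, not the real vLLM implementation.
class NanoNemotronVLSketch:

    @classmethod
    def get_mamba_state_shape_from_config(cls, text_config):
        """Compute the (conv_state, ssm_state) shapes from the config alone,
        so cache-config adjustments (e.g. the attention block size) can be
        made without building the model."""
        conv_state_shape = (
            text_config.mamba_num_heads * text_config.mamba_head_dim,
            text_config.conv_kernel - 1,
        )
        ssm_state_shape = (
            text_config.mamba_num_heads,
            text_config.mamba_head_dim,
            text_config.ssm_state_size,
        )
        return conv_state_shape, ssm_state_shape
```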

@danielafrimi (Contributor, Author) commented:

@DarkLight1337 Fixed the above and added some comments. Let me know what you think.

@danielafrimi (Contributor, Author) commented:

@DarkLight1337 Can we approve it, or are other changes needed?

@DarkLight1337 (Member) left a comment:

LGTM then, thanks

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) September 9, 2025 13:47
@DarkLight1337 DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 9, 2025
@vllm-bot vllm-bot merged commit 72d3010 into vllm-project:main Sep 10, 2025
38 of 41 checks passed
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025

Labels: new-model (Requests to new models), ready (ONLY add when PR is ready to merge/full CI is needed)
