Support for NemotronH Nano VLM #23644
Conversation
Code Review
This pull request introduces support for the NemotronH Nano VLM model, with the core logic implemented in the new vllm/model_executor/models/nano_nemotron_vl.py file. The implementation is comprehensive, but I've identified a critical issue in the model registry that would prevent the model from loading, along with a few high-severity issues, including a debug print statement, a questionable hardcoded limit, and a stateful implementation detail that could be refactored for better robustness.
Thanks for implementing this in vLLM! Some initial comments
@DarkLight1337 Thanks for reviewing! Pinging you back.
Can you add this model to tests/models/multimodal/processing/test_common.py to validate the processing logic? Also need to add it to the test registry.
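For reference, a test registry entry typically looks like the sketch below; the architecture name and repo id here are placeholders, since the HF model card for this model is not public yet.

```python
# Hypothetical entry in tests/models/registry.py; the architecture name and
# HF repo id are placeholders until the model card is released.
"NemotronH_Nano_VL": _HfExamplesInfo(
    "nvidia/NemotronH-Nano-VL",   # placeholder repo id
    trust_remote_code=True,
    is_available_online=False,    # no public HF card yet
),
```

The same repo id would then be added to the parametrized model list in tests/models/multimodal/processing/test_common.py.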
Thanks! @DarkLight1337 There isn't an HF card for this model yet; we will release it in the upcoming month. What do you think?
Inside
@DarkLight1337 Thanks for the review.
Let's inherit from InternVL processor to avoid duplicate code
@DarkLight1337 BaseNanoNemotronVLProcessor and BaseInternVLProcessor don't follow exactly the same logic. For example:
- In NanoNemotron we don't use min/max/max_dynamic_patch, but we do persist normalization attributes.
- Image/video processing is largely similar between the two, but NanoNemotron assumes that image/video placeholders are already in the prompt, which leads to slightly different handling (and to differences in the processing itself).
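For context, a minimal sketch of what "persisting normalization attributes" means here; the attribute and config field names are assumptions for illustration, not the actual implementation in nano_nemotron_vl.py:

```python
# Sketch only, assuming config fields norm_mean / norm_std exist.
class BaseNanoNemotronVLProcessor:
    def __init__(self, config, tokenizer) -> None:
        self.config = config
        self.tokenizer = tokenizer
        # Persisted so later preprocessing calls reuse the same normalization,
        # rather than recomputing it per call.
        self.image_mean = tuple(getattr(config, "norm_mean", (0.485, 0.456, 0.406)))
        self.image_std = tuple(getattr(config, "norm_std", (0.229, 0.224, 0.225)))
        # Note: no min/max/max_dynamic_patch handling here, unlike InternVL.
```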
Ok sure, thanks for the explanation
Is the vision model part not implemented in vLLM yet?
Yes, I have a draft PR that implements the RADIO model natively in vLLM. Once the current PR is merged, I'll rebase that one.
Commits (each signed off by Daniel Afrimi <[email protected]>):
- clean
- rename file
- CR
- refactor
- add video support
- add model to test registry
- online serving works
- online serving works
- wip: remove some code
- online serving works after changing chat_template
Force-pushed from 56fd086 to 7c7ccfa
```python
return (self.language_model.mamba_cache.
        get_seqlen_agnostic_capture_inputs(batch_size))

@classmethod
```
@DarkLight1337 Added both class methods, which are required to calculate the mamba page size. However, those implementations now live both in the LLM model and in the VLM one (see vllm/model_executor/models/config.py). What do you think about this workaround? Wondering if there is a nicer implementation for this.
@tdoublep probably has a better understanding of this
The implementation here looks good to me. We need these class methods (get_mamba_state_shape_from_config) because in vllm/model_executor/models/config.py we do some manipulation of the cache config (e.g., modifying the attention block size), and we need to be able to do this without actually instantiating the model.
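To illustrate the idea, here is a minimal sketch of such a class method; the signature and config field names are assumptions for illustration, not the exact vLLM implementation:

```python
# Sketch only: compute the per-layer mamba cache state shapes purely from the
# HF config, so the cache can be sized without instantiating the model.
# All config attribute names below are assumptions.
@classmethod
def get_mamba_state_shape_from_config(cls, hf_config):
    intermediate = hf_config.mamba_num_heads * hf_config.mamba_head_dim
    conv_dim = intermediate + 2 * hf_config.n_groups * hf_config.ssm_state_size
    # Rolling convolution state: one column per conv tap except the newest token.
    conv_state = (conv_dim, hf_config.conv_kernel - 1)
    # SSM (temporal) state: one matrix per head.
    temporal_state = (
        hf_config.mamba_num_heads,
        hf_config.mamba_head_dim,
        hf_config.ssm_state_size,
    )
    return conv_state, temporal_state
```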
@DarkLight1337 Fixed the above and added some comments. Let me know what you think.
@DarkLight1337 Can we approve it, or are any other changes needed?
LGTM then, thanks
Signed-off-by: Daniel Afrimi <[email protected]>
Signed-off-by: xuebwang-amd <[email protected]>
Adds a new multimodal model implementation: vllm/model_executor/models/nano_nemotron_vl.py.

For online serving, run the following:

vllm serve <hf-card> --runner generate --max-model-len 8192 --trust_remote_code

Text prompts for video/image (see the example request below):
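A hedged example of sending an image prompt to the server started above, using the OpenAI-compatible API; the model name, port, and image URL are placeholders:

```python
from openai import OpenAI

# Points at the vLLM OpenAI-compatible server started by `vllm serve`.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="<hf-card>",  # placeholder: the HF model id passed to `vllm serve`
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            # For video inputs, a "video_url" content part can be used instead.
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample.jpg"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```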