Add GPT-OSS model code and config [1/N] #22327
Conversation
Signed-off-by: Woosuk Kwon <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Code Review
This pull request adds support for the GPT-OSS model, including its implementation, configuration, and registration in the relevant registries. The model implementation appears to be a Mixture-of-Experts architecture. My review has identified a critical issue with device handling during weight loading and a high-severity issue with logging practices, both within the new vllm/model_executor/models/gpt_oss.py file.
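For context, here is a minimal sketch of how a new architecture is typically exposed to vLLM through its model registry. The import path and the use of the plugin-style `ModelRegistry.register_model` API are assumptions for illustration; in-tree models like this PR's are instead added to the architecture table in vllm/model_executor/models/registry.py.

```python
# Sketch (assumption): plugin-style registration of a model class with vLLM.
from vllm import ModelRegistry
from vllm.model_executor.models.gpt_oss import GptOssForCausalLM  # hypothetical import path

# Map the Hugging Face `architectures` entry to the vLLM implementation.
ModelRegistry.register_model("GptOssForCausalLM", GptOssForCausalLM)
```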
for name, weight in weights:
    # FIXME(woosuk): Remove this after testing.
    weight = weight.cuda()
The hardcoded weight.cuda() call inside the weight-loading loop is problematic. It prevents the model from running on non-CUDA devices (e.g., for testing on CPU) and is inefficient, since it moves weights to the GPU one by one. This can lead to performance and portability problems. The weight-loading pipeline should handle device placement, and this line should be removed. The FIXME comment suggests this is temporary, but it is a critical issue to address before merging.
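As an illustration only, a minimal sketch of a device-agnostic load_weights loop: it assumes vLLM's default_weight_loader convention of copying into a parameter that the loading pipeline has already placed, and the method body is not taken from this PR.

```python
# Sketch (assumption): let each parameter's weight_loader handle device
# placement instead of calling weight.cuda() inside the loop.
from typing import Iterable, Tuple

import torch

from vllm.model_executor.model_loader.weight_utils import default_weight_loader


def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
    params_dict = dict(self.named_parameters())
    for name, loaded_weight in weights:
        param = params_dict[name]
        # default_weight_loader copies into `param`, which already lives on the
        # device chosen by the weight-loading pipeline (GPU, CPU, etc.).
        weight_loader = getattr(param, "weight_loader", default_weight_loader)
        weight_loader(param, loaded_weight)
```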
Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: Noam Gat <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: Paul Pak <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: Diego-Castan <[email protected]>
Signed-off-by: Woosuk Kwon <[email protected]> Signed-off-by: Xiao Yu <[email protected]>
This PR adds the model code and config only. MXFP4 MoE support still needs to be added for the model to be functional.
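For readers unfamiliar with the structure involved, below is a plain-PyTorch sketch of a top-k routed Mixture-of-Experts block; the class name, dimensions, and layout are made up for illustration and are not this PR's layer or vLLM's fused MoE kernel. MXFP4 support would quantize the expert weights inside such a block.

```python
# Illustrative top-k MoE block (assumed shapes/names, not from the PR).
import torch
import torch.nn as nn


class TinyMoE(nn.Module):
    def __init__(self, hidden_size: int = 64, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.GELU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, hidden_size]
        logits = self.router(x)                                            # [T, E]
        weights, idx = torch.topk(logits.softmax(-1), self.top_k, dim=-1)  # [T, K]
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    # Weight each selected expert's output by its router score.
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out
```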