[Feature] GatedDeltaNet Automatic Prefix Caching #26807
Signed-off-by: simondanielsson <[email protected]>
This pull request has merge conflicts that must be resolved before it can be
- @pytest.mark.parametrize("model", SSM_MODELS + HYBRID_MODELS)
+ # @pytest.mark.parametrize("model", SSM_MODELS + HYBRID_MODELS)
+ @pytest.mark.parametrize("model", ["tiny-random/qwen3-next-moe"])
Note to self: revert
Signed-off-by: simondanielsson <[email protected]>
…on (vllm-project#24864) Signed-off-by: yuanyongjie.yyj <[email protected]> Signed-off-by: FENP <[email protected]> Signed-off-by: Jaya Yuan <[email protected]>
Signed-off-by: simondanielsson <[email protected]>
0e64636 to 1d3afe0
Signed-off-by: simondanielsson <[email protected]>
  # used by e.g. Mamba2, NemotronH, Zamba
  chunk_size = getattr(self.hf_text_config, "chunk_size", None)
- return chunk_size
+ return chunk_size or 64
Note to self: Need to find a better way to inject the chunk size. Currently this comes from the hardcoded chunk size in chunk_gated_delta_rule_fwd
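The fallback in the diff above can be reproduced in isolation. This is a minimal sketch, not the vLLM code: `get_chunk_size` and `DEFAULT_CHUNK_SIZE` are hypothetical names, and `SimpleNamespace` stands in for the Hugging Face text config.

```python
from types import SimpleNamespace

# Assumption: mirrors the hardcoded fallback value used in the diff.
DEFAULT_CHUNK_SIZE = 64


def get_chunk_size(hf_text_config, default=DEFAULT_CHUNK_SIZE):
    """Return the config's chunk_size, or `default` if it is absent or falsy."""
    chunk_size = getattr(hf_text_config, "chunk_size", None)
    # `or` falls back for None and also for 0, not only for a missing attribute.
    return chunk_size or default


with_chunk_size = SimpleNamespace(chunk_size=256)  # config that defines chunk_size
without_chunk_size = SimpleNamespace()             # config that does not

print(get_chunk_size(with_chunk_size))
print(get_chunk_size(without_chunk_size))
```

Because `or` treats every falsy value as missing, an explicit `chunk_size=0` would also be replaced by the default; an `is None` check would be stricter if that distinction ever matters.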
Signed-off-by: simondanielsson <[email protected]>
slot_in_copy = slot_in_safe.clamp(min=0).to(
    device=conv_state.device, dtype=torch.long
)
breakpoint()
Note to self: remove these
Purpose
Part of #26201.
Adds Automatic Prefix Caching for GDN. Tries to be similar to APC for Mamba2 as introduced in #25752.
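For readers new to the idea, here is a toy illustration of prefix caching for a recurrent-state layer. It is a sketch under stated assumptions, not the implementation in this PR or in vLLM: `ToyPrefixCache`, `block_hashes`, `toy_update`, and `BLOCK_SIZE` are all hypothetical, and `toy_update` is a stand-in rather than the gated delta rule. The point it shows is that a recurrent state cannot be sliced per token the way a KV cache can, so the state is snapshotted at block boundaries and a later request resumes from the longest cached prefix.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical number of tokens per cache block


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Chain-hash each full block so that one hash identifies the whole prefix."""
    hashes, prev = [], b""
    for i in range(len(tokens) // block_size):
        block = tokens[i * block_size:(i + 1) * block_size]
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes


def toy_update(state, token):
    """Stand-in for a recurrent state update (NOT the gated delta rule)."""
    return (state * 31 + token) % 1_000_003


class ToyPrefixCache:
    def __init__(self):
        self.states = {}  # prefix hash -> state snapshot at that block boundary

    def run(self, tokens):
        """Process tokens, resuming from the longest cached prefix.

        Returns (final_state, number_of_tokens_skipped).
        """
        hashes = block_hashes(tokens)
        state, start = 0, 0
        # Search from the longest prefix down to the shortest.
        for i in range(len(hashes) - 1, -1, -1):
            if hashes[i] in self.states:
                state, start = self.states[hashes[i]], (i + 1) * BLOCK_SIZE
                break
        for pos in range(start, len(tokens)):
            state = toy_update(state, tokens[pos])
            if (pos + 1) % BLOCK_SIZE == 0:
                # Snapshot the state at every block boundary.
                self.states[hashes[pos // BLOCK_SIZE]] = state
        return state, start


cache = ToyPrefixCache()
first = list(range(1, 11))
second = first[:8] + [99, 100]  # shares the first two blocks with `first`
_, skipped_first = cache.run(first)
_, skipped_second = cache.run(second)
print(skipped_first, skipped_second)
```

The second request skips the two shared blocks and only recomputes the tail, while producing the same final state as an uncached run.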
TODOs:
GDN_RECOMPUTE_SUPPRESS_LEVEL=4.
Test Plan
Note: this runs only with the tiny model tiny-random/qwen3-next-moe, as I only have an L4 with 20GB VRAM. It would be great if someone could also try it with Qwen3-Next-80B-A3B.
Test Result
Note: gibberish output due to random model.
No cudagraphs (enforce_eager=True):
With cudagraphs (enforce_eager=False):
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.