llama : Support llama 4 text-only #12791
@@ -557,6 +557,11 @@ void llama_model::load_hparams(llama_model_loader & ml) {
        ml.get_key(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS, hparams.f_norm_rms_eps);
        ml.get_key(LLM_KV_EXPERT_FEED_FORWARD_LENGTH, hparams.n_ff_exp);
        ml.get_key(LLM_KV_INTERLEAVE_MOE_LAYER_STEP, hparams.n_moe_layer_step);
+       // hack: we use SWA to store the chunked attn mask
+       // luckily, the n_swa_pattern is the same as chunked layer pattern: 3 chunked - 1 full
+       hparams.n_swa_pattern = 4;
+       hparams.n_attn_chunk = 8192; // should this be a gguf kv? currently it's the same for Scout and Maverick
+       hparams.n_swa = 1; // unused, added to trigger the SWA

        switch (hparams.n_expert) {
            case 16: type = LLM_TYPE_17B_16E; break;

Review comments on the "hack: we use SWA to store the chunked attn mask" line:

Comment: Yes, SWA -> AUX makes sense. Btw, we should soon implement actual SWA / chunked attention that uses less memory. It shouldn't be a big change and will improve memory usage significantly for such models.

Reply: The rename makes quite a few more changes than I expected, so I think I'll do it in another PR to test it more thoroughly. Here I'll only edit my comment to make it clearer that I'm using the "swa" variable to store the chunked mask.

Reply: I'll do this after having the logits-matching automated test. The problem is that, rather than just changing the name, I think it's better to add an enum called ...
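For background on what the reused "SWA" fields stand in for here: chunked attention restricts each token to attend only to earlier tokens inside the same fixed-size chunk of `n_attn_chunk` positions (8192 in the diff above). The following is a minimal, self-contained sketch of that rule; it is not code from this PR, just an illustration of the mask the hack encodes.

```cpp
#include <cstdint>
#include <vector>

// Illustrative only: build a causal, chunked attention mask.
// mask[i*n_tokens + j] is true when token i may attend to token j,
// i.e. j <= i and both positions fall in the same chunk of n_attn_chunk tokens.
static std::vector<bool> build_chunked_mask(int32_t n_tokens, int32_t n_attn_chunk) {
    std::vector<bool> mask(n_tokens * n_tokens, false);
    for (int32_t i = 0; i < n_tokens; ++i) {
        const int32_t chunk_start = (i / n_attn_chunk) * n_attn_chunk;
        for (int32_t j = chunk_start; j <= i; ++j) {
            mask[i*n_tokens + j] = true;
        }
    }
    return mask;
}
```

Per the comment in the diff, only three out of every four layers use this chunked mask; the remaining layer in each group of four uses the regular full causal mask, which is why `n_swa_pattern = 4` fits.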
@ngxson Here in this check, I think that the second condition is always false, so we can simplify it to: `... if (kv_self->cells[i].pos < pos_chunk_start) { ...`
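As a rough illustration of the single check being suggested (a hypothetical helper using the names from the quoted snippet, not the PR's actual loop): a cached position is masked out when it lies before the start of the chunk that contains the current position.

```cpp
#include <cstdint>

// Hypothetical sketch of the check under discussion; not the PR's code.
// pos          : position of the token currently being processed
// cell_pos     : position stored in a KV cell (kv_self->cells[i].pos above)
// n_attn_chunk : chunk size (8192 for Scout and Maverick per the diff above)
static bool is_masked_by_chunk(int32_t pos, int32_t cell_pos, int32_t n_attn_chunk) {
    const int32_t pos_chunk_start = (pos / n_attn_chunk) * n_attn_chunk;
    // anything before the start of the current token's chunk cannot be attended to
    return cell_pos < pos_chunk_start;
}
```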
Ah yeah, thanks for noticing, yes it is redundant.
Btw, I'm thinking about how to refactor the mask generation in a way that makes the code easier to understand (i.e. make it read closer to English). My idea looks like this:
WDYT?
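Purely as a hypothetical sketch of the kind of "reads closer to English" structure being described (the helper names below are invented for illustration and are not the author's actual proposal), the mask condition could be expressed through small named predicates:

```cpp
#include <cstdint>

// Hypothetical illustration only; these helpers are invented for this sketch
// and are not part of llama.cpp. The idea: name each masking rule so the
// final condition reads like a sentence.
static bool is_causally_visible(int32_t pos, int32_t cell_pos) {
    return cell_pos <= pos;
}

static bool is_in_same_chunk(int32_t pos, int32_t cell_pos, int32_t n_attn_chunk) {
    return cell_pos / n_attn_chunk == pos / n_attn_chunk;
}

// "a cached token can be attended to if it is causally visible and lives in the same chunk"
static bool can_attend(int32_t pos, int32_t cell_pos, int32_t n_attn_chunk) {
    return is_causally_visible(pos, cell_pos) && is_in_same_chunk(pos, cell_pos, n_attn_chunk);
}
```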
Yes, we can improve - I'm already doing some improvements in this regard in #13194