llama : greatly reduce output buffer memory usage #6122
Merged
Commits (26)
1fd1918 llama : greatly reduce logits memory usage (compilade)
98914c0 llama : more compact state saving and reloading (compilade)
705d393 llama : fix lctx.n_outputs not being set before building graph (compilade)
25981fc perplexity : adapt to the logits API changes (compilade)
17b45c9 perplexity : fix Winogrande, use correct logits for second choice start (compilade)
d0129e8 perplexity : normalize spaces and punctuation in Winogrande sentences (compilade)
487f89e llama : fix embedding conditions (compilade)
408fcb0 llama : fix llama_get_embeddings_ith when the resulting id is 0 (compilade)
e19cb3a llama : fix wrong n_outputs in llama_set_inputs (compilade)
a57fa7f llama : fix not-skipping outputs of non-causal models (compilade)
711b0bc llama : fix running a batch with n_outputs == 0 (compilade)
d100502 llama : keep same graph topology even when n_outputs == 0 (compilade)
99c37cc ggml : saner ggml_can_repeat with empty tensors (compilade)
6bf7f3f ggml : do not multi-thread ops returning empty tensors (compilade)
09bb15a ggml : make ggml_is_empty public and work with views (compilade)
4551e7e llama : use a vector for ctx->output_ids (compilade)
8b826c5 ggml : skip empty tensors in all backends (compilade)
d04cfaf llama : fix llama_output_reserve nullptr deref when new_size is 0 (compilade)
8f70dcb perplexity : make Winogrande work as it does on master (compilade)
615a3a4 llama : clearer error messages for invalid logits or embeddings ids (compilade)
7d8d6b5 llama : handle errors from llama_output_reserve at call sites (compilade)
5f33a67 perplexity : make hellaswag and multiple-choice outputs identical to … (compilade)
ffa9abd Merge branch 'master' into compilade/smaller-output-buffer (compilade)
e9095ac llama : allow loading state saved with a different ctx size (compilade)
5027d81 llama : minor (ggerganov)
20248e8 readme : update recent API changes, and warn about Vulkan (compilade)
Changes from commit d0129e8e2963c17dd9e17a7bec8f54b981fa949b (perplexity : normalize spaces and punctuation in Winogrande sentences)
Is this change related to the PR? If it isn't, it should probably be in a different PR so that it is easier to review.
It's somewhat related, because I've changed how the Winogrande score is calculated (it didn't use the correct logits for the choice word of the second choice before, and it used choice-word-skipping logic to mitigate the effects), and normalizing the sentences helped make Phi-2_Q4_K_M's score a bit better (72.6125 +/- 1.2533 when normalizing vs 72.0600 +/- 1.2611 when not normalizing). Some of the sentences in the Winogrande dataset I used have weird punctuation and inconsistent spacing, which sometimes slightly skews the log-likelihood, but this only affects 111/1267 of the tasks in https://huggingface.co/datasets/ikawrakow/winogrande-eval-for-llama.cpp.

I could try extracting this into another PR or removing it altogether if it's too out of place, for example if it's better to keep the dataset as-is even if some of the formatting confuses some models.
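For reference, here is a minimal sketch of the kind of normalization meant here, assuming the goal is to collapse repeated whitespace and standardize spacing around commas and periods. This is not the actual code from the PR, and the helper name is hypothetical.

```cpp
// Minimal sketch (not the code from this PR): collapse runs of whitespace and
// normalize spacing around commas and periods in a Winogrande sentence.
#include <cctype>
#include <cstddef>
#include <string>

static std::string normalize_sentence(const std::string & in) {
    std::string out;
    out.reserve(in.size());
    for (size_t i = 0; i < in.size(); ++i) {
        const char c = in[i];
        if (std::isspace((unsigned char) c)) {
            // collapse a run of whitespace into a single space,
            // and drop spaces right before punctuation or at the end
            size_t j = i;
            while (j + 1 < in.size() && std::isspace((unsigned char) in[j + 1])) { ++j; }
            const char next = (j + 1 < in.size()) ? in[j + 1] : '\0';
            if (next != ',' && next != '.' && next != '\0') {
                out += ' ';
            }
            i = j;
        } else {
            out += c;
            // ensure a single space after ',' and '.' when followed by a non-space
            if ((c == ',' || c == '.') && i + 1 < in.size() && !std::isspace((unsigned char) in[i + 1])) {
                out += ' ';
            }
        }
    }
    return out; // e.g. "He said ,it works" -> "He said, it works"
}
```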
I can review the changes in llama.cpp and ggml, but I cannot review these changes. Since `perplexity` is used to test the correctness of many other things, I think it is important that the changes are carefully reviewed. So as it is, either someone else will have to review this PR, or it will have to be split.
In this case I could make Winogrande work as it does on `master`, even though it's wrong and needlessly complicated, and then fix it in a separate PR. All of the benchmarks in `perplexity` (except Winogrande, but I could fix this) should give the exact same results in this PR as on `master`, which should make it relatively easy to test that it works as expected, or to know if I made a mistake somewhere.
"Wrong and needlessly complicated": this basically necessitates a separate PR.
I have been testing hellaswag since there are some changes, and over 400 tasks I see a difference of 0.25 compared to master (75.00 vs 74.75) with llama 7B F16. With mistral F16 there is also a small difference (81.25 vs 81.00). Over the full 10042 tasks the result is basically the same, so it is probably nothing, but I wonder if something changed that could result in a different computation? I thought it could be due to different matrix sizes, but there is also a difference with MMQ.
I've tested with `stories260K.gguf` while showing the logprobs of the 4 choices of each task, and I see that the numbers (compared to `master`) are (mostly) exactly the same for the first choice, while for the other three, the 4 most significant digits match while the rest seem to vary. So if this changes the final result, the choices were already very close to each other. This is either due to some numerical instability in floats, or there's something I did ~~wrong~~ differently. (apparently, it's both)

Okay, while typing this I ran some tests, and I don't understand why, but evaluating the last tokens of the sequences (even without extracting the logits) is apparently needed to get the exact same numbers as `master`. Previously, I simply skipped the evaluation of the last tokens because their output isn't necessary at all. It's weird because the choices are in separate sequences, so the KV cache shouldn't be affected... (tested with Mamba too, and at least my intuition holds there, so the way the KV cache is updated for Transformer-based models makes this different. Why would the KV cells of unrelated sequences be changed slightly like this?) I'll need to run tests for the other benchmarks and will push a fix tomorrow.

I've also noticed that when running the same benchmark with different batch sizes (even on `master`), I see similar differences on batch boundaries. So maybe it's not a big deal? (the differences in logprobs have the same magnitude in both cases; the max difference I saw was around 0.0003). Should I try to keep compatibility with the exact same "rounding errors" as `master`? (EDIT: answering myself: yes, keep compatibility, at least in this PR. Fix implemented in 5f33a67)
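As a rough idea of the kind of check described above, a small helper like the following (not from the PR; the function name and the example values are hypothetical) could compare logprobs dumped from two runs and report the largest absolute difference, which should stay around the ~1e-4 level mentioned here if only floating-point noise is involved.

```cpp
// Hypothetical helper (not part of the PR): compare logprobs dumped from two
// runs (e.g. this branch vs master, or two different batch sizes) and report
// the largest absolute difference between corresponding values.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

static float max_abs_diff(const std::vector<float> & a, const std::vector<float> & b) {
    float max_diff = 0.0f;
    const size_t n = std::min(a.size(), b.size());
    for (size_t i = 0; i < n; ++i) {
        max_diff = std::max(max_diff, std::fabs(a[i] - b[i]));
    }
    return max_diff;
}

int main() {
    // In practice these would be loaded from files written by the two runs.
    std::vector<float> run_a = { -1.234567f, -2.345678f, -3.456789f };
    std::vector<float> run_b = { -1.234570f, -2.345675f, -3.456780f };
    std::printf("max |diff| = %g\n", max_abs_diff(run_a, run_b));
    return 0;
}
```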
See this comment (and the one referenced in it) for an explanation of why the results are different: ggml-org/whisper.cpp#1941 (comment)
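The differences described above are consistent with the fact that floating-point addition is not associative: summing the same values in a different order (as happens when work is split into different batch or matrix sizes) can change the last bits of the result. A tiny standalone illustration, unrelated to any llama.cpp code:

```cpp
// Standalone illustration: float addition is not associative, so the same
// values summed in a different order can give a slightly different result.
#include <cstdio>

int main() {
    const float a =  1e8f;
    const float b = -1e8f;
    const float c =  1.0f;
    const float left  = (a + b) + c; // 0 + 1 -> 1
    const float right = a + (b + c); // b + c rounds back to -1e8 -> 0
    std::printf("(a + b) + c = %g\n", left);
    std::printf("a + (b + c) = %g\n", right);
    return 0;
}
```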