
Total Memory Use #272

Closed
danhalliday opened this issue Dec 14, 2022 · 2 comments

@danhalliday

First of all, this is a wonderful project and lots of fun to play with — thanks for all the hard work that went into it!

I’m wondering about what decides total memory use, particularly on iPhones. At the moment this implementation works well on iOS, as seen in the Objective-C example. But the memory use (> 500 MB for the base model) is obviously on the high side for anything but professional apps (i.e., apps like Photoshop, which are more likely to be running on iPad anyway, where users work for long stretches on specific tasks and are more forgiving if all their other apps get terminated by the system).

At a high level, what are the constraints on total memory usage? Is it basically a fixed quantity related to the work the encoder has to do? Is there any prospect of it coming down much in future, using quantisation or other techniques? Would future use of GPUs (or perhaps even the Apple Neural Engine) reduce the memory requirement, or would that only translate into a speedup in processing time? I’m really just trying to get a rough idea of what levers exist to be pulled, if any.

Thanks again!

@ggerganov ggerganov added the question Further information is requested label Dec 15, 2022
@ggerganov
Member

It's actually possible to drastically reduce the runtime memory compared to what is currently being used.
For example, in the following base.en case:

$  ./main -m ./models/ggml-base.en.bin -f ./samples/jfk.wav
whisper_model_load: loading model from './models/ggml-base.en.bin'
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: f16           = 1
whisper_model_load: type          = 2
whisper_model_load: adding 1607 extra tokens
whisper_model_load: mem_required  =  506.00 MB
whisper_model_load: ggml ctx size =  140.60 MB
whisper_model_load: memory size   =   22.83 MB
whisper_model_load: model size    =  140.54 MB

We currently use a total of 506 MB, but we really only need 140 MB to store the model and ~23 MB to store the KV cache (i.e. the "memory size" above). The rest of the memory currently goes to storing the intermediate tensors that ggml creates during inference. This is because we keep the entire computation graph in memory, but technically we don't need to.
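
For reference, the ~23 MB figure can be reproduced with back-of-the-envelope arithmetic from the hyperparameters in the log above. The sketch below assumes one f16 K tensor and one f16 V tensor per text layer, for both the decoder self-attention over n_text_ctx and the cross-attention over the n_audio_ctx encoder output:

#include <stdio.h>

int main(void) {
    // hyperparameters reported by whisper_model_load for base.en
    const long n_text_layer = 6;
    const long n_text_state = 512;
    const long n_text_ctx   = 448;   // decoder self-attention context
    const long n_audio_ctx  = 1500;  // encoder output seen by the cross-attention
    const long f16          = 2;     // bytes per element

    // one K and one V tensor per layer, hence the leading factor of 2
    long self_kv  = 2 * n_text_layer * n_text_ctx  * n_text_state * f16; //  ~5.25 MB
    long cross_kv = 2 * n_text_layer * n_audio_ctx * n_text_state * f16; // ~17.58 MB

    printf("KV cache: %.2f MB\n", (self_kv + cross_kv) / (1024.0 * 1024.0)); // 22.83 MB
    return 0;
}

So roughly 506 − 140 − 23 ≈ 343 MB of the current footprint is the intermediate/graph storage mentioned above, which is the part that could in principle be reclaimed.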

It will take some modifications in ggml to support this. It's probably not an easy task at the moment for anyone other than me, due to the lack of good documentation of how the library works.

But yes, in theory the memory usage can be reduced.
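
To illustrate the idea only (this is not ggml code, and a real transformer graph has branches and residual connections that need extra bookkeeping): in a plain chain of layers, the intermediate results can ping-pong between two reusable scratch buffers instead of getting one allocation per graph node, so peak memory no longer grows with the size of the graph.

/*
 * Conceptual sketch only (not ggml): reuse two scratch buffers for the
 * intermediates of a simple layer chain instead of keeping every tensor
 * of the computation graph alive.
 */
#include <stdlib.h>
#include <string.h>

typedef void (*layer_fn)(const float *src, float *dst, int n);

void run_layers(layer_fn *layers, int n_layers, float *x, int n) {
    float *scratch[2];
    scratch[0] = x;                          // first layer reads the caller's buffer
    scratch[1] = malloc(n * sizeof(float));  // single extra buffer, reused by every layer

    for (int i = 0; i < n_layers; ++i) {
        layers[i](scratch[i % 2], scratch[(i + 1) % 2], n);
    }

    if (n_layers % 2 == 1) {
        memcpy(x, scratch[1], n * sizeof(float)); // make sure the result ends up in x
    }
    free(scratch[1]);
}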

@danhalliday
Author

This is great to have in mind. Thanks for the detailed information!
