Feature Request: allow mmap to take advantage of hugepage feature which has 10x speedup #12444
Comments
The speedup is 10x. A speedup in what? Token evaluation or token generation? Or just the startup time? I think you mean a 10x speedup in startup time (time to first token).
Yes, it is a 10x speedup in startup. Also, since the data stays in memory, multiple processes can theoretically share the same data file (mmap with the MAP_SHARED flag), so it is very suitable for a server handling concurrent requests.
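To illustrate the sharing claim, here is a hypothetical sketch of how a second process might attach to the same model file, assuming it lives on a hugetlbfs mount such as /mnt/huge; the path, function name, and 2 MiB page-size assumption are illustrative, not from the thread:

```cpp
// Hypothetical sketch: a second process attaching to a model file that
// already resides on a hugetlbfs mount. Because the pages are shared and
// stay resident, no disk I/O is repeated for additional processes.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

void * attach_shared_model(const char * path, size_t hugepage_aligned_len) {
    int fd = open(path, O_RDONLY);  // e.g. "/mnt/huge/model.gguf" (assumed path)
    if (fd < 0) return nullptr;
    // A file on hugetlbfs is huge-page backed by construction; the mapping
    // length must be a multiple of the huge-page size (2 MiB assumed here).
    void * p = mmap(nullptr, hugepage_aligned_len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                      // the mapping outlives the descriptor
    return p == MAP_FAILED ? nullptr : p;
}
```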
I don't have exact token-generation speedup data, but I believe there is also a significant performance gain in inference, because on such ancient hardware (a second-hand Dell PowerEdge R720XD with 2 CPUs at 2.8 GHz, 1.5 TB of DDR3-1333 memory, 7200 RPM SAS HDDs on a 6 Gb/s link, and absolutely no GPU at all) I run DeepSeek-r1:671b Q4_K at a generation speed of 1 to 2 tokens/second. If you ask DeepSeek, ChatGPT, or Gemini to estimate the token speed of such hardware, they will estimate a speed maybe 10x slower than that.
Your solution does not provide a speedup in token generation, only in startup time. I tested it. But nice work. You should make a pull request.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Prerequisites
Feature Description
Using the Linux kernel's "hugetlbfs" gives more than a 10x speedup for extremely large models, e.g. DeepSeek-r1:671b at roughly 400 GB. And the code change is very small (fewer than 10 lines; see below).
Here is the kernel documentation about hugepages.
I have tested this on Ubuntu 22.04, on a second-hand Dell server with 1.5 TB of memory and no GPU, loading and running the huge DeepSeek model. The performance with "hugetlbfs" is significant.
Motivation
The Linux kernel allows user programs to take advantage of huge pages of RAM. This is highly efficient when loading extremely large models. In my experiment with DeepSeek-r1:671b (377 GB), the speedup is 10x! And the code change is truly minimal on top of the current implementation in "llama-mmap.cpp": the only missing pieces are page-size alignment and the mmap flag "MAP_HUGETLB".
I have a rough video explaining what was done.
Here are the steps for using hugepages (my OS is Ubuntu 22.04); a sketch follows below.
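A rough sketch of the typical setup, assuming 2 MiB huge pages and a model of about 400 GB; the values and mount point are illustrative, not necessarily what the video shows:

1. Reserve enough huge pages, e.g. `echo 204800 > /proc/sys/vm/nr_hugepages` (204800 x 2 MiB = 400 GiB).
2. Mount a hugetlbfs filesystem, e.g. `mount -t hugetlbfs none /mnt/huge`.
3. Place the model data in a file on that mount. Note that hugetlbfs files cannot be written with write() or cp; they are populated by mmap-ing the destination file and copying into the mapping.
4. Run llama.cpp against the hugetlbfs-backed file, with mmap modified as described under Possible Implementation below.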
Possible Implementation
For mmap, we just need to make sure mmap's size parameter is page-size aligned, and to bit-or in the flag MAP_HUGETLB.
As for munmap, the default 4 KiB page size has to be replaced with the actual 2 MiB huge-page size. (I didn't try the 1 GiB page size, as not many platforms support it.) A sketch of the change is given below.
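A minimal sketch of that change, assuming 2 MiB huge pages; the helper names are mine, and this is not the actual llama-mmap.cpp patch:

```cpp
#include <sys/mman.h>
#include <cstddef>

// Assumed 2 MiB huge pages; 1 GiB pages would use a different constant
// (and MAP_HUGE_1GB), but support for those is less common.
static constexpr size_t HUGE_PAGE_SIZE = 2u * 1024 * 1024;

// Round n up to the next multiple of a power-of-two alignment.
static size_t align_up(size_t n, size_t alignment) {
    return (n + alignment - 1) & ~(alignment - 1);
}

void * map_model_huge(int fd, size_t file_size) {
    // mmap with MAP_HUGETLB fails with EINVAL unless the length is a
    // multiple of the huge-page size, hence the alignment step.
    size_t len = align_up(file_size, HUGE_PAGE_SIZE);
    void * addr = mmap(nullptr, len, PROT_READ,
                       MAP_SHARED | MAP_HUGETLB, fd, 0);
    return addr == MAP_FAILED ? nullptr : addr;
}

void unmap_model_huge(void * addr, size_t file_size) {
    // The munmap length must use the same huge-page-aligned size, not a
    // 4 KiB-rounded size, because the mapping is managed in 2 MiB units.
    munmap(addr, align_up(file_size, HUGE_PAGE_SIZE));
}
```

Note that fd is assumed to refer to a file on a hugetlbfs mount; MAP_HUGETLB on an ordinary file-backed mapping would fail.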