[SYCL] Overcoming workaround for mmap() allocation on Windows and remove useless wait #13482

Draft: s-Nick wants to merge 3 commits into master from add_win_mmap_support

Conversation

@s-Nick (Collaborator) commented May 12, 2025

This PR removes, on Windows, the workaround for an mmap bug that affects some Intel GPUs on Linux. The bug is not present on Windows, so there is no reason to keep the workaround in place there. This introduces a small OS-dependent split in the codebase, but it brings good performance improvements.
It also removes some wait() calls on copies that are unnecessary in the SYCL backend, since the backend uses in-order queues.
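As a standalone illustration (a minimal sketch, not the backend's actual code), this is why a host-side wait() after a copy is redundant on an in-order queue:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    // An in-order queue executes commands in submission order, like the
    // default queue used by the SYCL backend.
    sycl::queue q{sycl::property::queue::in_order{}};

    std::vector<float> host(1024, 1.0f);
    float *dev = sycl::malloc_device<float>(host.size(), q);

    // No q.wait() is needed between these two commands: the in-order
    // queue guarantees the kernel cannot start before the copy finishes.
    q.memcpy(dev, host.data(), host.size() * sizeof(float));
    q.parallel_for(sycl::range<1>{host.size()}, [=](sycl::id<1> i) {
        dev[i] *= 2.0f;
    });

    // A single wait is only required before the host reads the result.
    q.memcpy(host.data(), dev, host.size() * sizeof(float)).wait();

    sycl::free(dev, q);
    return 0;
}
```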

The work introduced here is based on #13109

N.B. All numbers were measured with GGML_SYCL_DISABLE_OPT=0.

Lunar Lake's performance (this PR)

| model | size | params | backend | ngl | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | pp512 | 1330.42 ± 6.59 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | tg128 | 58.92 ± 0.46 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | pp512 | 2044.01 ± 13.08 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | tg128 | 44.47 ± 0.13 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | pp512 | 320.23 ± 0.97 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | tg128 | 22.66 ± 0.02 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | pp512 | 533.16 ± 1.41 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | tg128 | 15.41 ± 0.44 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | pp512 | 1402.31 ± 7.56 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | tg128 | 28.55 ± 0.06 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | pp512 | 502.78 ± 1.02 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | tg128 | 35.83 ± 0.07 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | pp512 | 807.02 ± 2.71 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | tg128 | 23.57 ± 0.08 |

build: 0e1009f (5334)

Lunar Lake's performance (#13109)

| model | size | params | backend | ngl | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 1323.21 ± 8.43 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 52.47 ± 0.42 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 1994.78 ± 6.69 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 40.50 ± 0.10 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 297.47 ± 0.49 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 21.58 ± 0.08 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 499.53 ± 2.32 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 15.54 ± 0.31 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | pp512 | 907.84 ± 0.56 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | tg128 | 27.54 ± 0.09 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 477.35 ± 0.33 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 33.95 ± 0.07 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 757.61 ± 1.53 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 21.80 ± 0.32 |

build: f7e7d2a (5331)

Battlemage (B580) performance (this PR)

| model | size | params | backend | ngl | threads | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 5 | none | pp512 | 7314.80 ± 23.23 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 5 | none | tg128 | 71.10 ± 2.21 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 5 | none | pp512 | 7419.09 ± 27.47 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 5 | none | tg128 | 88.57 ± 0.12 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 5 | none | pp512 | 2147.78 ± 6.70 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 5 | none | tg128 | 40.59 ± 0.07 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 5 | none | pp512 | 2189.34 ± 2.19 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 5 | none | tg128 | 38.32 ± 0.02 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 5 | none | pp512 | 5605.63 ± 22.70 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 5 | none | tg128 | 72.54 ± 0.29 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 5 | none | pp512 | 3002.45 ± 4.25 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 5 | none | tg128 | 62.49 ± 0.04 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 5 | none | pp512 | 3103.20 ± 3.79 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 5 | none | tg128 | 58.64 ± 0.01 |

build: 0e1009f (5334)

Battlemage (B580) performance (#13109)

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 5 | none | 0 | pp512 | 7067.24 ± 53.67 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 5 | none | 0 | tg128 | 64.51 ± 0.33 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 5 | none | 0 | pp512 | 7132.89 ± 28.96 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 5 | none | 0 | tg128 | 78.58 ± 0.19 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 5 | none | 0 | pp512 | 2109.49 ± 2.46 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 5 | none | 0 | tg128 | 38.37 ± 0.11 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 5 | none | 0 | pp512 | 2143.62 ± 0.99 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 5 | none | 0 | tg128 | 36.33 ± 0.03 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 5 | none | 0 | pp512 | 5322.20 ± 22.77 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 5 | none | 0 | tg128 | 64.48 ± 0.08 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 5 | none | 0 | pp512 | 2936.43 ± 7.73 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 5 | none | 0 | tg128 | 57.50 ± 0.11 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 5 | none | 0 | pp512 | 3024.06 ± 8.17 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 5 | none | 0 | tg128 | 54.19 ± 0.05 |

build: f7e7d2a (5331)

@s-Nick s-Nick requested a review from Alcpz May 12, 2025 13:04
@github-actions github-actions bot added the examples, ggml (changes relating to the ggml tensor library for machine learning), and SYCL (https://en.wikipedia.org/wiki/SYCL, a GPU programming language) labels May 12, 2025
@NeoZhangJianyu (Collaborator)

@s-Nick
The title of this PR is about mmap(), but it also contains code changes to other functions.

Could you remove the unrelated code changes from this PR?

@s-Nick s-Nick changed the title [SYCL] Overcoming workaround for mmap() allocation on Windows [SYCL] Overcoming workaround for mmap() allocation on Windows and remove useless wait May 15, 2025
s-Nick added 3 commits May 16, 2025 09:01
- The default queue is in-order, so many synchronizations with the host are unnecessary.
- After some testing I found that mmap is supported on Windows and for many GPUs on Linux, so the Windows workaround is removed since it is not necessary (sketched below).
- The SYCL backend introduced a workaround that allows running llama-bench without specifying the `--mmap 0` flag.
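A hypothetical sketch of the kind of OS gating the second commit describes; the helper names and the staging path are illustrative assumptions, not the actual backend symbols:

```cpp
#include <sycl/sycl.hpp>
#include <cstring>

// Hypothetical helper (illustrative name): compile the Linux-only mmap
// workaround in only where the driver bug can occur.
static bool needs_mmap_workaround() {
#ifdef _WIN32
    return false;  // the bug is not present on Windows
#else
    return true;   // some Intel GPUs on Linux are affected
#endif
}

// Illustrative copy path: stage possibly mmap-backed memory through a
// pinned host buffer only on platforms that need the workaround.
static void copy_to_device(sycl::queue &q, void *dst, const void *src, size_t n) {
    if (needs_mmap_workaround()) {
        void *staging = sycl::malloc_host(n, q);
        std::memcpy(staging, src, n);
        q.memcpy(dst, staging, n).wait();  // wait before freeing the staging buffer
        sycl::free(staging, q);
    } else {
        q.memcpy(dst, src, n);  // direct copy; the in-order queue handles ordering
    }
}
```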
@s-Nick s-Nick force-pushed the add_win_mmap_support branch from 0e1009f to 083f56b on May 16, 2025 08:03
@NeoZhangJianyu (Collaborator)

All the wait() calls in the SYCL backend have been confirmed to be needed for correct values.
Please don't remove them without detailed testing.
