[SYCL] Overcoming workaround for mmap() allocation on Windows and remove useless wait #13482

Draft: s-Nick wants to merge 3 commits into master from add_win_mmap_support

Conversation

@s-Nick (Collaborator) commented May 12, 2025

This PR removes, on Windows, the workaround for an mmap bug that affects some Intel GPUs on Linux. The bug is not present on Windows, so there is no reason to keep the workaround in place there. This introduces a small OS-dependent split in the codebase, but it brings good performance improvements.
It also removes some wait() calls on copies that are unnecessary in the SYCL backend, since the backend uses in-order queues.
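As a standalone illustration (a minimal sketch, not the backend's actual code), this is why a host-side wait() after a copy is redundant on an in-order queue:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    // An in-order queue executes commands in submission order, like the
    // default queue used by the SYCL backend.
    sycl::queue q{sycl::property::queue::in_order{}};

    std::vector<float> host(1024, 1.0f);
    float *dev = sycl::malloc_device<float>(host.size(), q);

    // No q.wait() is needed between these two commands: the in-order
    // queue guarantees the kernel cannot start before the copy finishes.
    q.memcpy(dev, host.data(), host.size() * sizeof(float));
    q.parallel_for(sycl::range<1>{host.size()}, [=](sycl::id<1> i) {
        dev[i] *= 2.0f;
    });

    // A single wait is only required before the host reads the result.
    q.memcpy(host.data(), dev, host.size() * sizeof(float)).wait();

    sycl::free(dev, q);
    return 0;
}
```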

The work introduced here is based on #13109

N.B. All numbers were measured with GGML_SYCL_DISABLE_OPT=0.

Lunar Lake's performance (this PR)

| model | size | params | backend | ngl | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | pp512 | 1330.42 ± 6.59 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | tg128 | 58.92 ± 0.46 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | pp512 | 2044.01 ± 13.08 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | tg128 | 44.47 ± 0.13 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | pp512 | 320.23 ± 0.97 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | tg128 | 22.66 ± 0.02 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | pp512 | 533.16 ± 1.41 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | tg128 | 15.41 ± 0.44 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | pp512 | 1402.31 ± 7.56 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | tg128 | 28.55 ± 0.06 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | pp512 | 502.78 ± 1.02 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | tg128 | 35.83 ± 0.07 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | pp512 | 807.02 ± 2.71 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | tg128 | 23.57 ± 0.08 |

build: 0e1009f (5334)

Lunar Lake's performance (#13109)

| model | size | params | backend | ngl | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 1323.21 ± 8.43 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 52.47 ± 0.42 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | pp512 | 1994.78 ± 6.69 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | none | 0 | tg128 | 40.50 ± 0.10 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 297.47 ± 0.49 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 21.58 ± 0.08 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | pp512 | 499.53 ± 2.32 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | none | 0 | tg128 | 15.54 ± 0.31 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | pp512 | 907.84 ± 0.56 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | none | 0 | tg128 | 27.54 ± 0.09 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 477.35 ± 0.33 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 33.95 ± 0.07 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | pp512 | 757.61 ± 1.53 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | none | 0 | tg128 | 21.80 ± 0.32 |

build: f7e7d2a (5331)

Battlemage (B580) performance (this PR)

| model | size | params | backend | ngl | threads | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 5 | none | pp512 | 7314.80 ± 23.23 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 5 | none | tg128 | 71.10 ± 2.21 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 5 | none | pp512 | 7419.09 ± 27.47 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 5 | none | tg128 | 88.57 ± 0.12 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 5 | none | pp512 | 2147.78 ± 6.70 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 5 | none | tg128 | 40.59 ± 0.07 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 5 | none | pp512 | 2189.34 ± 2.19 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 5 | none | tg128 | 38.32 ± 0.02 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 5 | none | pp512 | 5605.63 ± 22.70 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 5 | none | tg128 | 72.54 ± 0.29 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 5 | none | pp512 | 3002.45 ± 4.25 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 5 | none | tg128 | 62.49 ± 0.04 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 5 | none | pp512 | 3103.20 ± 3.79 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 5 | none | tg128 | 58.64 ± 0.01 |

build: 0e1009f (5334)

Battlemage (B580) performance (#13109)

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 5 | none | 0 | pp512 | 7067.24 ± 53.67 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | 5 | none | 0 | tg128 | 64.51 ± 0.33 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 5 | none | 0 | pp512 | 7132.89 ± 28.96 |
| qwen2 1.5B Q4_K - Medium | 1.04 GiB | 1.78 B | SYCL | 99 | 5 | none | 0 | tg128 | 78.58 ± 0.19 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 5 | none | 0 | pp512 | 2109.49 ± 2.46 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | 5 | none | 0 | tg128 | 38.37 ± 0.11 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 5 | none | 0 | pp512 | 2143.62 ± 0.99 |
| llama 7B Q4_K - Medium | 3.80 GiB | 6.74 B | SYCL | 99 | 5 | none | 0 | tg128 | 36.33 ± 0.03 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 5 | none | 0 | pp512 | 5322.20 ± 22.77 |
| gemma2 2B Q4_K - Medium | 1.59 GiB | 2.61 B | SYCL | 99 | 5 | none | 0 | tg128 | 64.48 ± 0.08 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 5 | none | 0 | pp512 | 2936.43 ± 7.73 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | 5 | none | 0 | tg128 | 57.50 ± 0.11 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 5 | none | 0 | pp512 | 3024.06 ± 8.17 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | SYCL | 99 | 5 | none | 0 | tg128 | 54.19 ± 0.05 |

build: f7e7d2a (5331)

@s-Nick s-Nick requested a review from Alcpz May 12, 2025 13:04
@github-actions github-actions bot added the examples, ggml (changes relating to the ggml tensor library for machine learning), and SYCL (https://en.wikipedia.org/wiki/SYCL, a GPU programming language) labels May 12, 2025
@NeoZhangJianyu (Collaborator)

@s-Nick
The title of this PR is about mmap(), but it also contains code changes to other functions.

Could you remove the unrelated code changes from this PR?

@s-Nick s-Nick changed the title [SYCL] Overcoming workaround for mmap() allocation on Windows [SYCL] Overcoming workaround for mmap() allocation on Windows and remove useless wait May 15, 2025
s-Nick added 3 commits May 16, 2025 09:01
- The default queue is in-order, so many synchronizations with the host are unnecessary.
- After some testing I found that mmap is supported on Windows and for many GPUs on Linux, so the Windows workaround is removed since it is not necessary (sketched below).
- The SYCL backend introduced a workaround that allows running llama-bench without specifying the `--mmap 0` flag.
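A hypothetical sketch of the kind of OS gating the second commit describes; the helper names and the staging path are illustrative assumptions, not the actual backend symbols:

```cpp
#include <sycl/sycl.hpp>
#include <cstring>

// Hypothetical helper (illustrative name): compile the Linux-only mmap
// workaround in only where the driver bug can occur.
static bool needs_mmap_workaround() {
#ifdef _WIN32
    return false;  // the bug is not present on Windows
#else
    return true;   // some Intel GPUs on Linux are affected
#endif
}

// Illustrative copy path: stage possibly mmap-backed memory through a
// pinned host buffer only on platforms that need the workaround.
static void copy_to_device(sycl::queue &q, void *dst, const void *src, size_t n) {
    if (needs_mmap_workaround()) {
        void *staging = sycl::malloc_host(n, q);
        std::memcpy(staging, src, n);
        q.memcpy(dst, staging, n).wait();  // wait before freeing the staging buffer
        sycl::free(staging, q);
    } else {
        q.memcpy(dst, src, n);  // direct copy; the in-order queue handles ordering
    }
}
```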
@s-Nick s-Nick force-pushed the add_win_mmap_support branch from 0e1009f to 083f56b on May 16, 2025 08:03
@NeoZhangJianyu (Collaborator)

All the wait() calls in the SYCL backend have been confirmed to be needed for correct values.
Please don't remove them without detailed testing.
