sycl: disable reorder for sycl mulmat #13536

Status: Open, 1 commit to merge into base: master


@sgeor255 (Contributor) commented May 14, 2025

The reorder optimisation introduced a prompt-processing performance regression for Q4_0 models. This PR disables reorder for the SYCL mulmat, which is the culprit of this regression.

Some performance numbers on Arc A770:

  • Before this PR with GGML_SYCL_DISABLE_OPT=1
| model | size | params | backend | ngl | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | pp512 | 4438.24 ± 4.02 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | tg128 | 40.81 ± 0.34 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | pp512 | 1711.58 ± 2.02 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | tg128 | 29.92 ± 0.21 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | none | pp512 | 1709.65 ± 1.31 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | none | tg128 | 29.90 ± 0.20 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | SYCL | 99 | none | pp512 | 1663.50 ± 2.72 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | SYCL | 99 | none | tg128 | 27.32 ± 0.23 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | pp512 | 2453.67 ± 1.69 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | tg128 | 36.18 ± 0.38 |
| qwen3 4B Q4_0 | 2.20 GiB | 4.02 B | SYCL | 99 | none | pp512 | 2237.05 ± 1.96 |
| qwen3 4B Q4_0 | 2.20 GiB | 4.02 B | SYCL | 99 | none | tg128 | 27.97 ± 0.26 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | SYCL | 99 | none | pp512 | 2246.32 ± 1.79 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | SYCL | 99 | none | tg128 | 27.76 ± 0.21 |

build: 24e86ca (5377)

  • Before this PR with GGML_SYCL_DISABLE_OPT=0
| model | size | params | backend | ngl | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | pp512 | 4092.76 ± 7.60 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | tg128 | 45.07 ± 0.22 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | pp512 | 1468.83 ± 0.91 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | tg128 | 33.96 ± 0.19 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | none | pp512 | 1463.11 ± 0.75 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | none | tg128 | 34.26 ± 0.24 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | SYCL | 99 | none | pp512 | 1406.29 ± 2.22 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | SYCL | 99 | none | tg128 | 31.41 ± 0.30 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | pp512 | 2165.61 ± 1.38 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | tg128 | 39.70 ± 0.25 |
| qwen3 4B Q4_0 | 2.20 GiB | 4.02 B | SYCL | 99 | none | pp512 | 1989.89 ± 1.32 |
| qwen3 4B Q4_0 | 2.20 GiB | 4.02 B | SYCL | 99 | none | tg128 | 31.69 ± 0.29 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | SYCL | 99 | none | pp512 | 1990.26 ± 2.98 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | SYCL | 99 | none | tg128 | 30.98 ± 0.91 |

build: 24e86ca (5377)

  • This PR with GGML_SYCL_DISABLE_OPT=0
| model | size | params | backend | ngl | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | pp512 | 4448.51 ± 10.81 |
| qwen2 1.5B Q4_0 | 1013.62 MiB | 1.78 B | SYCL | 99 | none | tg128 | 45.39 ± 0.33 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | pp512 | 1715.47 ± 1.71 |
| llama 7B Q4_0 | 3.57 GiB | 6.74 B | SYCL | 99 | none | tg128 | 34.33 ± 0.03 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | none | pp512 | 1716.05 ± 1.81 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | SYCL | 99 | none | tg128 | 34.25 ± 0.22 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | SYCL | 99 | none | pp512 | 1662.76 ± 0.68 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | SYCL | 99 | none | tg128 | 31.59 ± 0.23 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | pp512 | 2455.08 ± 2.67 |
| phi3 3B Q4_0 | 2.03 GiB | 3.82 B | SYCL | 99 | none | tg128 | 39.73 ± 0.30 |
| qwen3 4B Q4_0 | 2.20 GiB | 4.02 B | SYCL | 99 | none | pp512 | 2242.72 ± 1.88 |
| qwen3 4B Q4_0 | 2.20 GiB | 4.02 B | SYCL | 99 | none | tg128 | 31.87 ± 0.27 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | SYCL | 99 | none | pp512 | 2252.19 ± 1.53 |
| qwen3 4B Q4_0 | 2.21 GiB | 4.02 B | SYCL | 99 | none | tg128 | 31.79 ± 0.24 |

build: 24e86ca (5377)
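
To make the size of the regression easier to read off, here is a minimal sketch that computes the relative change against the `GGML_SYCL_DISABLE_OPT=1` run. It uses only the llama 7B Q4_0 (3.57 GiB) rows from the tables above; the dictionary keys and the helper names `relative_change` and `compare_to_baseline` are illustrative and not part of llama.cpp.

```python
# Throughput in tokens/s for llama 7B Q4_0 (3.57 GiB), copied from the tables above.
# The configuration labels are made up for this sketch.
RESULTS = {
    "before, DISABLE_OPT=1": {"pp512": 1711.58, "tg128": 29.92},
    "before, DISABLE_OPT=0": {"pp512": 1468.83, "tg128": 33.96},
    "this PR, DISABLE_OPT=0": {"pp512": 1715.47, "tg128": 34.33},
}


def relative_change(new: float, old: float) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0


def compare_to_baseline(results, baseline_key):
    """Map each non-baseline configuration to its per-test change in percent."""
    baseline = results[baseline_key]
    return {
        name: {test: relative_change(value, baseline[test]) for test, value in row.items()}
        for name, row in results.items()
        if name != baseline_key
    }


if __name__ == "__main__":
    changes = compare_to_baseline(RESULTS, "before, DISABLE_OPT=1")
    for name, per_test in changes.items():
        for test, pct in per_test.items():
            print(f"{name:24s} {test}: {pct:+.2f}%")
```

For this model, enabling reorder before the PR costs roughly 14% on pp512 while improving tg128; with this PR, pp512 is back within about 0.3% of the baseline and the tg128 gain is kept.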

@github-actions bot added the labels "ggml" (changes relating to the ggml tensor library for machine learning) and "SYCL" (https://en.wikipedia.org/wiki/SYCL - GPU programming language) on May 14, 2025

@Rbiessy (Collaborator) left a comment

Note this is related to the discussions in #13003 (comment)
I don't know why I somehow couldn't reproduce these regressions at some point but I suspected this mul_mat could be the issue.

@NeoZhangJianyu (Collaborator)

@sgeor255
How about llama2-7B-Q4_0?

@sgeor255 (Contributor, Author) commented May 14, 2025

> @sgeor255 How about llama2-7B-Q4_0?

@NeoZhangJianyu results for this model are included in the PR description. Here's the list of models I ran in the order they are listed in the PR description:

  • DeepSeek-R1-Distill-Qwen-1.5B-Q4_0
  • emma-500-llama2-7b-Q4_0
  • llama-2-7b.Q4_0
  • Meta-Llama-3-8B-Instruct.Q4_0
  • Phi-3.5-mini-instruct-Q4_0
  • qwen3-4b-q4_0 (looks like I accidentally ran it twice)

@NeoZhangJianyu (Collaborator)

1715.47 ± 1.71
llama 7B Q4_0 3.57 GiB 6.74 B SYCL 99 none tg128 34.33 ± 0.03

OK, got it!

Thank you!

@NeoZhangJianyu (Collaborator)

I tested this PR's code from the private branch on an Arc 770.
The output of llama2-7B-Q4_0 changes considerably.
The performance barely changes.
I tried oneAPI 2025.0 and 2025.1.1 on Ubuntu 22.04. Both give the same result.

Could you check it again?

The commands and log follow.


git clone -b svet/sycl-mulmat-disable-reorder  https://github.com/sgeor255/llama.cpp svet
cd svet
./examples/sycl/run-llama2.sh
./examples/sycl/build.sh

#default setting:

GGML_SYCL_DISABLE_OPT: 1
GGML_SYCL_DISABLE_GRAPH: 1
GGML_SYCL_PRIORITIZE_DMMV: 0

Step 1: Get to know the basics of web design
Step 2: Set up a web hosting account
Step 3: Download a free website builder
Step 4: Set up a domain name
Step 5: Design your website
Step 6: Add content to your site
Step 7: Make the site responsive
Step 8: Add a contact form
Step 9: Add a social media share button
Step 10: Advertise your website


llama_perf_context_print: prompt eval time =     211.12 ms /    19 tokens (   11.11 ms per token,    90.00 tokens per second)
llama_perf_context_print:        eval time =   13956.09 ms /   399 runs   (   34.98 ms per token,    28.59 tokens per second)

#enable reorder

export GGML_SYCL_DISABLE_OPT=0


GGML_SYCL_DISABLE_OPT: 0
GGML_SYCL_DISABLE_GRAPH: 1
GGML_SYCL_PRIORITIZE_DMMV: 0


Step 1: Select the Website Name
Step 2: Select the Website Type
Step 3: Choose a Website Theme
Step 4: Choose a Website Name
Step 5: Choose a Website Theme
Step 6: Choose a Website Name
Step 7: Choose a Website Theme
Step 8: Choose a Website Name
Step 9: Choose a Website Theme
Step 10: Choose a Website Theme

llama_perf_context_print: prompt eval time =     210.16 ms /    19 tokens (   11.06 ms per token,    90.41 tokens per second)
llama_perf_context_print:        eval time =   13822.87 ms /   399 runs   (   34.64 ms per token,    28.87 tokens per second)

@NeoZhangJianyu (Collaborator) commented May 15, 2025

I tested on a B570.
The performance increases, but the output still differs considerably from the base code.

Please focus on the incorrect-output issue.

@sgeor255 (Contributor, Author)

@NeoZhangJianyu I wasn't able to reproduce the issue; the output looks good when I run llama-cli with the same prompt and oneAPI 2025.1.1.

Arc 770

master
Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 0
  GGML_SYCL_DISABLE_GRAPH: 1
  GGML_SYCL_PRIORITIZE_DMMV: 0
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: yes

sampler seed: 0
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 400, n_keep = 1

 Building a website can be done in 10 simple steps:
Step 1: Get domain and hosting
Step 2: Choose a theme
Step 3: Choose your colors
Step 4: Build your homepage
Step 5: Build your pages
Step 6: Build your blog
Step 7: Build your contact page
Step 8: Build your about page
Step 9: Add social media icons
Step 10: Add some copy
How much does it cost to build a website?
Is it easy to create a website?
What are the benefits of building a website?
How can you create a website for free?
There are many different ways to build a website, and the best way for you depends on your goals, budget, and expertise. However, there are some basic steps you can take to get started.
The first step is to choose a domain name and hosting plan. A domain name is your website’s address (e.g., www.example.com), while hosting is where your website files live on the internet. You’ll need to purchase both of these from a third-party provider.
Once you’ve got your domain and hosting, you’ll need to choose a website builder. A website builder is a platform that allows you to create and edit your website without having to know how to code. There are many different website builders to choose from, each with their own set of features and pricing plans.
Once you’ve chosen a website builder, the next step is to choose a theme. A theme is the look and feel of your website. There are many different themes to choose from, and each one will have its own set of features. You can usually find a demo of a theme to help you decide if it’s the right fit for your website.
After you’ve chosen a theme, it’s time to choose your colors. Colors can have a big impact on your website’s look and

llama_perf_sampler_print:    sampling time =       8.38 ms /   419 runs   (    0.02 ms per token, 50000.00 tokens per second)
llama_perf_context_print:        load time =    2292.39 ms
llama_perf_context_print: prompt eval time =     248.65 ms /    19 tokens (   13.09 ms per token,    76.41 tokens per second)
llama_perf_context_print:        eval time =   11483.49 ms /   399 runs   (   28.78 ms per token,    34.75 tokens per second)
llama_perf_context_print:       total time =   11751.25 ms /   418 tokens

this branch
Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 0
  GGML_SYCL_DISABLE_GRAPH: 1
  GGML_SYCL_PRIORITIZE_DMMV: 0
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: yes

sampler seed: 0
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 400, n_keep = 1

 Building a website can be done in 10 simple steps:
Step 1: Get domain and hosting
Step 2: Choose a theme
Step 3: Choose your colors
Step 4: Build your homepage
Step 5: Build your pages
Step 6: Build your blog
Step 7: Build your contact page
Step 8: Build your about page
Step 9: Add social media icons
Step 10: Add some copy
How much does it cost to build a website?
Is it easy to create a website?
What are the benefits of building a website?
How can you create a website for free?
There are many different ways to build a website, and the best way for you depends on your goals, budget, and expertise. However, there are some basic steps you can take to get started.
The first step is to choose a domain name and hosting plan. A domain name is your website’s address (e.g., www.example.com), while hosting is where your website files live on the internet. You’ll need to purchase both of these from a third-party provider.
Once you’ve got your domain and hosting, you’ll need to choose a website builder. A website builder is a platform that allows you to create and edit your website without having to know how to code. There are many different website builders to choose from, each with their own set of features and pricing plans.
Once you’ve chosen a website builder, the next step is to choose a theme. A theme is the look and feel of your website. There are many different themes to choose from, and each one will have its own set of features. You can usually find a demo of a theme to help you decide if it’s the right fit for your website.
After you’ve chosen a theme, it’s time to choose your colors. Colors can have a big impact on your website’s look and

llama_perf_sampler_print:    sampling time =       8.11 ms /   419 runs   (    0.02 ms per token, 51651.87 tokens per second)
llama_perf_context_print:        load time =    1363.20 ms
llama_perf_context_print: prompt eval time =     199.47 ms /    19 tokens (   10.50 ms per token,    95.25 tokens per second)
llama_perf_context_print:        eval time =   12521.53 ms /   399 runs   (   31.38 ms per token,    31.87 tokens per second)
llama_perf_context_print:       total time =   12740.77 ms /   418 tokens

BM80

master
Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 0
  GGML_SYCL_DISABLE_GRAPH: 1
  GGML_SYCL_PRIORITIZE_DMMV: 0
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: yes

sampler seed: 0
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 400, n_keep = 1

 Building a website can be done in 10 simple steps:
Step 1: Get to know the basics of web design
Step 2: Set up a web hosting account
Step 3: Download a free website builder
Step 4: Set up a domain name
Step 5: Design your website
Step 6: Add content to your site
Step 7: Customize your site
Step 8: Test your site
Step 9: Publish your site
Step 10: Optimize and promote your site
1. What is web design?
Web design is the process of designing a website. It includes the creation of both the visual and functional elements of a website, such as layout, navigation, and content. Web designers use a variety of tools and technologies to create websites that are both visually appealing and functional.
2. What is the purpose of web design?
The purpose of web design is to create a website that is visually appealing and easy to use. A well-designed website should be user-friendly and easy to navigate, with clear and concise content. Web designers use a variety of tools and technologies to create websites that are both visually appealing and functional.
3. What are the different types of web design?
There are a variety of different types of web design, depending on the purpose of the website. Some of the most common types of web design include:
4. What are the different stages of web design?
There are four main stages of web design:
The first stage of web design is planning and research. This stage involves determining the purpose and goals of the website, as well as the target audience.
The second stage of web design is the actual design process. This stage involves creating the visual elements and layout of the website, as well as determining the content that will be included.
The third stage of web design is the development process. This stage involves creating the functional elements of

llama_perf_sampler_print:    sampling time =       9.58 ms /   419 runs   (    0.02 ms per token, 43736.95 tokens per second)
llama_perf_context_print:        load time =    1771.44 ms
llama_perf_context_print: prompt eval time =    1231.65 ms /    19 tokens (   64.82 ms per token,    15.43 tokens per second)
llama_perf_context_print:        eval time =   84604.96 ms /   399 runs   (  212.04 ms per token,     4.72 tokens per second)
llama_perf_context_print:       total time =   85866.75 ms /   418 tokens

this branch
Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 0
  GGML_SYCL_DISABLE_GRAPH: 1
  GGML_SYCL_PRIORITIZE_DMMV: 0
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: yes

sampler seed: 0
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 400, n_keep = 1

 Building a website can be done in 10 simple steps:
Step 1: Get to know the basics of web design
Step 2: Set up a web hosting account
Step 3: Download a free website builder
Step 4: Set up a domain name
Step 5: Design your website
Step 6: Add content to your site
Step 7: Customize your site
Step 8: Test your site
Step 9: Publish your site
Step 10: Optimize and promote your site
1. What is web design?
Web design is the process of designing a website. It includes the creation of both the visual and functional elements of a website, such as layout, navigation, and content. Web designers use a variety of tools and technologies to create websites that are both visually appealing and functional.
2. What is the purpose of web design?
The purpose of web design is to create a website that is visually appealing and easy to use. A well-designed website should be user-friendly and easy to navigate, with clear and concise content. Web designers use a variety of tools and technologies to create websites that are both visually appealing and functional.
3. What are the different types of web design?
There are a variety of different types of web design, depending on the purpose of the website. Some of the most common types of web design include:
4. What are the different stages of web design?
There are four main stages of web design:
The first stage of web design is planning and research. This stage involves determining the purpose and goals of the website, as well as the target audience.
The second stage of web design is the actual design process. This stage involves creating the visual elements and layout of the website, as well as determining the content that will be included.
The third stage of web design is the development process. This stage involves creating the functional elements of

llama_perf_sampler_print:    sampling time =       9.63 ms /   419 runs   (    0.02 ms per token, 43487.29 tokens per second)
llama_perf_context_print:        load time =    1678.24 ms
llama_perf_context_print: prompt eval time =    1216.09 ms /    19 tokens (   64.00 ms per token,    15.62 tokens per second)
llama_perf_context_print:        eval time =   84765.70 ms /   399 runs   (  212.45 ms per token,     4.71 tokens per second)
llama_perf_context_print:       total time =   86012.26 ms /   418 tokens
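
When comparing logs like the ones above, it can help to recompute the throughput from the raw timings rather than rely on the rounded figures. The following is a rough sketch, assuming the `llama_perf_context_print` line layout shown in these logs; the regular expression and the function name `tokens_per_second` are illustrative and not part of llama.cpp.

```python
import re

# Matches the "prompt eval time" and "eval time" lines as they appear in the logs above.
# The "load time" and "total time" lines are deliberately not matched.
PERF_LINE = re.compile(
    r"llama_perf_context_print:\s+(?P<stage>prompt eval|eval) time =\s+"
    r"(?P<ms>[\d.]+) ms /\s+(?P<count>\d+) (?:tokens|runs)"
)


def tokens_per_second(line: str):
    """Return (stage, tokens per second) for a perf line, or None if it does not match."""
    match = PERF_LINE.search(line)
    if match is None:
        return None
    seconds = float(match["ms"]) / 1000.0
    return match["stage"], int(match["count"]) / seconds


if __name__ == "__main__":
    # Two lines copied from the Arc 770 "this branch" log.
    log = [
        "llama_perf_context_print: prompt eval time =     199.47 ms /    19 tokens (   10.50 ms per token,    95.25 tokens per second)",
        "llama_perf_context_print:        eval time =   12521.53 ms /   399 runs   (   31.38 ms per token,    31.87 tokens per second)",
    ]
    for line in log:
        stage, tps = tokens_per_second(line)
        print(f"{stage}: {tps:.2f} tokens/s")
```

Note that a 19-token prompt gives a much noisier prompt-processing figure than the pp512 benchmark runs, so the two are not directly comparable.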

4 participants