Question About use_adreno_kernels Threshold for Q4 MatMul on Adreno 750 #17733
forforever73
started this conversation in
General
Replies: 0 comments
@lhez Sorry to take up your time. I'm running a new model on an Adreno 750 GPU and noticed that, for Q4 weights, the optimized kernel CL_mul_mat_Ab_Bi_8x4 seems to be used only when use_adreno_kernels() returns true. However, my model has several matmul shapes like:
A: [256, 1280, 1, 1]
B: [256, 512, 1, 1]
→ Output: [1280, 512, 1, 1]
For these shapes the matmul falls back to kernel_mul_mat_q4_0_f32_1d_8x_flat, which is about 10× slower on Adreno 750. I experimented by lowering the internal threshold to:
int64_t threshold_ne0 = 256;
With the lower threshold the Adreno kernels are used, performance improves dramatically, and the model's PPL shows no meaningful change.
So what was the original reasoning behind the use_adreno_kernels() threshold? If I reduce the threshold to 256, is there any potential risk I should be aware of?
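To make the question concrete, here is a minimal sketch of the kind of shape gate I am assuming sits behind use_adreno_kernels(). It is not the actual ggml code: `TensorShape` and `passes_shape_gate` are hypothetical names, the real function may check more than dimensions, and the value 512 used below is only an assumed stand-in for "some threshold above 256".

```cpp
#include <cstdint>

// Hypothetical stand-in for a ggml tensor: only the four dimensions matter here.
struct TensorShape {
    int64_t ne[4];
};

// Sketch of a shape gate (an assumption, not the real use_adreno_kernels()):
// both leading dimensions must reach their thresholds, and the tensor must be
// 2D, i.e. ne[2] == ne[3] == 1.
bool passes_shape_gate(const TensorShape &t,
                       int64_t threshold_ne0,
                       int64_t threshold_ne1) {
    return t.ne[0] >= threshold_ne0 &&
           t.ne[1] >= threshold_ne1 &&
           t.ne[2] == 1 && t.ne[3] == 1;
}
```

Under these assumptions, A = [256, 1280, 1, 1] fails the gate when threshold_ne0 is 512, because ne[0] = 256 is too small, and passes once threshold_ne0 is lowered to 256. That matches the fallback I observe before the change and the Adreno path I observe after it.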