
Conversation


@airMeng airMeng commented May 27, 2024

This issue was exposed while working on #6408.

I will split #6408 into several smaller PRs for easier review; the task list will be updated as new issues are exposed:

  • Align GEMM dispatch with the CUDA backend
  • Remove all global variables and pass a context instead
  • Leave async copies to the common backend
  • Separate the GEMM code out of ggml-sycl.cpp for maintainability

@joeatodd (Contributor) left a comment

This is a welcome addition since it simplifies the MatMul dispatch 😄

We spotted a couple of issues with the code; I've added comments. I hope they're helpful.

ggml-sycl.cpp Outdated
Comment on lines 15237 to 15238
use_mul_mat_vec_q = use_mul_mat_vec_q; // Check dp4a
use_mul_mat_q = use_mul_mat_q ; // check dp4a
These lines don't do anything.

@airMeng (author):

There will be a refactor of the SYCL compute-capability checks in #6408; these lines are kept here as a reminder.
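
A minimal sketch of what such a dp4a gate might look like after the #6408 refactor (an assumption for illustration only; has_dp4a() is a hypothetical helper, not an existing ggml-sycl API):

#include <sycl/sycl.hpp>

// Hypothetical sketch: gate the quantized-matmul paths on a real dp4a
// capability query instead of the current no-op self-assignments.
static bool has_dp4a(const sycl::device & dev) {
    // Conservative placeholder until a proper device query is wired up:
    // treat only GPUs as dp4a-capable.
    return dev.is_gpu();
}

// Intended use (assumed), replacing the self-assignments above:
//   use_mul_mat_vec_q = use_mul_mat_vec_q && has_dp4a(dev);
//   use_mul_mat_q     = use_mul_mat_q && has_dp4a(dev);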

ggml-sycl.cpp Outdated
use_mul_mat_vec_q = use_mul_mat_vec_q; // Check dp4a
use_mul_mat_q = use_mul_mat_q ; // check dp4a
#ifdef SYCL_USE_XMX
use_mul_mat_q = use_mul_mat_q && (!fp16_performance_good || src1->ne[1] <= MMQ_MAX_BATCH_SIZE);

The logic in this block is a little confusing:
(!fp16_performance_good || src1->ne[1] <= MMQ_MAX_BATCH_SIZE).

It says: use MMQ if either 1) FP16 perf is bad, or 2) the number of columns in src1 is less than or equal to the maximum.

Is this the intention?
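
For reference, a minimal standalone restatement of that condition (the function and parameter names here are illustrative, not from the PR):

#include <cstdint>

// fp16_fast: FP16 matmul is fast on this device; n_cols: src1->ne[1].
static bool keep_mmq(bool fp16_fast, int64_t n_cols, int64_t mmq_max_batch_size) {
    // MMQ stays enabled when FP16 is slow, or when the batch is small
    // enough; large batches on FP16-fast hardware take the FP16 path.
    return !fp16_fast || n_cols <= mmq_max_batch_size;
}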

@airMeng (author):

It just aligns with the CUDA backend's logic.

@github-actions bot added the build and SYCL labels May 27, 2024
@AidanBeltonS (Contributor) left a comment

I am seeing a new test failure on Arc A770; is this to be expected?

  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): GGML_ASSERT: /builds/perseus-performance-libraries/llama_ci/llama.cpp/ggml-sycl.cpp:13858: false
ggml_sycl_op_dequantize_mul_mat_vec unsupported GGML_TYPE 16

ggml-sycl.cpp Outdated
Comment on lines 15176 to 15197
bool ggml_sycl_supports_mmq(enum ggml_type type) {
    // TODO: accuracy issues in MMQ
    return false;
}

Could you elaborate on what accuracy issues you are having with MMQ?

@airMeng (author):

master uses ggml_sycl_op_mul_mat_sycl for these 5 cases; you can try forcing MMQ to reproduce the accuracy issues (see the sketch after the list):

  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[1,1],nr=[1,1]): ggml_sycl_op_mul_mat_sycl
OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[1,1]): ggml_sycl_op_mul_mat_sycl
OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,1],nr=[2,1]): ggml_sycl_op_mul_mat_sycl
OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[1,1]): ggml_sycl_op_mul_mat_sycl
OK
  MUL_MAT(type_a=q4_0,type_b=f32,m=16,n=16,k=256,bs=[10,10],nr=[2,1]): ggml_sycl_op_mul_mat_sycl
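
One way to force the MMQ path for this experiment (a local debugging hack, not a proposed change) is to flip the stub quoted earlier:

// Local debugging hack: make the dispatcher always pick MMQ so the
// accuracy issues show up in test-backend-ops. Do not commit this.
bool ggml_sycl_supports_mmq(enum ggml_type type) {
    (void) type;  // accept every quantized type for the experiment
    return true;  // was: return false (TODO: accuracy issues in MMQ)
}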

@mofosyne added the Review Complexity : Medium label May 28, 2024
@airMeng force-pushed the sycl-gemm-dispatch branch from bcadd61 to bfed283 May 28, 2024 06:07
@airMeng (author) commented May 28, 2024

I am seeing a new test failure on Arc A770; is this to be expected?

  MUL_MAT(type_a=iq2_xxs,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,1]): GGML_ASSERT: /builds/perseus-performance-libraries/llama_ci/llama.cpp/ggml-sycl.cpp:13858: false
ggml_sycl_op_dequantize_mul_mat_vec unsupported GGML_TYPE 16

fixed

@airMeng airMeng requested a review from NeoZhangJianyu May 28, 2024 07:30
airMeng and others added 5 commits May 28, 2024 20:59
Co-authored-by: Neo Zhang Jianyu <[email protected]>
@NeoZhangJianyu (Collaborator) left a comment

Performance increases from 34 token/s to 37 token/s on Arc A770 with llama2-7b Q4.
It's good!

@airMeng airMeng merged commit b864b50 into master May 28, 2024
@airMeng airMeng deleted the sycl-gemm-dispatch branch May 30, 2024 01:56
@airMeng airMeng mentioned this pull request Jun 3, 2024