
CUDA: fix crash with partial offloading of MoE #13439


Merged

Conversation

JohannesGaessler
Collaborator

Fixes #13437.

There is an edge case for clearing the padding that I did not consider. I don't think this edge case can be sensibly handled other than by falling back to cuBLAS, which does not require any padding.

@github-actions bot added labels: Nvidia GPU (Issues specific to Nvidia GPUs), ggml (changes relating to the ggml tensor library for machine learning) — May 10, 2025
@slaren
Member

slaren commented May 10, 2025

What is the view? I am not able to reproduce the linked issue, and in the qwen3moe graph I don't see any views (of weights).

@slaren
Member

slaren commented May 10, 2025

I was able to reproduce it by building for a lower cc. There is no view in the graph, this is caused by the slice created in ggml_cuda_mul_mat_id. This also fixes it:

diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 7643c4b8b..f8a4e7216 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -2071,8 +2071,8 @@ static void ggml_cuda_mul_mat_id(ggml_backend_cuda_context & ctx, ggml_tensor *
         ggml_tensor src0_slice = *src0;
         src0_slice.ne[2]    = 1;
         src0_slice.nb[3]    = src0_slice.nb[2];
-        src0_slice.op       = GGML_OP_VIEW;
-        src0_slice.view_src = dst->src[0]; // non-const pointer to src0
+        src0_slice.op       = GGML_OP_NONE;
+        src0_slice.view_src = nullptr;
         src0_slice.data     = (char *) src0->data + i02*nb02;

         ggml_tensor src1_slice;

@JohannesGaessler
Collaborator Author

It's correct that an assert is being triggered because clearing the supposed padding would actually be unsafe in that situation. It's just that I did not correctly consider that the effective number of tokens for the slice of src0 could fall into a range where the code in ggml_cuda_mul_mat could decide to use MMVQ or MMQ after all.

@JohannesGaessler force-pushed the cuda-fix-partial-mmid branch from 2e93643 to d5d74a2 on May 10, 2025 21:58
@JohannesGaessler
Collaborator Author

I relaxed the condition a bit because for src0->ne[0] % MATRIX_ROW_PADDING == 0 it would still be fine to use MMVQ or MMQ.

@slaren
Member

slaren commented May 10, 2025

This doesn't fix the issue because the assert is still triggered. IMO the correct solution here would be to clear the padding of the view_src, but only if actually necessary.

@JohannesGaessler force-pushed the cuda-fix-partial-mmid branch from d5d74a2 to 4bc8f75 on May 10, 2025 22:18
@JohannesGaessler
Collaborator Author

I think the asserts were also in the wrong place: there were cases where they were being triggered even when cudaMemsetAsync was not called.

@JohannesGaessler merged commit 7474e00 into ggml-org:master May 11, 2025
44 checks passed
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request May 14, 2025

Successfully merging this pull request may close these issues.

Eval bug: mmvq.cu:519: GGML_ASSERT(!src0->view_src) failed
2 participants