HIP: RDNA4 tensor core support for MMF #17077
base: master
Conversation
On RDNA the WMMA instructions do not, to my knowledge, increase peak FLOPS; they only reduce I/O and register usage.
Yes, sorry, that was a bug that I introduced.
Thank you for the tip. AFAIK the tensor cores on RDNA3 share the same silicon as the vector instructions, while RDNA4 redesigns the tensor cores to be more like CDNA. But it should at least not be slower than hipBLAS; I will spend some time finding the root cause. At least I know that the HIP compiler doesn't allocate registers very well.
Looking at the data layout I suspect the biggest problem has to do with shared memory bank conflicts, or whatever you would call them for AMD. For NVIDIA I chose the shared memory layout to be padded with 16 bytes because the dedicated …
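For context, a minimal sketch of the padding idea (a hypothetical 32x32 transpose tile, not the MMF layout; the actual kernels pad by 16 bytes for reasons tied to their vectorized loads, while the classic +1 element pad below just shows the mechanism):

// Hedged sketch of shared-memory padding. With 32 banks of 4 bytes, an unpadded float
// tile maps a whole column to a single bank; padding the row stride by one element
// spreads the column across all banks.
// Assumes n is a multiple of TILE_DIM and a (TILE_DIM, TILE_DIM) block size.
#define TILE_DIM 32

__global__ void transpose_tile(float * __restrict__ dst, const float * __restrict__ src, const int n) {
    __shared__ float tile[TILE_DIM][TILE_DIM + 1]; // +1 element of padding per row

    const int x = blockIdx.x*TILE_DIM + threadIdx.x;
    const int y = blockIdx.y*TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = src[y*n + x]; // coalesced global read, row-wise shared write

    __syncthreads();

    const int xt = blockIdx.y*TILE_DIM + threadIdx.x;
    const int yt = blockIdx.x*TILE_DIM + threadIdx.y;
    dst[yt*n + xt] = tile[threadIdx.x][threadIdx.y]; // column-wise shared read, conflict-free thanks to the padding
}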
Thank you for the tips. There is little info on AMD's bank layout; based on the limited documentation I have, RDNA3 has 32 banks in CU mode and 64 banks in WGP mode (WGP mode is the default), and the bank width is one DWORD. I don't have any documentation for RDNA4, so I assume it is similar, and I didn't change any of the code logic in mmf, only adapted the WMMA instructions. Based on the WMMA layout of RDNA4, I just keep the old ldmatrix logic and use a vectorized load in load_generic. Honestly, I'm not sure if shared memory bank conflicts are the root cause.
Are you aware of the AMD ISA documentation?
Honestly, not very much, as it isn't friendly to software developers. Based on the GEMM benchmarks of my modified NVIDIA CuTe for RDNA3, the bank layout is the same as on NVIDIA Ampere. I just had a look at the RDNA4 ISA: it's 32 banks in CU mode and 64 banks in WGP mode, and the bank width is 4 bytes.
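Using those numbers, the bank mapping can be reasoned about with a small helper (a sketch based on the figures above, not ggml code):

// Hedged sketch: which LDS bank a byte offset lands in, assuming 32 banks of 4 bytes
// (CU mode); the same formula applies with 64 banks in WGP mode.
constexpr int lds_bank(const int byte_offset, const int num_banks = 32, const int bank_width = 4) {
    return (byte_offset / bank_width) % num_banks;
}
// Example: with a row stride of 128 bytes, lds_bank(0) == lds_bank(128) == 0,
// so two lanes reading the same column of adjacent rows hit the same bank.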
I think I've found the root cause: mul_mat_f is slower than hipBLAS, while mul_mat_f_ids performs better than hipBLAS.
[MUL_MAT benchmark table]
[MUL_MAT_ID benchmark table]
How about this: for now we move towards merging this PR but only enable it for …
Thank you for the support; this is also what I'm thinking: just disable mul_mat_f on RDNA4 first and try to write an RDNA4-optimized version in the future. Also, I presume the HIP compiler generates better code on RDNA3 than on RDNA4; I will test on my 7900 XTX next week. Anyway, could you please review it first? One thing is that the HIP compiler cannot handle early-return code: it will still compile the code after the return. Also, I don't see a performance improvement with real models like llama3-8b-fp16 and deepseek-r1-8b-fp16 at batch sizes 1~16; it looks like I need to test with batch size 512, right?
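Regarding the early-return issue, the usual workaround is to exclude the unsupported path at compile time rather than with a runtime return; a hedged sketch with a hypothetical kernel (not the MMF code):

// Because an early "return;" does not stop the compiler from emitting code for
// everything below it, the unsupported path is excluded via if constexpr instead.
template <bool has_wmma>
__global__ void example_kernel(float * dst) {
    if constexpr (has_wmma) {
        // WMMA path: only instantiated when has_wmma == true.
        dst[threadIdx.x] = 1.0f;
    } else {
        // Fallback path: nothing here ends up in the has_wmma == true variant.
        dst[threadIdx.x] = 0.0f;
    }
}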
@zhang-hui-yulo can you tell me if and when you intend to work on FA support or better MMF performance? That would make it easier for me to schedule my own concurrent work to avoid conflicts.
Hello @JohannesGaessler, as I'm still not very familiar with llama.cpp's internal code, I think my schedule would be:
I will start them once this PR is approved. Also, I suggest putting FA on RDNA3 at low priority, as RDNA3 WMMA isn't suitable for GEMM fusion: you need shared memory to rearrange the layout of the D matrix of QK.
#if defined(GGML_USE_HIP) && defined(RDNA4) && !defined(GGML_HIP_NO_WMMA)
#define AMD_WMMA_AVAILABLE
#endif // defined(GGML_USE_HIP) && defined(RDNA4) && !defined(GGML_HIP_NO_WMMA)
I don't see the GGML_HIP_NO_WMMA macro being used anywhere on master. Did you add it for testing? Is it still needed?
    if constexpr (I == 16 && J == 16) {
        return 8 * (threadIdx.x / 16) + l;
    } else {
        static_assert(I == -1 && J == -1, "template specialization not implemented");
Suggested change:
-        static_assert(I == -1 && J == -1, "template specialization not implemented");
+        NO_DEVICE_CODE;
+        return -1;
I changed the logic to something like this on master, it should make development less tedious w.r.t. fighting the compiler.
    }
}
#elif defined(AMD_WMMA_AVAILABLE)
#if defined(RDNA4)
Suggested change:
-#if defined(RDNA4)
I would say to remove ifdefs like this and to maybe re-add them once RDNA3 is supported.
#if defined(RDNA4)
    // Special tile size to load <16, 8> as <16, 16> for half2 and __hip_bfloat162
    if constexpr (I == 16 && J == 8 && (std::is_same<T, half2>::value || std::is_same<T, nv_bfloat162>::value)) {
        constexpr int RDNA4_WMMA_MEM_N = 4;
        using TxN_t = __attribute__((ext_vector_type(RDNA4_WMMA_MEM_N))) int32_t;
        reinterpret_cast<TxN_t&>(t.x[0]) = reinterpret_cast<const TxN_t&>(xs0[t.get_i(0) * stride + t.get_j(0)]);
    } else {
        constexpr int RDNA4_WMMA_MEM_N = 8;
        using TxN_t = __attribute__((ext_vector_type(RDNA4_WMMA_MEM_N))) T;
        reinterpret_cast<TxN_t&>(t.x[0]) = reinterpret_cast<const TxN_t&>(xs0[t.get_i(0) * stride + t.get_j(0)]);
    }
If I read this code correctly you're just doing a byte copy of a given size. In that case, please use ggml_cuda_memcpy_1 from common.cuh.
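For the 16-byte branch above, that suggestion might look roughly like this (assuming the ggml_cuda_memcpy_1<nbytes>(dst, src) helper from common.cuh; the exact template parameters are an assumption):

// Hedged sketch: replace the ext_vector_type reinterpret_cast with a plain byte copy.
ggml_cuda_memcpy_1<4*sizeof(int32_t)>(&t.x[0], &xs0[t.get_i(0)*stride + t.get_j(0)]);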
#if !defined(GGML_USE_HIP)
            tile_xy[j0*tile_k_padded + threadIdx.x] = {tmp.x, tmp.y};
#else
            if constexpr (std::is_same<T, half2>::value) {
                tile_xy[j0*tile_k_padded + threadIdx.x] = __float22half2_rn(tmp);
            } else if constexpr (std::is_same<T, nv_bfloat162>::value) {
                tile_xy[j0*tile_k_padded + threadIdx.x] = __float22bfloat162_rn(tmp);
            } else {
                static_assert(0, "unsupported type");
            }
#endif // !defined(GGML_USE_HIP)
In convert.cuh there is a function ggml_cuda_cast for handling compilation failures due to type conversions in a consistent way. So far it only handles scalar types, please extend it to handle the conversion from float2 to half2/nv_bfloat162 as well.
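A possible shape for that extension, as a sketch only; the real ggml_cuda_cast in convert.cuh may be organized differently:

// Hedged sketch: extend the generic cast helper so the #ifdef block above reduces to
//     tile_xy[j0*tile_k_padded + threadIdx.x] = ggml_cuda_cast<T>(tmp);
template <typename dst_t, typename src_t>
static __device__ __forceinline__ dst_t ggml_cuda_cast(const src_t x) {
    if constexpr (std::is_same<src_t, float2>::value && std::is_same<dst_t, half2>::value) {
        return __float22half2_rn(x);
    } else if constexpr (std::is_same<src_t, float2>::value && std::is_same<dst_t, nv_bfloat162>::value) {
        return __float22bfloat162_rn(x);
    } else {
        return dst_t(x); // existing scalar conversions
    }
}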
#if defined(AMD_WMMA_AVAILABLE)
    typedef tile<16, 8, T> tile_B;
#else
    typedef tile< 8, 8, T> tile_B;
#endif // defined(AMD_WMMA_AVAILABLE)
The macro AMD_WMMA_AVAILABLE is only for device code. In host code, always use the function amd_wmma_available.
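For reference, the host-side counterpart would look something like this (the helper name comes from the comment above; the GGML_CUDA_CC_IS_RDNA4 condition is an assumption modeled on the other *_available helpers in common.cuh):

// Hedged sketch of a host-side capability check, analogous to amd_mfma_available.
static bool amd_wmma_available(const int cc) {
    return GGML_CUDA_CC_IS_RDNA4(cc);
}
// Host code then branches on the device's compute capability, e.g. when choosing
// between the 16x8 and 8x8 tile_B shapes, instead of testing the device-only macro.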
// This kernel is for larger batch sizes of mul_mat_id
template <typename T, int rows_per_block, int cols_per_block, int nwarps>
__launch_bounds__(ggml_cuda_get_physical_warp_size()*nwarps, 1)
static __global__ void mul_mat_f_ids(
Remove the static asserts from mma.cuh as I did for NVIDIA, then you no longer need to wrap the actual kernel in an if constexpr.
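Putting the earlier suggestion together, a sketch of the resulting accessor inside tile<I, J, T> (illustrative only):

// Hedged sketch: the unsupported tile shape hits NO_DEVICE_CODE instead of failing a
// static_assert at compile time, so kernels like mul_mat_f_ids can be instantiated
// without wrapping their body in an if constexpr.
static __device__ __forceinline__ int get_i(const int l) {
    if constexpr (I == 16 && J == 16) {
        return 8 * (threadIdx.x / 16) + l;
    } else {
        NO_DEVICE_CODE;
        return -1;
    }
}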
@zhang-hui-yulo as it turns out I'll need to touch the MMA FA kernel in the near future regardless of additional hardware support, so I'd suggest we do it like this: first I make some changes to the MMA FA kernel, during which I'll also add Volta support. Afterwards you can add AMD WMMA support, with the Volta implementation serving as a checklist for where in the code changes are necessary due to the different data layout.
Add RDNA4 tensor core support for MMF; honestly, the performance is lower than expected. The model is at https://huggingface.co/Mungert/DeepSeek-R1-0528-Qwen3-8B-GGUF
@JohannesGaessler, it looks like #16988 changes mmf.cu to
then native MMF won't be exercised on my RDNA4; it always uses the hipBLAS path.