
[QST] Why is the U8 32x32 subgroup copy (for FP8 or 8-bit A in mixed-dtype GEMMs) faster with the VNNI layout than with the plain layout? #414

@sanchitintel

Since #374, we have a functional XE_2D_U8x32x32_LD_N copy atom, but it is implemented essentially the same way as XE_2D_U8x32x32_LD_V (i.e., it uses the VNNI-layout subgroup-level load compiler builtin), because the two copy atoms end up reading the same values.
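To make "reading the same values" concrete, here is a minimal host-side sketch of the u8 VNNI rearrangement, assuming the usual Xe convention of interleaving 4 consecutive rows of a tile into one 32-bit element per column (the function name is hypothetical, and this is only an illustration, not the device-side code path of the copy atoms):

```cpp
#include <cstdint>
#include <vector>

// plain: row-major rows x cols tile of uint8_t (rows assumed to be a multiple of 4).
// The VNNI ("packed") form carries exactly the same bytes, only rearranged so that
// byte (r, c) lands in byte lane (r % 4) of the 32-bit element for column c in
// row group r / 4. This is the rearrangement the _LD_V atom produces and the
// _LD_N atom skips.
std::vector<uint8_t> vnni_pack_u8(const std::vector<uint8_t>& plain,
                                  int rows, int cols) {
  std::vector<uint8_t> packed(plain.size());
  for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) {
      packed[(r / 4) * (cols * 4) + c * 4 + (r % 4)] = plain[r * cols + c];
    }
  }
  return packed;  // same bytes as `plain`, different arrangement
}
```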

However, since XE_2D_U16x32x32_LD_N is faster than XE_2D_U16x32x32_LD_V for loading the A-matrix tile of a GEMM, is it possible that the underlying compiler builtin originally used by XE_2D_U8x32x32_LD_N for experimentation in #374, viz. __builtin_IB_subgroup_block_read_flat_u8_m32k16v2, has a performance issue? That would explain why it was slower than its VNNI counterpart, even though the VNNI load is supposed to entail an additional transformation and should therefore have been the slower of the two.

As per the documentation, subgroup block loads for the plain layout should be faster, because the VNNI ones entail additional transformation overhead:

[Screenshot of the documentation describing plain vs. transformed (VNNI) 2D block loads]
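A quick back-of-envelope check (not taken from the repo) of why the documented expectation seems reasonable: both 32x32 u8 atoms move the same number of bytes per subgroup, so the only extra cost of the VNNI path should be the in-flight rearrangement rather than additional memory traffic. The subgroup size of 16 below is an assumption (the usual Xe value):

```cpp
#include <cstddef>

constexpr std::size_t kRows = 32, kCols = 32, kElemBytes = 1;  // u8 32x32 tile
constexpr std::size_t kSubgroupSize = 16;                      // assumed

constexpr std::size_t bytes_per_subgroup = kRows * kCols * kElemBytes;
constexpr std::size_t bytes_per_lane = bytes_per_subgroup / kSubgroupSize;

static_assert(bytes_per_subgroup == 1024, "identical traffic for _LD_N and _LD_V");
static_assert(bytes_per_lane == 64, "64 bytes per work-item either way");
```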

Thanks!

cc @rolandschulz @pengzhao-intel @cfgfung @yuankuns
