
[QST] Why is the U8 32x32 subgroup copy (for FP8 or 8-bit A in mixed-dtype GEMMs) faster with the VNNI layout than with the plain layout? #414

@sanchitintel

Since #374, we have a functional XE_2D_U8x32x32_LD_N copy atom, but it is implemented essentially the same way as XE_2D_U8x32x32_LD_V (i.e., it uses the VNNI-layout subgroup-level load compiler builtin), because the two copy atoms end up reading the same values.
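To make "reading the same values" concrete, here is a minimal host-side sketch of the u8 VNNI rearrangement, assuming the usual Xe convention of interleaving 4 consecutive rows of a tile into one 32-bit element per column (the function name is hypothetical, and this is only an illustration, not the device-side code path of the copy atoms):

```cpp
#include <cstdint>
#include <vector>

// plain: row-major rows x cols tile of uint8_t (rows assumed to be a multiple of 4).
// The VNNI ("packed") form carries exactly the same bytes, only rearranged so that
// byte (r, c) lands in byte lane (r % 4) of the 32-bit element for column c in
// row group r / 4. This is the rearrangement the _LD_V atom produces and the
// _LD_N atom skips.
std::vector<uint8_t> vnni_pack_u8(const std::vector<uint8_t>& plain,
                                  int rows, int cols) {
  std::vector<uint8_t> packed(plain.size());
  for (int r = 0; r < rows; ++r) {
    for (int c = 0; c < cols; ++c) {
      packed[(r / 4) * (cols * 4) + c * 4 + (r % 4)] = plain[r * cols + c];
    }
  }
  return packed;  // same bytes as `plain`, different arrangement
}
```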

However, since XE_2D_U16x32x32_LD_N is faster than XE_2D_U16x32x32_LD_V for loading the A-matrix tile of a GEMM, is it possible that the underlying compiler builtin originally used by XE_2D_U8x32x32_LD_N for experimentation in #374, viz. __builtin_IB_subgroup_block_read_flat_u8_m32k16v2, has a performance issue? That would explain why it was slower than its VNNI counterpart, even though the VNNI load is supposed to entail an additional transformation and should therefore have been the slower of the two.

As per the documentation, subgroup block loads for the plain layout should be faster, because the VNNI ones entail additional transformation overhead:

[Screenshot of the documentation describing plain vs. transformed (VNNI) 2D block loads]
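A quick back-of-envelope check (not taken from the repo) of why the documented expectation seems reasonable: both 32x32 u8 atoms move the same number of bytes per subgroup, so the only extra cost of the VNNI path should be the in-flight rearrangement rather than additional memory traffic. The subgroup size of 16 below is an assumption (the usual Xe value):

```cpp
#include <cstddef>

constexpr std::size_t kRows = 32, kCols = 32, kElemBytes = 1;  // u8 32x32 tile
constexpr std::size_t kSubgroupSize = 16;                      // assumed

constexpr std::size_t bytes_per_subgroup = kRows * kCols * kElemBytes;
constexpr std::size_t bytes_per_lane = bytes_per_subgroup / kSubgroupSize;

static_assert(bytes_per_subgroup == 1024, "identical traffic for _LD_N and _LD_V");
static_assert(bytes_per_lane == 64, "64 bytes per work-item either way");
```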

Thanks!

cc @rolandschulz @pengzhao-intel @cfgfung @yuankuns
