Following #374, we have a functional XE_2D_U8x32x32_LD_N copy atom, but it is implemented essentially the same way as XE_2D_U8x32x32_LD_V (i.e., it uses the VNNI-layout subgroup-level load compiler builtin), because the two copy atoms end up reading the same values.
However, XE_2D_U16x32x32_LD_N is faster than XE_2D_U16x32x32_LD_V for loading the A-matrix tile of a GEMM. Is it possible, then, that the compiler builtin initially used by XE_2D_U8x32x32_LD_N for experimentation in #374, viz. __builtin_IB_subgroup_block_read_flat_u8_m32k16v2, has a performance issue? That would explain why it turned out slower than its VNNI counterpart, even though the VNNI variant is supposed to entail a transformation overhead and should therefore have been the slower of the two.
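For reference, a minimal sketch of what a plain-layout implementation built on that builtin might look like is shown below. The builtin name is the one from #374, but its signature, the return vector type, the `coord_t` definition, and the `XE_2D_U8x32x32_LD_N_plain` wrapper are all assumptions modeled on the general pattern of the Xe 2D block-load builtins, not the actual declarations in the repository.

```cpp
// Hypothetical sketch only: the builtin name comes from #374, but the exact
// signature, return type, and this wrapper struct are assumptions, not the
// actual cutlass-sycl code.
typedef unsigned short ushort32_t __attribute__((ext_vector_type(32))); // assumed: 64 bytes per work-item
                                                                         // (32x32 u8 tile, subgroup size 16)
typedef int coord_t __attribute__((ext_vector_type(2)));                // assumed (x, y) block coordinate

extern "C" ushort32_t __builtin_IB_subgroup_block_read_flat_u8_m32k16v2(
    long base, int width_minus_one, int height_minus_one, int pitch_minus_one,
    coord_t coord); // assumed signature

struct XE_2D_U8x32x32_LD_N_plain { // hypothetical name for a plain-layout variant
  template <class T>
  static void copy(const void *base, int width, int height, int pitch,
                   coord_t coord, T *dst) {
    // Plain (non-VNNI) block read: the 32x32 u8 tile is delivered row-major,
    // with no VNNI repacking performed by the load itself.
    *reinterpret_cast<ushort32_t *>(dst) =
        __builtin_IB_subgroup_block_read_flat_u8_m32k16v2(
            reinterpret_cast<long>(base), width - 1, height - 1, pitch - 1,
            coord);
  }
};
```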
As per the documentation, subgroup copies for the plain layout should be faster, because the VNNI ones entail an additional transformation overhead: