
[AMDGPU] Fold multiple aligned v_mov_b32 to v_mov_b64 on gfx942 #138843


Open: wants to merge 2 commits into main

Conversation

JanekvO (Contributor) commented May 7, 2025

Follow-up to the previously abandoned #128017 and the pre-commit test landed in #129703.
Use v_mov_b64 where possible by folding REG_SEQUENCEs whose operands are immediates materialized with v_mov_b32. Some superfluous v_mov_b32 materializations remain that would previously have been combined by MachineCSE.
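For illustration, a minimal IR input of the kind this fold targets (a sketch, not a test from the patch; the function name is made up). Compiled for gfx942, the zeroinitializer is materialized as two v_mov_b32 feeding a REG_SEQUENCE, which this patch folds into a single v_mov_b64:

; Build with: llc -mtriple=amdgcn -mcpu=gfx942
define void @store_zero_v2i32(ptr addrspace(1) %p) {
  store <2 x i32> zeroinitializer, ptr addrspace(1) %p
  ret void
}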

llvmbot (Member) commented May 7, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Janek van Oirschot (JanekvO)

Changes

Follow-up to the previously abandoned #128017 and the pre-commit test landed in #129703.
Use v_mov_b64 where possible by folding REG_SEQUENCEs whose operands are immediates materialized with v_mov_b32. Some superfluous v_mov_b32 materializations remain that would previously have been combined by MachineCSE.


Patch is 29.10 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/138843.diff

8 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIFoldOperands.cpp (+99-3)
  • (modified) llvm/test/CodeGen/AMDGPU/flat-scratch.ll (+4-8)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx90a.ll (+20-10)
  • (modified) llvm/test/CodeGen/AMDGPU/masked-load-vectortypes.ll (+37-50)
  • (modified) llvm/test/CodeGen/AMDGPU/mfma-loop.ll (+33-32)
  • (added) llvm/test/CodeGen/AMDGPU/si-fold-operands-subreg-imm-gfx942.mir (+99)
  • (modified) llvm/test/CodeGen/AMDGPU/smfmac_no_agprs.ll (+11-14)
  • (modified) llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll (+2-3)
diff --git a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
index 66e674949c047..e38975b82d6e7 100644
--- a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
@@ -147,6 +147,7 @@ class SIFoldOperandsImpl {
   std::pair<const MachineOperand *, int> isOMod(const MachineInstr &MI) const;
   bool tryFoldOMod(MachineInstr &MI);
   bool tryFoldRegSequence(MachineInstr &MI);
+  bool tryFoldImmRegSequence(MachineInstr &MI);
   bool tryFoldPhiAGPR(MachineInstr &MI);
   bool tryFoldLoad(MachineInstr &MI);
 
@@ -2127,6 +2128,99 @@ bool SIFoldOperandsImpl::tryFoldOMod(MachineInstr &MI) {
   return true;
 }
 
+// gfx942+ can use V_MOV_B64 for materializing constant immediates.
+// For example:
+// %0:vgpr_32 = V_MOV_B32 0, implicit $exec
+// %1:vreg_64_align2 = REG_SEQUENCE %0, %subreg.sub0, %0, %subreg.sub1
+//  ->
+// %1:vreg_64_align2 = V_MOV_B64_PSEUDO 0, implicit $exec
+bool SIFoldOperandsImpl::tryFoldImmRegSequence(MachineInstr &MI) {
+  assert(MI.isRegSequence());
+  auto Reg = MI.getOperand(0).getReg();
+  const TargetRegisterClass *DefRC = MRI->getRegClass(Reg);
+
+  if (!ST->hasMovB64() || !TRI->isVGPR(*MRI, Reg) ||
+      !MRI->hasOneNonDBGUse(Reg) || !TRI->isProperlyAlignedRC(*DefRC))
+    return false;
+
+  SmallVector<std::pair<MachineOperand *, unsigned>, 32> Defs;
+  if (!getRegSeqInit(Defs, Reg, MCOI::OPERAND_REGISTER))
+    return false;
+
+  // Only attempting to fold immediate materializations.
+  if (!Defs.empty() &&
+      !std::all_of(Defs.begin(), Defs.end(),
+                   [](const std::pair<MachineOperand *, unsigned> &Op) {
+                     return Op.first->isImm();
+                   }))
+    return false;
+
+  SmallVector<uint64_t, 8> ImmVals;
+  uint64_t ImmVal = 0;
+  uint64_t ImmSize = 0;
+  for (unsigned i = 0; i < Defs.size(); ++i) {
+    auto &[Op, SubIdx] = Defs[i];
+    unsigned SubRegSize = TRI->getSubRegIdxSize(SubIdx);
+    unsigned Shift = (TRI->getChannelFromSubReg(SubIdx) % 2) * SubRegSize;
+    ImmSize += SubRegSize;
+    ImmVal |= Op->getImm() << Shift;
+
+    if (ImmSize > 64 || SubRegSize == 64)
+      return false;
+
+    if (ImmSize == 64) {
+      // Only 32 bit literals can be encoded.
+      if (!isUInt<32>(ImmVal))
+        return false;
+      ImmVals.push_back(ImmVal);
+      ImmVal = 0;
+      ImmSize = 0;
+    }
+  }
+
+  assert(ImmVals.size() > 0 &&
+         "REG_SEQUENCE should have at least 1 operand pair");
+
+  // Can only combine REG_SEQUENCE into one 64b immediate materialization mov.
+  if (DefRC == TRI->getVGPR64Class()) {
+    BuildMI(*MI.getParent(), MI, MI.getDebugLoc(),
+            TII->get(AMDGPU::V_MOV_B64_PSEUDO), Reg)
+        .addImm(ImmVals[0]);
+    MI.eraseFromParent();
+    return true;
+  }
+
+  if (ImmVals.size() == 1)
+    return false;
+
+  // Can't bail from here on out: modifying the MI.
+
+  // Remove source operands.
+  for (unsigned i = MI.getNumOperands() - 1; i > 0; --i)
+    MI.removeOperand(i);
+
+  for (unsigned i = 0; i < ImmVals.size(); ++i) {
+    const TargetRegisterClass *RC = TRI->getVGPR64Class();
+    auto MovReg = MRI->createVirtualRegister(RC);
+    // Duplicate vmov imm materializations (e.g., splatted operands) should get
+    // combined by MachineCSE pass.
+    BuildMI(*MI.getParent(), MI, MI.getDebugLoc(),
+            TII->get(AMDGPU::V_MOV_B64_PSEUDO), MovReg)
+        .addImm(ImmVals[i]);
+
+    // 2 subregs with no overlap (i.e., sub0_sub1, sub2_sub3, etc.).
+    unsigned SubReg64B =
+        SIRegisterInfo::getSubRegFromChannel(/*Channel=*/i * 2, /*SubRegs=*/2);
+
+    MI.addOperand(MachineOperand::CreateReg(MovReg, /*isDef=*/false));
+    MI.addOperand(MachineOperand::CreateImm(SubReg64B));
+  }
+
+  LLVM_DEBUG(dbgs() << "Folded into " << MI);
+
+  return true;
+}
+
 // Try to fold a reg_sequence with vgpr output and agpr inputs into an
 // instruction which can take an agpr. So far that means a store.
 bool SIFoldOperandsImpl::tryFoldRegSequence(MachineInstr &MI) {
@@ -2556,9 +2650,11 @@ bool SIFoldOperandsImpl::run(MachineFunction &MF) {
         continue;
       }
 
-      if (MI.isRegSequence() && tryFoldRegSequence(MI)) {
-        Changed = true;
-        continue;
+      if (MI.isRegSequence()) {
+        if (tryFoldImmRegSequence(MI) || tryFoldRegSequence(MI)) {
+          Changed = true;
+          continue;
+        }
       }
 
       if (MI.isPHI() && tryFoldPhiAGPR(MI)) {
diff --git a/llvm/test/CodeGen/AMDGPU/flat-scratch.ll b/llvm/test/CodeGen/AMDGPU/flat-scratch.ll
index b5e579b78a59c..a43d9657cfb24 100644
--- a/llvm/test/CodeGen/AMDGPU/flat-scratch.ll
+++ b/llvm/test/CodeGen/AMDGPU/flat-scratch.ll
@@ -4139,8 +4139,7 @@ define void @store_load_i64_aligned(ptr addrspace(5) nocapture %arg) {
 ; GFX942-LABEL: store_load_i64_aligned:
 ; GFX942:       ; %bb.0: ; %bb
 ; GFX942-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX942-NEXT:    v_mov_b32_e32 v2, 15
-; GFX942-NEXT:    v_mov_b32_e32 v3, 0
+; GFX942-NEXT:    v_mov_b64_e32 v[2:3], 15
 ; GFX942-NEXT:    scratch_store_dwordx2 v0, v[2:3], off sc0 sc1
 ; GFX942-NEXT:    s_waitcnt vmcnt(0)
 ; GFX942-NEXT:    scratch_load_dwordx2 v[0:1], v0, off sc0 sc1
@@ -4250,8 +4249,7 @@ define void @store_load_i64_unaligned(ptr addrspace(5) nocapture %arg) {
 ; GFX942-LABEL: store_load_i64_unaligned:
 ; GFX942:       ; %bb.0: ; %bb
 ; GFX942-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX942-NEXT:    v_mov_b32_e32 v2, 15
-; GFX942-NEXT:    v_mov_b32_e32 v3, 0
+; GFX942-NEXT:    v_mov_b64_e32 v[2:3], 15
 ; GFX942-NEXT:    scratch_store_dwordx2 v0, v[2:3], off sc0 sc1
 ; GFX942-NEXT:    s_waitcnt vmcnt(0)
 ; GFX942-NEXT:    scratch_load_dwordx2 v[0:1], v0, off sc0 sc1
@@ -5010,10 +5008,8 @@ define amdgpu_ps void @large_offset() {
 ;
 ; GFX942-LABEL: large_offset:
 ; GFX942:       ; %bb.0: ; %bb
-; GFX942-NEXT:    v_mov_b32_e32 v0, 0
-; GFX942-NEXT:    v_mov_b32_e32 v1, v0
-; GFX942-NEXT:    v_mov_b32_e32 v2, v0
-; GFX942-NEXT:    v_mov_b32_e32 v3, v0
+; GFX942-NEXT:    v_mov_b64_e32 v[0:1], 0
+; GFX942-NEXT:    v_mov_b64_e32 v[2:3], v[0:1]
 ; GFX942-NEXT:    scratch_store_dwordx4 off, v[0:3], off offset:3024 sc0 sc1
 ; GFX942-NEXT:    s_waitcnt vmcnt(0)
 ; GFX942-NEXT:    scratch_load_dwordx4 v[0:3], off, off offset:3024 sc0 sc1
diff --git a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx90a.ll b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx90a.ll
index 352e5eecd7bfe..783f4d3fdaae9 100644
--- a/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx90a.ll
+++ b/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.mfma.gfx90a.ll
@@ -13,8 +13,10 @@ declare i32 @llvm.amdgcn.workitem.id.x()
 ; GCN-LABEL: {{^}}test_mfma_f32_32x32x4bf16_1k:
 ; GCN-DAG:     s_load_dwordx16
 ; GCN-DAG:     s_load_dwordx16
-; GCN-DAG:     v_mov_b32_e32 v[[TWO:[0-9]+]], 2
-; GCN-DAG:     v_mov_b32_e32 v[[ONE:[0-9]+]], 1
+; GFX90A-DAG:  v_mov_b32_e32 v[[TWO:[0-9]+]], 2
+; GFX90A-DAG:  v_mov_b32_e32 v[[ONE:[0-9]+]], 1
+; GFX942-DAG:  v_mov_b64_e32 v[[[ONE:[0-9]+]]:{{[0-9]+}}], 1
+; GFX942-DAG:  v_mov_b64_e32 v[[[TWO:[0-9]+]]:{{[0-9]+}}], 2
 ; GCN-COUNT-32: v_accvgpr_write_b32 a{{[0-9]+}}, s{{[0-9]+}}
 ; GFX90A:      v_mfma_f32_32x32x4bf16_1k a[{{[0-9]+:[0-9]+}}], v[[[ONE]]:{{[0-9]+}}], v[[[TWO]]:{{[0-9]+}}], a[{{[0-9]+:[0-9]+}}] cbsz:1 abid:2 blgp:3
 ; GFX942:      v_mfma_f32_32x32x4_2b_bf16 a[{{[0-9]+:[0-9]+}}], v[[[ONE]]:{{[0-9+]}}], v[[[TWO]]:{{[0-9+]}}], a[{{[0-9]+:[0-9]+}}] cbsz:1 abid:2 blgp:3
@@ -32,8 +34,10 @@ bb:
 
 ; GCN-LABEL: {{^}}test_mfma_f32_16x16x4bf16_1k:
 ; GCN-DAG:      s_load_dwordx16
-; GCN-DAG:      v_mov_b32_e32 v[[TWO:[0-9]+]], 2
-; GCN-DAG:      v_mov_b32_e32 v[[ONE:[0-9]+]], 1
+; GFX90A-DAG:   v_mov_b32_e32 v[[TWO:[0-9]+]], 2
+; GFX90A-DAG:   v_mov_b32_e32 v[[ONE:[0-9]+]], 1
+; GFX942-DAG:   v_mov_b64_e32 v[[[ONE:[0-9]+]]:{{[0-9]+}}], 1
+; GFX942-DAG:   v_mov_b64_e32 v[[[TWO:[0-9]+]]:{{[0-9]+}}], 2
 ; GCN-COUNT-16: v_accvgpr_write_b32 a{{[0-9]+}}, s{{[0-9]+}}
 ; GFX90A:       v_mfma_f32_16x16x4bf16_1k a[{{[0-9]+:[0-9]+}}], v[[[ONE]]:{{[0-9]+}}], v[[[TWO]]:{{[0-9]+}}], a[{{[0-9]+:[0-9]+}}] cbsz:1 abid:2 blgp:3
 ; GFX942:       v_mfma_f32_16x16x4_4b_bf16 a[{{[0-9]+:[0-9]+}}], v[[[ONE]]:{{[0-9+]}}], v[[[TWO]]:{{[0-9+]}}], a[{{[0-9]+:[0-9]+}}] cbsz:1 abid:2 blgp:3
@@ -51,8 +55,10 @@ bb:
 
 ; GCN-LABEL: {{^}}test_mfma_f32_4x4x4bf16_1k:
 ; GCN-DAG:     s_load_dwordx4
-; GCN-DAG:     v_mov_b32_e32 v[[TWO:[0-9]+]], 2
-; GCN-DAG:     v_mov_b32_e32 v[[ONE:[0-9]+]], 1
+; GFX90A-DAG:  v_mov_b32_e32 v[[TWO:[0-9]+]], 2
+; GFX90A-DAG:  v_mov_b32_e32 v[[ONE:[0-9]+]], 1
+; GFX942-DAG:  v_mov_b64_e32 v[[[ONE:[0-9]+]]:{{[0-9]+}}], 1
+; GFX942-DAG:  v_mov_b64_e32 v[[[TWO:[0-9]+]]:{{[0-9]+}}], 2
 ; GCN-COUNT-4: v_accvgpr_write_b32 a{{[0-9]+}}, s{{[0-9]+}}
 ; GFX90A:      v_mfma_f32_4x4x4bf16_1k [[RES:a\[[0-9]+:[0-9]+\]]], v[[[ONE]]:{{[0-9]+}}], v[[[TWO]]:{{[0-9]+}}], a[{{[0-9]+:[0-9]+}}] cbsz:1 abid:2 blgp:3
 ; GFX942:      v_mfma_f32_4x4x4_16b_bf16 [[RES:a\[[0-9]+:[0-9]+\]]], v[[[ONE]]:{{[0-9+]}}], v[[[TWO]]:{{[0-9+]}}], a[{{[0-9]+:[0-9]+}}] cbsz:1 abid:2 blgp:3
@@ -70,8 +76,10 @@ bb:
 
 ; GCN-LABEL: {{^}}test_mfma_f32_32x32x8bf16_1k:
 ; GCN-DAG:      s_load_dwordx16
-; GCN-DAG:      v_mov_b32_e32 v[[TWO:[0-9]+]], 2
-; GCN-DAG:      v_mov_b32_e32 v[[ONE:[0-9]+]], 1
+; GFX90A-DAG:   v_mov_b32_e32 v[[TWO:[0-9]+]], 2
+; GFX90A-DAG:   v_mov_b32_e32 v[[ONE:[0-9]+]], 1
+; GFX942-DAG:   v_mov_b64_e32 v[[[ONE:[0-9]+]]:{{[0-9]+}}], 1
+; GFX942-DAG:   v_mov_b64_e32 v[[[TWO:[0-9]+]]:{{[0-9]+}}], 2
 ; GCN-COUNT-16: v_accvgpr_write_b32 a{{[0-9]+}}, s{{[0-9]+}}
 ; GFX90A:       v_mfma_f32_32x32x8bf16_1k a[{{[0-9]+:[0-9]+}}], v[[[ONE]]:{{[0-9]+}}], v[[[TWO]]:{{[0-9]+}}], a[{{[0-9]+:[0-9]+}}] cbsz:1 abid:2 blgp:3
 ; GFX942:       v_mfma_f32_32x32x8_bf16 a[{{[0-9]+:[0-9]+}}], v[[[ONE]]:{{[0-9+]}}], v[[[TWO]]:{{[0-9+]}}], a[{{[0-9]+:[0-9]+}}] cbsz:1 abid:2 blgp:3
@@ -89,8 +97,10 @@ bb:
 
 ; GCN-LABEL: {{^}}test_mfma_f32_16x16x16bf16_1k:
 ; GCN-DAG:     s_load_dwordx4
-; GCN-DAG:     v_mov_b32_e32 v[[TWO:[0-9]+]], 2
-; GCN-DAG:     v_mov_b32_e32 v[[ONE:[0-9]+]], 1
+; GFX90A-DAG:     v_mov_b32_e32 v[[TWO:[0-9]+]], 2
+; GFX90A-DAG:     v_mov_b32_e32 v[[ONE:[0-9]+]], 1
+; GFX942-DAG:   v_mov_b64_e32 v[[[ONE:[0-9]+]]:{{[0-9]+}}], 1
+; GFX942-DAG:   v_mov_b64_e32 v[[[TWO:[0-9]+]]:{{[0-9]+}}], 2
 ; GCN-COUNT-4: v_accvgpr_write_b32 a{{[0-9]+}}, s{{[0-9]+}}
 ; GFX90A:      v_mfma_f32_16x16x16bf16_1k [[RES:a\[[0-9]+:[0-9]+\]]], v[[[ONE]]:{{[0-9]+}}], v[[[TWO]]:{{[0-9]+}}], a[{{[0-9]+:[0-9]+}}] cbsz:1 abid:2 blgp:3
 ; GFX942:      v_mfma_f32_16x16x16_bf16 [[RES:a\[[0-9]+:[0-9]+\]]], v[[[ONE]]:{{[0-9+]}}], v[[[TWO]]:{{[0-9+]}}], a[{{[0-9]+:[0-9]+}}] cbsz:1 abid:2 blgp:3
diff --git a/llvm/test/CodeGen/AMDGPU/masked-load-vectortypes.ll b/llvm/test/CodeGen/AMDGPU/masked-load-vectortypes.ll
index 3b855a56a5abb..e076afca370ef 100644
--- a/llvm/test/CodeGen/AMDGPU/masked-load-vectortypes.ll
+++ b/llvm/test/CodeGen/AMDGPU/masked-load-vectortypes.ll
@@ -7,12 +7,12 @@ define <2 x i32> @uniform_masked_load_ptr1_mask_v2i32(ptr addrspace(1) inreg noc
 ; GFX942-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX942-NEXT:    v_and_b32_e32 v0, 1, v0
 ; GFX942-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v0
-; GFX942-NEXT:    v_mov_b32_e32 v0, 0
-; GFX942-NEXT:    v_mov_b32_e32 v1, v0
+; GFX942-NEXT:    v_mov_b32_e32 v2, 0
+; GFX942-NEXT:    v_mov_b64_e32 v[0:1], 0
 ; GFX942-NEXT:    s_and_saveexec_b64 s[2:3], vcc
 ; GFX942-NEXT:    s_cbranch_execz .LBB0_2
 ; GFX942-NEXT:  ; %bb.1: ; %cond.load
-; GFX942-NEXT:    global_load_dwordx2 v[0:1], v0, s[0:1]
+; GFX942-NEXT:    global_load_dwordx2 v[0:1], v2, s[0:1]
 ; GFX942-NEXT:  .LBB0_2:
 ; GFX942-NEXT:    s_or_b64 exec, exec, s[2:3]
 ; GFX942-NEXT:    s_waitcnt vmcnt(0)
@@ -30,14 +30,13 @@ define <4 x i32> @uniform_masked_load_ptr1_mask_v4i32(ptr addrspace(1) inreg noc
 ; GFX942-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX942-NEXT:    v_and_b32_e32 v0, 1, v0
 ; GFX942-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v0
-; GFX942-NEXT:    v_mov_b32_e32 v0, 0
-; GFX942-NEXT:    v_mov_b32_e32 v1, v0
-; GFX942-NEXT:    v_mov_b32_e32 v2, v0
-; GFX942-NEXT:    v_mov_b32_e32 v3, v0
+; GFX942-NEXT:    v_mov_b64_e32 v[0:1], 0
+; GFX942-NEXT:    v_mov_b32_e32 v4, 0
+; GFX942-NEXT:    v_mov_b64_e32 v[2:3], v[0:1]
 ; GFX942-NEXT:    s_and_saveexec_b64 s[2:3], vcc
 ; GFX942-NEXT:    s_cbranch_execz .LBB1_2
 ; GFX942-NEXT:  ; %bb.1: ; %cond.load
-; GFX942-NEXT:    global_load_dwordx4 v[0:3], v0, s[0:1]
+; GFX942-NEXT:    global_load_dwordx4 v[0:3], v4, s[0:1]
 ; GFX942-NEXT:  .LBB1_2:
 ; GFX942-NEXT:    s_or_b64 exec, exec, s[2:3]
 ; GFX942-NEXT:    s_waitcnt vmcnt(0)
@@ -55,14 +54,13 @@ define <4 x float> @uniform_masked_load_ptr1_mask_v4f32(ptr addrspace(1) inreg n
 ; GFX942-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX942-NEXT:    v_and_b32_e32 v0, 1, v0
 ; GFX942-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v0
-; GFX942-NEXT:    v_mov_b32_e32 v0, 0
-; GFX942-NEXT:    v_mov_b32_e32 v1, v0
-; GFX942-NEXT:    v_mov_b32_e32 v2, v0
-; GFX942-NEXT:    v_mov_b32_e32 v3, v0
+; GFX942-NEXT:    v_mov_b64_e32 v[0:1], 0
+; GFX942-NEXT:    v_mov_b32_e32 v4, 0
+; GFX942-NEXT:    v_mov_b64_e32 v[2:3], v[0:1]
 ; GFX942-NEXT:    s_and_saveexec_b64 s[2:3], vcc
 ; GFX942-NEXT:    s_cbranch_execz .LBB2_2
 ; GFX942-NEXT:  ; %bb.1: ; %cond.load
-; GFX942-NEXT:    global_load_dwordx4 v[0:3], v0, s[0:1]
+; GFX942-NEXT:    global_load_dwordx4 v[0:3], v4, s[0:1]
 ; GFX942-NEXT:  .LBB2_2:
 ; GFX942-NEXT:    s_or_b64 exec, exec, s[2:3]
 ; GFX942-NEXT:    s_waitcnt vmcnt(0)
@@ -80,20 +78,16 @@ define <8 x i32> @uniform_masked_load_ptr1_mask_v8i32(ptr addrspace(1) inreg noc
 ; GFX942-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX942-NEXT:    v_and_b32_e32 v0, 1, v0
 ; GFX942-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v0
-; GFX942-NEXT:    v_mov_b32_e32 v0, 0
-; GFX942-NEXT:    v_mov_b32_e32 v1, v0
-; GFX942-NEXT:    v_mov_b32_e32 v2, v0
-; GFX942-NEXT:    v_mov_b32_e32 v3, v0
-; GFX942-NEXT:    v_mov_b32_e32 v4, v0
-; GFX942-NEXT:    v_mov_b32_e32 v5, v0
-; GFX942-NEXT:    v_mov_b32_e32 v6, v0
-; GFX942-NEXT:    v_mov_b32_e32 v7, v0
+; GFX942-NEXT:    v_mov_b64_e32 v[0:1], 0
+; GFX942-NEXT:    v_mov_b32_e32 v8, 0
+; GFX942-NEXT:    v_mov_b64_e32 v[2:3], v[0:1]
+; GFX942-NEXT:    v_mov_b64_e32 v[4:5], v[0:1]
+; GFX942-NEXT:    v_mov_b64_e32 v[6:7], v[0:1]
 ; GFX942-NEXT:    s_and_saveexec_b64 s[2:3], vcc
 ; GFX942-NEXT:    s_cbranch_execz .LBB3_2
 ; GFX942-NEXT:  ; %bb.1: ; %cond.load
-; GFX942-NEXT:    global_load_dwordx4 v[4:7], v0, s[0:1] offset:16
-; GFX942-NEXT:    s_nop 0
-; GFX942-NEXT:    global_load_dwordx4 v[0:3], v0, s[0:1]
+; GFX942-NEXT:    global_load_dwordx4 v[4:7], v8, s[0:1] offset:16
+; GFX942-NEXT:    global_load_dwordx4 v[0:3], v8, s[0:1]
 ; GFX942-NEXT:  .LBB3_2:
 ; GFX942-NEXT:    s_or_b64 exec, exec, s[2:3]
 ; GFX942-NEXT:    s_waitcnt vmcnt(0)
@@ -111,20 +105,16 @@ define <8 x float> @uniform_masked_load_ptr1_mask_v8f32(ptr addrspace(1) inreg n
 ; GFX942-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX942-NEXT:    v_and_b32_e32 v0, 1, v0
 ; GFX942-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v0
-; GFX942-NEXT:    v_mov_b32_e32 v0, 0
-; GFX942-NEXT:    v_mov_b32_e32 v1, v0
-; GFX942-NEXT:    v_mov_b32_e32 v2, v0
-; GFX942-NEXT:    v_mov_b32_e32 v3, v0
-; GFX942-NEXT:    v_mov_b32_e32 v4, v0
-; GFX942-NEXT:    v_mov_b32_e32 v5, v0
-; GFX942-NEXT:    v_mov_b32_e32 v6, v0
-; GFX942-NEXT:    v_mov_b32_e32 v7, v0
+; GFX942-NEXT:    v_mov_b64_e32 v[0:1], 0
+; GFX942-NEXT:    v_mov_b32_e32 v8, 0
+; GFX942-NEXT:    v_mov_b64_e32 v[2:3], v[0:1]
+; GFX942-NEXT:    v_mov_b64_e32 v[4:5], v[0:1]
+; GFX942-NEXT:    v_mov_b64_e32 v[6:7], v[0:1]
 ; GFX942-NEXT:    s_and_saveexec_b64 s[2:3], vcc
 ; GFX942-NEXT:    s_cbranch_execz .LBB4_2
 ; GFX942-NEXT:  ; %bb.1: ; %cond.load
-; GFX942-NEXT:    global_load_dwordx4 v[4:7], v0, s[0:1] offset:16
-; GFX942-NEXT:    s_nop 0
-; GFX942-NEXT:    global_load_dwordx4 v[0:3], v0, s[0:1]
+; GFX942-NEXT:    global_load_dwordx4 v[4:7], v8, s[0:1] offset:16
+; GFX942-NEXT:    global_load_dwordx4 v[0:3], v8, s[0:1]
 ; GFX942-NEXT:  .LBB4_2:
 ; GFX942-NEXT:    s_or_b64 exec, exec, s[2:3]
 ; GFX942-NEXT:    s_waitcnt vmcnt(0)
@@ -142,14 +132,13 @@ define <8 x i16> @uniform_masked_load_ptr1_mask_v8i16(ptr addrspace(1) inreg noc
 ; GFX942-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX942-NEXT:    v_and_b32_e32 v0, 1, v0
 ; GFX942-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v0
-; GFX942-NEXT:    v_mov_b32_e32 v0, 0
-; GFX942-NEXT:    v_mov_b32_e32 v1, v0
-; GFX942-NEXT:    v_mov_b32_e32 v2, v0
-; GFX942-NEXT:    v_mov_b32_e32 v3, v0
+; GFX942-NEXT:    v_mov_b64_e32 v[0:1], 0
+; GFX942-NEXT:    v_mov_b32_e32 v4, 0
+; GFX942-NEXT:    v_mov_b64_e32 v[2:3], v[0:1]
 ; GFX942-NEXT:    s_and_saveexec_b64 s[2:3], vcc
 ; GFX942-NEXT:    s_cbranch_execz .LBB5_2
 ; GFX942-NEXT:  ; %bb.1: ; %cond.load
-; GFX942-NEXT:    global_load_dwordx4 v[0:3], v0, s[0:1]
+; GFX942-NEXT:    global_load_dwordx4 v[0:3], v4, s[0:1]
 ; GFX942-NEXT:  .LBB5_2:
 ; GFX942-NEXT:    s_or_b64 exec, exec, s[2:3]
 ; GFX942-NEXT:    s_waitcnt vmcnt(0)
@@ -167,14 +156,13 @@ define <8 x half> @uniform_masked_load_ptr1_mask_v8f16(ptr addrspace(1) inreg no
 ; GFX942-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX942-NEXT:    v_and_b32_e32 v0, 1, v0
 ; GFX942-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v0
-; GFX942-NEXT:    v_mov_b32_e32 v0, 0
-; GFX942-NEXT:    v_mov_b32_e32 v1, v0
-; GFX942-NEXT:    v_mov_b32_e32 v2, v0
-; GFX942-NEXT:    v_mov_b32_e32 v3, v0
+; GFX942-NEXT:    v_mov_b64_e32 v[0:1], 0
+; GFX942-NEXT:    v_mov_b32_e32 v4, 0
+; GFX942-NEXT:    v_mov_b64_e32 v[2:3], v[0:1]
 ; GFX942-NEXT:    s_and_saveexec_b64 s[2:3], vcc
 ; GFX942-NEXT:    s_cbranch_execz .LBB6_2
 ; GFX942-NEXT:  ; %bb.1: ; %cond.load
-; GFX942-NEXT:    global_load_dwordx4 v[0:3], v0, s[0:1]
+; GFX942-NEXT:    global_load_dwordx4 v[0:3], v4, s[0:1]
 ; GFX942-NEXT:  .LBB6_2:
 ; GFX942-NEXT:    s_or_b64 exec, exec, s[2:3]
 ; GFX942-NEXT:    s_waitcnt vmcnt(0)
@@ -192,14 +180,13 @@ define <8 x bfloat> @uniform_masked_load_ptr1_mask_v8bf16(ptr addrspace(1) inreg
 ; GFX942-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX942-NEXT:    v_and_b32_e32 v0, 1, v0
 ; GFX942-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v0
-; GFX942-NEXT:    v_mov_b32_e32 v0, 0
-; GFX942-NEXT:    v_mov_b32_e32 v1, v0
-; GFX942-NEXT:    v_mov_b32_e32 v2, v0
-; GFX942-NEXT:    v_mov_b32_e32 v3, v0
+; GFX942-NEXT:    v_mov_b64_e32 v[0:1], 0
+; GFX942-NEXT:    v_mov_b32_e32 v4, 0
+; GFX942-NEXT:    v_mov_b64_e32 v[2:3], v[0:1]
 ; GFX942-NEXT:    s_and_saveexec_b64 s[2:3], vcc
 ; GFX942-NEXT:    s_cbranch_execz .LBB7_2
 ; GFX942-NEXT:  ; %bb.1: ; %cond.load
-; GFX942-NEXT:    global_load_dwordx4 v[0:3], v0, s[0:1]
+; GFX942-NEXT:    global_load_dwordx4 v[0:3], v4, s[0:1]
 ; GFX942-NEXT:  .LBB7_2:
 ; GFX942-NEXT:    s_or_b64 exec, exec, s[2:3]
 ; GFX942-NEXT:    s_waitcnt vmcnt(0)
diff --git a/llvm/test/CodeGen/AMDGPU/mfma-loop.ll b/llvm/test/CodeGen/AMDGPU/mfma-loop.ll
index a0d587ac68ff1..5800538b7232d 100644
--- a/llvm/test/CodeGen/AMDGPU/mfma-loop.ll
+++ b/llvm/test/CodeGen/AMDGPU/mfma-loop.ll
@@ -2498,39 +2498,40 @@ define amdgpu_kernel void @test_mfma_nested_loop_zeroinit(ptr addrspace(1) %arg)
 ;
 ; GFX942-LABEL: test_mfma_nested_loop_zeroinit:
 ; GFX942:       ; %bb.0: ; %entry
-; GFX942-NEXT:    v_accvgpr_write_b32 a0, 0
+; GFX942-NEXT:    v_mov_b64_e32 v[0:1], 0
+; GFX942-NEXT:    v_accvgpr_write_b32 a0, v0
+; GFX942-NEXT:    v_accvgpr_write_b32 a3, v1
+; GFX942-NEXT:    v_accvgpr_write_b32 a5, v1
+; GFX942-NEXT:    v_accvgpr_write_b32 a7, v1
+; GFX942-NEXT:    v_accvgpr_write_b32 a9, v1
+; GFX942-NEXT:    v_accvgpr_write_b32 a11, v1
+; GFX942-NEXT:    v_accvgpr_write_b32 a13, v1
+; GFX942-NEXT:    v_accvgpr_write_b32 a15, v1
+; GFX942-NEXT:    v_accvgpr_write_b32 a17, v1
+; GFX942-NEXT:    v_accvgpr_write_b32 a19, v1
+; GFX942-NEXT:    v_accvgpr_write_b32 a21, v1
+; GFX942-NEXT:    v_accvgpr_write_b32 a23, v1
+; GFX942-NEXT:    v_accvgpr_write_b32 a25, v1
+; GFX942-NEXT:    v_accvgpr_write_b32 a27, v1
+; GFX942-NEXT:    v_accvgpr_write_b32 a29, v1
+; GFX942-NEXT:    v_accvgpr_write_b32 a31, v1
 ; GFX942-NEXT:    s_mov_b32 s0, 0
-; GFX942-NEXT:    v_accvgpr_mov_b32 a1, a0
-; GFX942-NEXT:    v_accvgpr_mov_b32 a2, a0
-; GFX942-NEXT:    v_accvgpr_mov_b32 a3, a0
-; GFX942-NEXT:    v_accvgpr_mov_b32 a4, a0
-; GFX942-NEXT:    v_accvgpr_mov_b32 a5, a0
-; GFX942-NEXT:    v_accvgpr_mov_b32 a6, a0
-; GFX942-NEXT:    v_accvgpr_mov_b32 a7, a0
-; GFX942-NEXT:    v_accvgpr_mov_b32 a8, a0
-; GFX942-NEXT:    v_accvgp...
[truncated]

arsenm (Contributor) commented May 7, 2025

As mentioned in the previous PR, I still think this is a legalization issue. We should be legalizing to more 64-bit elements up front instead of trying to recombine later. It's easier to deconstruct in machine passes than to reconstruct.

Comment on lines +2142 to +2143
if (!ST->hasMovB64() || !TRI->isVGPR(*MRI, Reg) ||
!MRI->hasOneNonDBGUse(Reg) || !TRI->isProperlyAlignedRC(*DefRC))
Contributor

Better to check if the register is compatible with the register class from the instruction definition
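One possible reading of this suggestion, sketched below (not code from the PR; the operand index and the exact getRegClass signature are assumptions): query the register class the instruction definition requires for V_MOV_B64_PSEUDO's def operand and check the REG_SEQUENCE result against it, rather than hard-coding the VGPR and alignment checks.

// Sketch: derive the required class from the instruction definition.
const MCInstrDesc &Desc = TII->get(AMDGPU::V_MOV_B64_PSEUDO);
const TargetRegisterClass *OpRC =
    TII->getRegClass(Desc, /*OpNum=*/0, TRI, *MI.getMF());
// Fold only if the REG_SEQUENCE result class is compatible (a subclass
// of or equal to) what V_MOV_B64_PSEUDO's def operand requires.
if (!OpRC || !OpRC->hasSubClassEq(MRI->getRegClass(Reg)))
  return false;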

JanekvO (Contributor Author)

I tried to do so but couldn't figure it out; is there some interaction between the MCInstrDesc and register classes that I'm missing?

JanekvO (Contributor Author) commented May 7, 2025

As mentioned in the previous PR, I still think this is a legalization issue. We should be legalizing to more 64-bit elements up front instead of trying to recombine later. It's easier to deconstruct in machine passes than to reconstruct.

AFAICT, the relevant instructions are already marked legal (I could be missing something, of course). In the previous PR I observed that SelectionDAG legalization would turn the i32 elements of the vector type into v2i16 immediates bitcast to i32. What I'm seeing now is that the build_vector reaches selection whole, after which the following happens:

%0:v2i32 = build_vector i32<0>, i32<0>
-- amdgpu custom isel -->
%1:i32 = s_mov_b32 0
%0:v2i32 = reg_sequence %1, sub0, %1, sub1

I believe that taking the early, on-construction route would again require code in selection, either as direct build_vector selection code or as DAGCombiner logic (perhaps a predicated TableGen pattern could be used? A rough sketch follows below).
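For concreteness, a rough sketch of what such a predicated pattern might look like for the splat-of-zero case (untested; the GCNPat usage and the HasMovB64 predicate name are assumptions, not verified against the tree):

// Hypothetical pattern: select a zero build_vector directly to
// V_MOV_B64_PSEUDO on subtargets with 64-bit moves. A real pattern set
// would also need a splat matcher to cover non-zero immediates.
def : GCNPat <
  (v2i32 (build_vector (i32 0), (i32 0))),
  (V_MOV_B64_PSEUDO (i64 0))
> {
  let SubtargetPredicate = HasMovB64;
}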
