[AMDGPU] Legalize vector fminimum and fmaximum with VOP3P #138971

rampitec · 2025-05-07T21:30:10Z

Co-authored-by: Matt Arsenault

rampitec · 2025-05-07T21:30:26Z

llvmbot · 2025-05-07T21:31:11Z

@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-backend-amdgpu

Author: Stanislav Mekhanoshin (rampitec)

Changes

Original patches by Matthew Arsenault.

Patch is 151.05 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/138971.diff

7 Files Affected:

(modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+16)
(modified) llvm/test/Analysis/CostModel/AMDGPU/maximum.ll (+12-12)
(modified) llvm/test/Analysis/CostModel/AMDGPU/minimum.ll (+12-12)
(modified) llvm/test/CodeGen/AMDGPU/fmaximum3.ll (+161-167)
(modified) llvm/test/CodeGen/AMDGPU/fminimum3.ll (+161-167)
(modified) llvm/test/CodeGen/AMDGPU/vector-reduce-fmaximum.ll (+284-303)
(modified) llvm/test/CodeGen/AMDGPU/vector-reduce-fminimum.ll (+284-303)

diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index b08b6b46fc52c..5b53e3d0c2373 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -861,6 +861,22 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
   if (Subtarget->hasIEEEMinMax()) {
     setOperationAction({ISD::FMAXIMUM, ISD::FMINIMUM},
                        {MVT::f16, MVT::f32, MVT::f64, MVT::v2f16}, Legal);
+  } else {
+    // FIXME: For nnan fmaximum, emit the fmaximum3 instead of fmaxnum
+    if (Subtarget->hasMinimum3Maximum3F32())
+      setOperationAction({ISD::FMAXIMUM, ISD::FMINIMUM}, MVT::f32, Legal);
+
+    if (Subtarget->hasMinimum3Maximum3PKF16()) {
+      setOperationAction({ISD::FMAXIMUM, ISD::FMINIMUM}, MVT::v2f16, Legal);
+
+      // If only the vector form is available, we need to widen to a vector.
+      if (!Subtarget->hasMinimum3Maximum3F16())
+        setOperationAction({ISD::FMAXIMUM, ISD::FMINIMUM}, MVT::f16, Custom);
+    }
+  }
+
+  if (Subtarget->hasVOP3PInsts()) {
+    // We want to break these into v2f16 pieces, not scalarize.
     setOperationAction({ISD::FMINIMUM, ISD::FMAXIMUM},
                        {MVT::v4f16, MVT::v8f16, MVT::v16f16, MVT::v32f16},
                        Custom);
diff --git a/llvm/test/Analysis/CostModel/AMDGPU/maximum.ll b/llvm/test/Analysis/CostModel/AMDGPU/maximum.ll
index 603e04fc7a7a7..3774c6c0cbbee 100644
--- a/llvm/test/Analysis/CostModel/AMDGPU/maximum.ll
+++ b/llvm/test/Analysis/CostModel/AMDGPU/maximum.ll
@@ -11,19 +11,19 @@ define void @maximum_f16() {
 ; GFX950-FASTF64-LABEL: 'maximum_f16'
 ; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %f16 = call half @llvm.maximum.f16(half undef, half undef)
 ; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %v2f16 = call <2 x half> @llvm.maximum.v2f16(<2 x half> undef, <2 x half> undef)
-; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v3f16 = call <3 x half> @llvm.maximum.v3f16(<3 x half> undef, <3 x half> undef)
-; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %v4f16 = call <4 x half> @llvm.maximum.v4f16(<4 x half> undef, <4 x half> undef)
-; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %v8f16 = call <8 x half> @llvm.maximum.v8f16(<8 x half> undef, <8 x half> undef)
-; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %v16f16 = call <16 x half> @llvm.maximum.v16f16(<16 x half> undef, <16 x half> undef)
+; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v3f16 = call <3 x half> @llvm.maximum.v3f16(<3 x half> undef, <3 x half> undef)
+; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v4f16 = call <4 x half> @llvm.maximum.v4f16(<4 x half> undef, <4 x half> undef)
+; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v8f16 = call <8 x half> @llvm.maximum.v8f16(<8 x half> undef, <8 x half> undef)
+; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v16f16 = call <16 x half> @llvm.maximum.v16f16(<16 x half> undef, <16 x half> undef)
 ; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: ret void
 ;
 ; GFX9-LABEL: 'maximum_f16'
 ; GFX9-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %f16 = call half @llvm.maximum.f16(half undef, half undef)
 ; GFX9-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %v2f16 = call <2 x half> @llvm.maximum.v2f16(<2 x half> undef, <2 x half> undef)
-; GFX9-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: %v3f16 = call <3 x half> @llvm.maximum.v3f16(<3 x half> undef, <3 x half> undef)
-; GFX9-NEXT:  Cost Model: Found an estimated cost of 43 for instruction: %v4f16 = call <4 x half> @llvm.maximum.v4f16(<4 x half> undef, <4 x half> undef)
-; GFX9-NEXT:  Cost Model: Found an estimated cost of 87 for instruction: %v8f16 = call <8 x half> @llvm.maximum.v8f16(<8 x half> undef, <8 x half> undef)
-; GFX9-NEXT:  Cost Model: Found an estimated cost of 175 for instruction: %v16f16 = call <16 x half> @llvm.maximum.v16f16(<16 x half> undef, <16 x half> undef)
+; GFX9-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v3f16 = call <3 x half> @llvm.maximum.v3f16(<3 x half> undef, <3 x half> undef)
+; GFX9-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v4f16 = call <4 x half> @llvm.maximum.v4f16(<4 x half> undef, <4 x half> undef)
+; GFX9-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v8f16 = call <8 x half> @llvm.maximum.v8f16(<8 x half> undef, <8 x half> undef)
+; GFX9-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v16f16 = call <16 x half> @llvm.maximum.v16f16(<16 x half> undef, <16 x half> undef)
 ; GFX9-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: ret void
 ;
 ; SLOWF64-LABEL: 'maximum_f16'
@@ -38,10 +38,10 @@ define void @maximum_f16() {
 ; GFX9-SIZE-LABEL: 'maximum_f16'
 ; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %f16 = call half @llvm.maximum.f16(half undef, half undef)
 ; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2f16 = call <2 x half> @llvm.maximum.v2f16(<2 x half> undef, <2 x half> undef)
-; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v3f16 = call <3 x half> @llvm.maximum.v3f16(<3 x half> undef, <3 x half> undef)
-; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %v4f16 = call <4 x half> @llvm.maximum.v4f16(<4 x half> undef, <4 x half> undef)
-; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %v8f16 = call <8 x half> @llvm.maximum.v8f16(<8 x half> undef, <8 x half> undef)
-; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 31 for instruction: %v16f16 = call <16 x half> @llvm.maximum.v16f16(<16 x half> undef, <16 x half> undef)
+; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v3f16 = call <3 x half> @llvm.maximum.v3f16(<3 x half> undef, <3 x half> undef)
+; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v4f16 = call <4 x half> @llvm.maximum.v4f16(<4 x half> undef, <4 x half> undef)
+; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v8f16 = call <8 x half> @llvm.maximum.v8f16(<8 x half> undef, <8 x half> undef)
+; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v16f16 = call <16 x half> @llvm.maximum.v16f16(<16 x half> undef, <16 x half> undef)
 ; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: ret void
 ;
 ; SLOW-SIZE-LABEL: 'maximum_f16'
diff --git a/llvm/test/Analysis/CostModel/AMDGPU/minimum.ll b/llvm/test/Analysis/CostModel/AMDGPU/minimum.ll
index 4507ba4929f1b..24b9549dfe3a4 100644
--- a/llvm/test/Analysis/CostModel/AMDGPU/minimum.ll
+++ b/llvm/test/Analysis/CostModel/AMDGPU/minimum.ll
@@ -11,19 +11,19 @@ define void @minimum_f16() {
 ; GFX950-FASTF64-LABEL: 'minimum_f16'
 ; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %f16 = call half @llvm.minimum.f16(half undef, half undef)
 ; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %v2f16 = call <2 x half> @llvm.minimum.v2f16(<2 x half> undef, <2 x half> undef)
-; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v3f16 = call <3 x half> @llvm.minimum.v3f16(<3 x half> undef, <3 x half> undef)
-; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %v4f16 = call <4 x half> @llvm.minimum.v4f16(<4 x half> undef, <4 x half> undef)
-; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %v8f16 = call <8 x half> @llvm.minimum.v8f16(<8 x half> undef, <8 x half> undef)
-; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %v16f16 = call <16 x half> @llvm.minimum.v16f16(<16 x half> undef, <16 x half> undef)
+; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v3f16 = call <3 x half> @llvm.minimum.v3f16(<3 x half> undef, <3 x half> undef)
+; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v4f16 = call <4 x half> @llvm.minimum.v4f16(<4 x half> undef, <4 x half> undef)
+; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v8f16 = call <8 x half> @llvm.minimum.v8f16(<8 x half> undef, <8 x half> undef)
+; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v16f16 = call <16 x half> @llvm.minimum.v16f16(<16 x half> undef, <16 x half> undef)
 ; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: ret void
 ;
 ; GFX9-LABEL: 'minimum_f16'
 ; GFX9-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %f16 = call half @llvm.minimum.f16(half undef, half undef)
 ; GFX9-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %v2f16 = call <2 x half> @llvm.minimum.v2f16(<2 x half> undef, <2 x half> undef)
-; GFX9-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: %v3f16 = call <3 x half> @llvm.minimum.v3f16(<3 x half> undef, <3 x half> undef)
-; GFX9-NEXT:  Cost Model: Found an estimated cost of 43 for instruction: %v4f16 = call <4 x half> @llvm.minimum.v4f16(<4 x half> undef, <4 x half> undef)
-; GFX9-NEXT:  Cost Model: Found an estimated cost of 87 for instruction: %v8f16 = call <8 x half> @llvm.minimum.v8f16(<8 x half> undef, <8 x half> undef)
-; GFX9-NEXT:  Cost Model: Found an estimated cost of 175 for instruction: %v16f16 = call <16 x half> @llvm.minimum.v16f16(<16 x half> undef, <16 x half> undef)
+; GFX9-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v3f16 = call <3 x half> @llvm.minimum.v3f16(<3 x half> undef, <3 x half> undef)
+; GFX9-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v4f16 = call <4 x half> @llvm.minimum.v4f16(<4 x half> undef, <4 x half> undef)
+; GFX9-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v8f16 = call <8 x half> @llvm.minimum.v8f16(<8 x half> undef, <8 x half> undef)
+; GFX9-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v16f16 = call <16 x half> @llvm.minimum.v16f16(<16 x half> undef, <16 x half> undef)
 ; GFX9-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: ret void
 ;
 ; SLOWF64-LABEL: 'minimum_f16'
@@ -38,10 +38,10 @@ define void @minimum_f16() {
 ; GFX9-SIZE-LABEL: 'minimum_f16'
 ; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %f16 = call half @llvm.minimum.f16(half undef, half undef)
 ; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2f16 = call <2 x half> @llvm.minimum.v2f16(<2 x half> undef, <2 x half> undef)
-; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v3f16 = call <3 x half> @llvm.minimum.v3f16(<3 x half> undef, <3 x half> undef)
-; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %v4f16 = call <4 x half> @llvm.minimum.v4f16(<4 x half> undef, <4 x half> undef)
-; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %v8f16 = call <8 x half> @llvm.minimum.v8f16(<8 x half> undef, <8 x half> undef)
-; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 31 for instruction: %v16f16 = call <16 x half> @llvm.minimum.v16f16(<16 x half> undef, <16 x half> undef)
+; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v3f16 = call <3 x half> @llvm.minimum.v3f16(<3 x half> undef, <3 x half> undef)
+; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v4f16 = call <4 x half> @llvm.minimum.v4f16(<4 x half> undef, <4 x half> undef)
+; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v8f16 = call <8 x half> @llvm.minimum.v8f16(<8 x half> undef, <8 x half> undef)
+; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v16f16 = call <16 x half> @llvm.minimum.v16f16(<16 x half> undef, <16 x half> undef)
 ; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: ret void
 ;
 ; SLOW-SIZE-LABEL: 'minimum_f16'
diff --git a/llvm/test/CodeGen/AMDGPU/fmaximum3.ll b/llvm/test/CodeGen/AMDGPU/fmaximum3.ll
index 567202be69fa6..53d940e1e6c1a 100644
--- a/llvm/test/CodeGen/AMDGPU/fmaximum3.ll
+++ b/llvm/test/CodeGen/AMDGPU/fmaximum3.ll
@@ -2375,21 +2375,21 @@ define <3 x half> @v_fmaximum3_v3f16(<3 x half> %a, <3 x half> %b, <3 x half> %c
 ; GFX942-NEXT:    v_cndmask_b32_e32 v8, v7, v6, vcc
 ; GFX942-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
 ; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, v0, v2 src0_sel:WORD_1 src1_sel:WORD_1
-; GFX942-NEXT:    v_pk_max_f16 v2, v1, v3
-; GFX942-NEXT:    s_nop 0
+; GFX942-NEXT:    s_nop 1
 ; GFX942-NEXT:    v_cndmask_b32_e32 v0, v7, v6, vcc
+; GFX942-NEXT:    v_pk_max_f16 v6, v1, v3
 ; GFX942-NEXT:    v_cmp_o_f16_e32 vcc, v1, v3
-; GFX942-NEXT:    s_nop 1
-; GFX942-NEXT:    v_cndmask_b32_e32 v6, v7, v2, vcc
-; GFX942-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GFX942-NEXT:    v_perm_b32 v2, v0, v8, s0
+; GFX942-NEXT:    v_pk_max_f16 v2, v4, v2
+; GFX942-NEXT:    v_cndmask_b32_e32 v9, v7, v6, vcc
+; GFX942-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
 ; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, v1, v3 src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX942-NEXT:    s_nop 1
-; GFX942-NEXT:    v_cndmask_b32_e32 v1, v7, v2, vcc
-; GFX942-NEXT:    v_perm_b32 v1, v1, v6, s0
+; GFX942-NEXT:    v_cndmask_b32_e32 v1, v7, v6, vcc
+; GFX942-NEXT:    v_perm_b32 v1, v1, v9, s0
 ; GFX942-NEXT:    v_pk_max_f16 v1, v5, v1
-; GFX942-NEXT:    v_cmp_o_f16_e32 vcc, v5, v6
-; GFX942-NEXT:    v_perm_b32 v2, v0, v8, s0
-; GFX942-NEXT:    v_pk_max_f16 v2, v4, v2
+; GFX942-NEXT:    v_cmp_o_f16_e32 vcc, v5, v9
+; GFX942-NEXT:    s_nop 1
 ; GFX942-NEXT:    v_cndmask_b32_e32 v1, v7, v1, vcc
 ; GFX942-NEXT:    v_cmp_o_f16_e32 vcc, v4, v8
 ; GFX942-NEXT:    s_nop 1
@@ -2437,21 +2437,21 @@ define <3 x half> @v_fmaximum3_v3f16_commute(<3 x half> %a, <3 x half> %b, <3 x
 ; GFX942-NEXT:    v_cndmask_b32_e32 v8, v7, v6, vcc
 ; GFX942-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
 ; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, v0, v2 src0_sel:WORD_1 src1_sel:WORD_1
-; GFX942-NEXT:    v_pk_max_f16 v2, v1, v3
-; GFX942-NEXT:    s_nop 0
+; GFX942-NEXT:    s_nop 1
 ; GFX942-NEXT:    v_cndmask_b32_e32 v0, v7, v6, vcc
+; GFX942-NEXT:    v_pk_max_f16 v6, v1, v3
 ; GFX942-NEXT:    v_cmp_o_f16_e32 vcc, v1, v3
-; GFX942-NEXT:    s_nop 1
-; GFX942-NEXT:    v_cndmask_b32_e32 v6, v7, v2, vcc
-; GFX942-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GFX942-NEXT:    v_perm_b32 v2, v0, v8, s0
+; GFX942-NEXT:    v_pk_max_f16 v2, v2, v4
+; GFX942-NEXT:    v_cndmask_b32_e32 v9, v7, v6, vcc
+; GFX942-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
 ; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, v1, v3 src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX942-NEXT:    s_nop 1
-; GFX942-NEXT:    v_cndmask_b32_e32 v1, v7, v2, vcc
-; GFX942-NEXT:    v_perm_b32 v1, v1, v6, s0
+; GFX942-NEXT:    v_cndmask_b32_e32 v1, v7, v6, vcc
+; GFX942-NEXT:    v_perm_b32 v1, v1, v9, s0
 ; GFX942-NEXT:    v_pk_max_f16 v1, v1, v5
-; GFX942-NEXT:    v_cmp_o_f16_e32 vcc, v6, v5
-; GFX942-NEXT:    v_perm_b32 v2, v0, v8, s0
-; GFX942-NEXT:    v_pk_max_f16 v2, v2, v4
+; GFX942-NEXT:    v_cmp_o_f16_e32 vcc, v9, v5
+; GFX942-NEXT:    s_nop 1
 ; GFX942-NEXT:    v_cndmask_b32_e32 v1, v7, v1, vcc
 ; GFX942-NEXT:    v_cmp_o_f16_e32 vcc, v8, v4
 ; GFX942-NEXT:    s_nop 1
@@ -2500,40 +2500,40 @@ define <3 x half> @v_fmaximum3_v3f16__fabs_all(<3 x half> %a, <3 x half> %b, <3
 ; GFX942-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX942-NEXT:    v_and_b32_e32 v7, 0x7fff7fff, v1
 ; GFX942-NEXT:    v_and_b32_e32 v9, 0x7fff7fff, v3
+; GFX942-NEXT:    v_pk_max_f16 v7, v7, v9
 ; GFX942-NEXT:    v_and_b32_e32 v6, 0x7fff7fff, v0
 ; GFX942-NEXT:    v_and_b32_e32 v8, 0x7fff7fff, v2
-; GFX942-NEXT:    v_pk_max_f16 v7, v7, v9
-; GFX942-NEXT:    v_mov_b32_e32 v12, 0x7e00
 ; GFX942-NEXT:    v_lshrrev_b32_e32 v9, 16, v7
+; GFX942-NEXT:    v_mov_b32_e32 v12, 0x7e00
 ; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, |v1|, |v3| src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX942-NEXT:    v_pk_max_f16 v6, v6, v8
 ; GFX942-NEXT:    s_mov_b32 s0, 0x5040100
 ; GFX942-NEXT:    v_cndmask_b32_e32 v9, v12, v9, vcc
-; GFX942-NEXT:    v_lshrrev_b32_e32 v8, 16, v6
-; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, |v0|, |v2| src0_sel:WORD_1 src1_sel:WORD_1
+; GFX942-NEXT:    v_cmp_o_f16_e64 vcc, |v1|, |v3|
 ; GFX942-NEXT:    v_and_b32_e32 v11, 0x7fff7fff, v4
 ; GFX942-NEXT:    v_and_b32_e32 v10, 0x7fff7fff, v5
-; GFX942-NEXT:    v_cndmask_b32_e32 v8, v12, v8, vcc
-; GFX942-NEXT:    v_cmp_o_f16_e64 vcc, |v1|, |v3|
-; GFX942-NEXT:    s_nop 1
 ; GFX942-NEXT:    v_cndmask_b32_e32 v1, v12, v7, vcc
+; GFX942-NEXT:    v_lshrrev_b32_e32 v7, 16, v6
+; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, |v0|, |v2| src0_sel:WORD_1 src1_sel:WORD_1
+; GFX942-NEXT:    v_perm_b32 v3, v9, v1, s0
+; GFX942-NEXT:    v_pk_max_f16 v3, v3, v10
+; GFX942-NEXT:    v_cndmask_b32_e32 v7, v12, v7, vcc
 ; GFX942-NEXT:    v_cmp_o_f16_e64 vcc, |v0|, |v2|
 ; GFX942-NEXT:    s_nop 1
 ; GFX942-NEXT:    v_cndmask_b32_e32 v0, v12, v6, vcc
-; GFX942-NEXT:    v_perm_b32 v2, v8, v0, s0
+; GFX942-NEXT:    v_perm_b32 v2, v7, v0, s0
 ; GFX942-NEXT:    v_pk_max_f16 v2, v2, v11
-; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, v8, |v4| src0_sel:DWORD src1_sel:WORD_1
-; GFX942-NEXT:    v_lshrrev_b32_e32 v3, 16, v2
-; GFX942-NEXT:    v_perm_b32 v6, v9, v1, s0
-; GFX942-NEXT:    v_cndmask_b32_e32 v3, v12, v3, vcc
-; GFX942-NEXT:    v_pk_max_f16 v6, v6, v10
+; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, v7, |v4| src0_sel:DWORD src1_sel:WORD_1
+; GFX942-NEXT:    v_lshrrev_b32_e32 v6, 16, v2
+; GFX942-NEXT:    s_nop 0
+; GFX942-NEXT:    v_cndmask_b32_e32 v6, v12, v6, vcc
 ; GFX942-NEXT:    v_cmp_o_f16_e64 vcc, v1, |v5|
 ; GFX942-NEXT:    s_nop 1
-; GFX942-NEXT:    v_cndmask_b32_e32 v1, v12, v6, vcc
+; GFX942-NEXT:    v_cndmask_b32_e32 v1, v12, v3, vcc
 ; GFX942-NEXT:    v_cmp_o_f16_e64 vcc, v0, |v4|
 ; GFX942-NEXT:    s_nop 1
 ; GFX942-NEXT:    v_cndmask_b32_e32 v0, v12, v2, vcc
-; GFX942-NEXT:    v_perm_b32 v0, v3, v0, s0
+; GFX942-NEXT:    v_perm_b32 v0, v6, v0, s0
 ; GFX942-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX950-LABEL: v_fmaximum3_v3f16__fabs_all:
@@ -2582,21 +2582,21 @@ define <3 x half> @v_fmaximum3_v3f16__fneg_all(<3 x half> %a, <3 x half> %b, <3
 ; GFX942-NEXT:    v_cndmask_b32_e32 v8, v7, v6, vcc
 ; GFX942-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
 ; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, -v0, -v2 src0_sel:WORD_1 src1_sel:WORD_1
-; GFX942-NEXT:    v_pk_max_f16 v2, v1, v3 neg_lo:[1,1] neg_hi:[1,1]
-; GFX942-NEXT:    s_nop 0
+; GFX942-NEXT:    s_nop 1
 ; GFX942-NEXT:    v_cndmask_b32_e32 v0, v7, v6, vcc
+; GFX942-NEXT:    v_pk_max_f16 v6, v1, v3 neg_lo:[1,1] neg_hi:[1,1]
 ; GFX942-NEXT:    v_cmp_o_f16_e64 vcc, -v1, -v3
-; GFX942-NEXT:    s_nop 1
-; GFX942-NEXT:    v_cndmask_b32_e32 v6, v7, v2, vcc
-; GFX942-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GFX942-NEXT:    v_perm_b32 v2, v0, v8, s0
+; GFX942-NEXT:    v_pk_max_f16 v2, v2, v4 neg_lo:[0,1] neg_hi:[0,1]
+; GFX942-NEXT:    v_cndmask_b32_e32 v9, v7, v6, vcc
+; GFX942-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
 ; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, -v1, -v3 src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX942-NEXT:    s_nop 1
-; GFX942-NEXT:    v_cndmask_b32_e32 v1, v7, v2, vcc
-; GFX942-NEXT:    v_perm_b32 v1, v1, v6, s0
+; GFX942-NEXT:    v_cndmask_b32_e32 v1, v7, v6, vcc
+; GFX942-NEXT:    v_perm_b32 v1, v1, v9, s0
 ; GFX942-NEXT:    v_pk_max_f16 v1, v1, v5 neg_lo:[0,1] neg_hi:[0,1]
-; GFX942-NEXT:    v_cmp_o_f16_e64 vcc, v6, -v5
-; GFX942-NEXT:    v_perm_b32 v2, v0, v8, s0
-; GFX942-NEXT:    v_pk_max_f16 v2, v2, v4 neg_lo:[0,1] neg_hi:[0,1]
+; GFX942-NEXT:    v_cmp_o_f16_e64 vcc, v9, -v5
+; GFX942-NEXT:    s_nop 1
 ; GFX942-NEXT:    v_cndmask_b32_e32 v1, v7, v1, vcc
 ; GFX942-NEXT:    v_cmp_o_f16_e64 vcc, v8, -v4
 ; GFX942-NEXT:    s_nop 1
@@ -2643,22 +2643,21 @@ define <3 x half> @v_fmaximum3_v3f16__inlineimm1(<3 x half> %a, <3 x half> %c) {
 ; GFX942-NEXT:    v_pk_max_f16 v4, v0, 2.0 op_sel_hi:[1,0]
 ; GFX942-NEXT:    v_mov_b32_e32 v5, 0x7e00
 ; GFX942-NEXT:    v_cndmask_b32_sdwa v6, v5, v4, vcc dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
-; GFX942-NEX...
[truncated]

arsenm

This was already upstreamed in e57b327

arsenm · 2025-05-08T09:46:51Z

llvm/test/CodeGen/AMDGPU/fminimum3.ll

-; GFX942-NEXT:    v_cmp_o_f16_e32 vcc, v4, v4
+; GFX942-NEXT:    v_cndmask_b32_sdwa v1, v5, v4, vcc dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
+; GFX942-NEXT:    v_perm_b32 v3, v1, v7, s0
+; GFX942-NEXT:    v_pk_min_f16 v3, v3, 4.0 op_sel_hi:[1,0]


The test changes are all for gfx942; something else is going on here.

rampitec · May 8, 2025

It now exactly matches downstream, though.

rampitec · May 8, 2025

In this version it is not even gfx950-specific.

mariusz-sikora-at-amd · 2025-05-08T10:01:13Z

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

+    if (Subtarget->hasMinimum3Maximum3F32())
+      setOperationAction({ISD::FMAXIMUM, ISD::FMINIMUM}, MVT::f32, Legal);
+
+    if (Subtarget->hasMinimum3Maximum3PKF16()) {


The same code is added at L888, under the else of if (Subtarget->hasVOP3PInsts()). We should probably remove the code at L888 and L892.

rampitec · May 8, 2025

Yes, it was misplaced. Having f16 and min3/max3 without VOP3P is simply an impossible combination. I have removed the whole else block and nothing changed.
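To make the control flow under discussion easier to follow, the decision logic of the `SIISelLowering.cpp` hunk quoted on this page can be modelled as a small standalone function. This is a sketch under stated assumptions, not the LLVM implementation. `Features`, `Ty`, `Action` and `fminmaxAction` are hypothetical names. `Action::Unchanged` stands for "whatever the constructor set earlier", which the hunk does not show. The branch was force-pushed after this exchange, so the merged commit may differ from what is modelled here.

```cpp
// Hypothetical stand-ins for the subtarget predicates named in the hunk.
struct Features {
  bool ieeeMinMax = false;            // hasIEEEMinMax()
  bool minimum3Maximum3F32 = false;   // hasMinimum3Maximum3F32()
  bool minimum3Maximum3PKF16 = false; // hasMinimum3Maximum3PKF16()
  bool minimum3Maximum3F16 = false;   // hasMinimum3Maximum3F16()
  bool vop3p = false;                 // hasVOP3PInsts()
};

enum class Ty { f16, f32, f64, v2f16, v4f16, v8f16, v16f16, v32f16 };
enum class Action { Unchanged, Legal, Custom };

// Returns the action the quoted hunk would assign to FMAXIMUM/FMINIMUM for
// type t, or Unchanged if the hunk does not touch that type.
constexpr Action fminmaxAction(const Features &f, Ty t) {
  Action act = Action::Unchanged;
  if (f.ieeeMinMax) {
    if (t == Ty::f16 || t == Ty::f32 || t == Ty::f64 || t == Ty::v2f16)
      act = Action::Legal;
  } else {
    if (f.minimum3Maximum3F32 && t == Ty::f32)
      act = Action::Legal;
    if (f.minimum3Maximum3PKF16) {
      if (t == Ty::v2f16)
        act = Action::Legal;
      // Only the packed form exists, so scalar f16 needs custom widening.
      if (!f.minimum3Maximum3F16 && t == Ty::f16)
        act = Action::Custom;
    }
  }
  // Wide f16 vectors are split into v2f16 pieces rather than scalarized.
  if (f.vop3p && (t == Ty::v4f16 || t == Ty::v8f16 || t == Ty::v16f16 ||
                  t == Ty::v32f16))
    act = Action::Custom;
  return act;
}
```

In this model the wide-vector Custom action depends only on the VOP3P flag and not on any of the min3/max3 flags, which is consistent with the remark above that the change is not gfx950-specific.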

rampitec · 2025-05-08T19:31:55Z

> This was already upstreamed in e57b327

Only for f32.

rampitec requested a review from arsenm May 7, 2025 21:30

rampitec marked this pull request as ready for review May 7, 2025 21:30

llvmbot added backend:AMDGPU llvm:analysis labels May 7, 2025

arsenm reviewed May 8, 2025

mariusz-sikora-at-amd reviewed May 8, 2025

[AMDGPU] Legalize fminimum and fmaximum for gfx950

6d72957

Original patches by Matthew Arsenault.

rampitec force-pushed the users/rampitec/05-07-_amdgpu_legalize_fminimum_and_fmaximum_for_gfx950 branch from 8bad4c6 to 6d72957 Compare May 8, 2025 19:36

rampitec changed the title from "[AMDGPU] Legalize fminimum and fmaximum for gfx950" to "[AMDGPU] Legalize vector fminimum and fmaximum with VOP3P" May 8, 2025

arsenm approved these changes May 8, 2025

rampitec merged commit d2c5fbe into main May 9, 2025
9 of 11 checks passed

rampitec deleted the users/rampitec/05-07-_amdgpu_legalize_fminimum_and_fmaximum_for_gfx950 branch May 9, 2025 05:31
