[AMDGPU] Legalize vector fminimum and fmaximum with VOP3P #138971

rampitec
Collaborator

@rampitec rampitec commented May 7, 2025

Co-authored-by: Matt Arsenault [email protected]

@rampitec rampitec requested a review from arsenm May 7, 2025 21:30
@rampitec rampitec marked this pull request as ready for review May 7, 2025 21:30
@llvmbot
Member

llvmbot commented May 7, 2025

@llvm/pr-subscribers-llvm-analysis

@llvm/pr-subscribers-backend-amdgpu

Author: Stanislav Mekhanoshin (rampitec)

Changes

Original patches by Matthew Arsenault.


Patch is 151.05 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/138971.diff

7 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+16)
  • (modified) llvm/test/Analysis/CostModel/AMDGPU/maximum.ll (+12-12)
  • (modified) llvm/test/Analysis/CostModel/AMDGPU/minimum.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/fmaximum3.ll (+161-167)
  • (modified) llvm/test/CodeGen/AMDGPU/fminimum3.ll (+161-167)
  • (modified) llvm/test/CodeGen/AMDGPU/vector-reduce-fmaximum.ll (+284-303)
  • (modified) llvm/test/CodeGen/AMDGPU/vector-reduce-fminimum.ll (+284-303)
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index b08b6b46fc52c..5b53e3d0c2373 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -861,6 +861,22 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
   if (Subtarget->hasIEEEMinMax()) {
     setOperationAction({ISD::FMAXIMUM, ISD::FMINIMUM},
                        {MVT::f16, MVT::f32, MVT::f64, MVT::v2f16}, Legal);
+  } else {
+    // FIXME: For nnan fmaximum, emit the fmaximum3 instead of fmaxnum
+    if (Subtarget->hasMinimum3Maximum3F32())
+      setOperationAction({ISD::FMAXIMUM, ISD::FMINIMUM}, MVT::f32, Legal);
+
+    if (Subtarget->hasMinimum3Maximum3PKF16()) {
+      setOperationAction({ISD::FMAXIMUM, ISD::FMINIMUM}, MVT::v2f16, Legal);
+
+      // If only the vector form is available, we need to widen to a vector.
+      if (!Subtarget->hasMinimum3Maximum3F16())
+        setOperationAction({ISD::FMAXIMUM, ISD::FMINIMUM}, MVT::f16, Custom);
+    }
+  }
+
+  if (Subtarget->hasVOP3PInsts()) {
+    // We want to break these into v2f16 pieces, not scalarize.
     setOperationAction({ISD::FMINIMUM, ISD::FMAXIMUM},
                        {MVT::v4f16, MVT::v8f16, MVT::v16f16, MVT::v32f16},
                        Custom);
diff --git a/llvm/test/Analysis/CostModel/AMDGPU/maximum.ll b/llvm/test/Analysis/CostModel/AMDGPU/maximum.ll
index 603e04fc7a7a7..3774c6c0cbbee 100644
--- a/llvm/test/Analysis/CostModel/AMDGPU/maximum.ll
+++ b/llvm/test/Analysis/CostModel/AMDGPU/maximum.ll
@@ -11,19 +11,19 @@ define void @maximum_f16() {
 ; GFX950-FASTF64-LABEL: 'maximum_f16'
 ; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %f16 = call half @llvm.maximum.f16(half undef, half undef)
 ; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %v2f16 = call <2 x half> @llvm.maximum.v2f16(<2 x half> undef, <2 x half> undef)
-; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v3f16 = call <3 x half> @llvm.maximum.v3f16(<3 x half> undef, <3 x half> undef)
-; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %v4f16 = call <4 x half> @llvm.maximum.v4f16(<4 x half> undef, <4 x half> undef)
-; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %v8f16 = call <8 x half> @llvm.maximum.v8f16(<8 x half> undef, <8 x half> undef)
-; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %v16f16 = call <16 x half> @llvm.maximum.v16f16(<16 x half> undef, <16 x half> undef)
+; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v3f16 = call <3 x half> @llvm.maximum.v3f16(<3 x half> undef, <3 x half> undef)
+; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v4f16 = call <4 x half> @llvm.maximum.v4f16(<4 x half> undef, <4 x half> undef)
+; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v8f16 = call <8 x half> @llvm.maximum.v8f16(<8 x half> undef, <8 x half> undef)
+; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v16f16 = call <16 x half> @llvm.maximum.v16f16(<16 x half> undef, <16 x half> undef)
 ; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: ret void
 ;
 ; GFX9-LABEL: 'maximum_f16'
 ; GFX9-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %f16 = call half @llvm.maximum.f16(half undef, half undef)
 ; GFX9-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %v2f16 = call <2 x half> @llvm.maximum.v2f16(<2 x half> undef, <2 x half> undef)
-; GFX9-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: %v3f16 = call <3 x half> @llvm.maximum.v3f16(<3 x half> undef, <3 x half> undef)
-; GFX9-NEXT:  Cost Model: Found an estimated cost of 43 for instruction: %v4f16 = call <4 x half> @llvm.maximum.v4f16(<4 x half> undef, <4 x half> undef)
-; GFX9-NEXT:  Cost Model: Found an estimated cost of 87 for instruction: %v8f16 = call <8 x half> @llvm.maximum.v8f16(<8 x half> undef, <8 x half> undef)
-; GFX9-NEXT:  Cost Model: Found an estimated cost of 175 for instruction: %v16f16 = call <16 x half> @llvm.maximum.v16f16(<16 x half> undef, <16 x half> undef)
+; GFX9-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v3f16 = call <3 x half> @llvm.maximum.v3f16(<3 x half> undef, <3 x half> undef)
+; GFX9-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v4f16 = call <4 x half> @llvm.maximum.v4f16(<4 x half> undef, <4 x half> undef)
+; GFX9-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v8f16 = call <8 x half> @llvm.maximum.v8f16(<8 x half> undef, <8 x half> undef)
+; GFX9-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v16f16 = call <16 x half> @llvm.maximum.v16f16(<16 x half> undef, <16 x half> undef)
 ; GFX9-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: ret void
 ;
 ; SLOWF64-LABEL: 'maximum_f16'
@@ -38,10 +38,10 @@ define void @maximum_f16() {
 ; GFX9-SIZE-LABEL: 'maximum_f16'
 ; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %f16 = call half @llvm.maximum.f16(half undef, half undef)
 ; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2f16 = call <2 x half> @llvm.maximum.v2f16(<2 x half> undef, <2 x half> undef)
-; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v3f16 = call <3 x half> @llvm.maximum.v3f16(<3 x half> undef, <3 x half> undef)
-; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %v4f16 = call <4 x half> @llvm.maximum.v4f16(<4 x half> undef, <4 x half> undef)
-; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %v8f16 = call <8 x half> @llvm.maximum.v8f16(<8 x half> undef, <8 x half> undef)
-; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 31 for instruction: %v16f16 = call <16 x half> @llvm.maximum.v16f16(<16 x half> undef, <16 x half> undef)
+; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v3f16 = call <3 x half> @llvm.maximum.v3f16(<3 x half> undef, <3 x half> undef)
+; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v4f16 = call <4 x half> @llvm.maximum.v4f16(<4 x half> undef, <4 x half> undef)
+; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v8f16 = call <8 x half> @llvm.maximum.v8f16(<8 x half> undef, <8 x half> undef)
+; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v16f16 = call <16 x half> @llvm.maximum.v16f16(<16 x half> undef, <16 x half> undef)
 ; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: ret void
 ;
 ; SLOW-SIZE-LABEL: 'maximum_f16'
diff --git a/llvm/test/Analysis/CostModel/AMDGPU/minimum.ll b/llvm/test/Analysis/CostModel/AMDGPU/minimum.ll
index 4507ba4929f1b..24b9549dfe3a4 100644
--- a/llvm/test/Analysis/CostModel/AMDGPU/minimum.ll
+++ b/llvm/test/Analysis/CostModel/AMDGPU/minimum.ll
@@ -11,19 +11,19 @@ define void @minimum_f16() {
 ; GFX950-FASTF64-LABEL: 'minimum_f16'
 ; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %f16 = call half @llvm.minimum.f16(half undef, half undef)
 ; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %v2f16 = call <2 x half> @llvm.minimum.v2f16(<2 x half> undef, <2 x half> undef)
-; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 8 for instruction: %v3f16 = call <3 x half> @llvm.minimum.v3f16(<3 x half> undef, <3 x half> undef)
-; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 11 for instruction: %v4f16 = call <4 x half> @llvm.minimum.v4f16(<4 x half> undef, <4 x half> undef)
-; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 23 for instruction: %v8f16 = call <8 x half> @llvm.minimum.v8f16(<8 x half> undef, <8 x half> undef)
-; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 47 for instruction: %v16f16 = call <16 x half> @llvm.minimum.v16f16(<16 x half> undef, <16 x half> undef)
+; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v3f16 = call <3 x half> @llvm.minimum.v3f16(<3 x half> undef, <3 x half> undef)
+; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v4f16 = call <4 x half> @llvm.minimum.v4f16(<4 x half> undef, <4 x half> undef)
+; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v8f16 = call <8 x half> @llvm.minimum.v8f16(<8 x half> undef, <8 x half> undef)
+; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v16f16 = call <16 x half> @llvm.minimum.v16f16(<16 x half> undef, <16 x half> undef)
 ; GFX950-FASTF64-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: ret void
 ;
 ; GFX9-LABEL: 'minimum_f16'
 ; GFX9-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: %f16 = call half @llvm.minimum.f16(half undef, half undef)
 ; GFX9-NEXT:  Cost Model: Found an estimated cost of 21 for instruction: %v2f16 = call <2 x half> @llvm.minimum.v2f16(<2 x half> undef, <2 x half> undef)
-; GFX9-NEXT:  Cost Model: Found an estimated cost of 32 for instruction: %v3f16 = call <3 x half> @llvm.minimum.v3f16(<3 x half> undef, <3 x half> undef)
-; GFX9-NEXT:  Cost Model: Found an estimated cost of 43 for instruction: %v4f16 = call <4 x half> @llvm.minimum.v4f16(<4 x half> undef, <4 x half> undef)
-; GFX9-NEXT:  Cost Model: Found an estimated cost of 87 for instruction: %v8f16 = call <8 x half> @llvm.minimum.v8f16(<8 x half> undef, <8 x half> undef)
-; GFX9-NEXT:  Cost Model: Found an estimated cost of 175 for instruction: %v16f16 = call <16 x half> @llvm.minimum.v16f16(<16 x half> undef, <16 x half> undef)
+; GFX9-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v3f16 = call <3 x half> @llvm.minimum.v3f16(<3 x half> undef, <3 x half> undef)
+; GFX9-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v4f16 = call <4 x half> @llvm.minimum.v4f16(<4 x half> undef, <4 x half> undef)
+; GFX9-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v8f16 = call <8 x half> @llvm.minimum.v8f16(<8 x half> undef, <8 x half> undef)
+; GFX9-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v16f16 = call <16 x half> @llvm.minimum.v16f16(<16 x half> undef, <16 x half> undef)
 ; GFX9-NEXT:  Cost Model: Found an estimated cost of 10 for instruction: ret void
 ;
 ; SLOWF64-LABEL: 'minimum_f16'
@@ -38,10 +38,10 @@ define void @minimum_f16() {
 ; GFX9-SIZE-LABEL: 'minimum_f16'
 ; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %f16 = call half @llvm.minimum.f16(half undef, half undef)
 ; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 3 for instruction: %v2f16 = call <2 x half> @llvm.minimum.v2f16(<2 x half> undef, <2 x half> undef)
-; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 5 for instruction: %v3f16 = call <3 x half> @llvm.minimum.v3f16(<3 x half> undef, <3 x half> undef)
-; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 7 for instruction: %v4f16 = call <4 x half> @llvm.minimum.v4f16(<4 x half> undef, <4 x half> undef)
-; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 15 for instruction: %v8f16 = call <8 x half> @llvm.minimum.v8f16(<8 x half> undef, <8 x half> undef)
-; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 31 for instruction: %v16f16 = call <16 x half> @llvm.minimum.v16f16(<16 x half> undef, <16 x half> undef)
+; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v3f16 = call <3 x half> @llvm.minimum.v3f16(<3 x half> undef, <3 x half> undef)
+; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v4f16 = call <4 x half> @llvm.minimum.v4f16(<4 x half> undef, <4 x half> undef)
+; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v8f16 = call <8 x half> @llvm.minimum.v8f16(<8 x half> undef, <8 x half> undef)
+; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %v16f16 = call <16 x half> @llvm.minimum.v16f16(<16 x half> undef, <16 x half> undef)
 ; GFX9-SIZE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: ret void
 ;
 ; SLOW-SIZE-LABEL: 'minimum_f16'
diff --git a/llvm/test/CodeGen/AMDGPU/fmaximum3.ll b/llvm/test/CodeGen/AMDGPU/fmaximum3.ll
index 567202be69fa6..53d940e1e6c1a 100644
--- a/llvm/test/CodeGen/AMDGPU/fmaximum3.ll
+++ b/llvm/test/CodeGen/AMDGPU/fmaximum3.ll
@@ -2375,21 +2375,21 @@ define <3 x half> @v_fmaximum3_v3f16(<3 x half> %a, <3 x half> %b, <3 x half> %c
 ; GFX942-NEXT:    v_cndmask_b32_e32 v8, v7, v6, vcc
 ; GFX942-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
 ; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, v0, v2 src0_sel:WORD_1 src1_sel:WORD_1
-; GFX942-NEXT:    v_pk_max_f16 v2, v1, v3
-; GFX942-NEXT:    s_nop 0
+; GFX942-NEXT:    s_nop 1
 ; GFX942-NEXT:    v_cndmask_b32_e32 v0, v7, v6, vcc
+; GFX942-NEXT:    v_pk_max_f16 v6, v1, v3
 ; GFX942-NEXT:    v_cmp_o_f16_e32 vcc, v1, v3
-; GFX942-NEXT:    s_nop 1
-; GFX942-NEXT:    v_cndmask_b32_e32 v6, v7, v2, vcc
-; GFX942-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GFX942-NEXT:    v_perm_b32 v2, v0, v8, s0
+; GFX942-NEXT:    v_pk_max_f16 v2, v4, v2
+; GFX942-NEXT:    v_cndmask_b32_e32 v9, v7, v6, vcc
+; GFX942-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
 ; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, v1, v3 src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX942-NEXT:    s_nop 1
-; GFX942-NEXT:    v_cndmask_b32_e32 v1, v7, v2, vcc
-; GFX942-NEXT:    v_perm_b32 v1, v1, v6, s0
+; GFX942-NEXT:    v_cndmask_b32_e32 v1, v7, v6, vcc
+; GFX942-NEXT:    v_perm_b32 v1, v1, v9, s0
 ; GFX942-NEXT:    v_pk_max_f16 v1, v5, v1
-; GFX942-NEXT:    v_cmp_o_f16_e32 vcc, v5, v6
-; GFX942-NEXT:    v_perm_b32 v2, v0, v8, s0
-; GFX942-NEXT:    v_pk_max_f16 v2, v4, v2
+; GFX942-NEXT:    v_cmp_o_f16_e32 vcc, v5, v9
+; GFX942-NEXT:    s_nop 1
 ; GFX942-NEXT:    v_cndmask_b32_e32 v1, v7, v1, vcc
 ; GFX942-NEXT:    v_cmp_o_f16_e32 vcc, v4, v8
 ; GFX942-NEXT:    s_nop 1
@@ -2437,21 +2437,21 @@ define <3 x half> @v_fmaximum3_v3f16_commute(<3 x half> %a, <3 x half> %b, <3 x
 ; GFX942-NEXT:    v_cndmask_b32_e32 v8, v7, v6, vcc
 ; GFX942-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
 ; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, v0, v2 src0_sel:WORD_1 src1_sel:WORD_1
-; GFX942-NEXT:    v_pk_max_f16 v2, v1, v3
-; GFX942-NEXT:    s_nop 0
+; GFX942-NEXT:    s_nop 1
 ; GFX942-NEXT:    v_cndmask_b32_e32 v0, v7, v6, vcc
+; GFX942-NEXT:    v_pk_max_f16 v6, v1, v3
 ; GFX942-NEXT:    v_cmp_o_f16_e32 vcc, v1, v3
-; GFX942-NEXT:    s_nop 1
-; GFX942-NEXT:    v_cndmask_b32_e32 v6, v7, v2, vcc
-; GFX942-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GFX942-NEXT:    v_perm_b32 v2, v0, v8, s0
+; GFX942-NEXT:    v_pk_max_f16 v2, v2, v4
+; GFX942-NEXT:    v_cndmask_b32_e32 v9, v7, v6, vcc
+; GFX942-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
 ; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, v1, v3 src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX942-NEXT:    s_nop 1
-; GFX942-NEXT:    v_cndmask_b32_e32 v1, v7, v2, vcc
-; GFX942-NEXT:    v_perm_b32 v1, v1, v6, s0
+; GFX942-NEXT:    v_cndmask_b32_e32 v1, v7, v6, vcc
+; GFX942-NEXT:    v_perm_b32 v1, v1, v9, s0
 ; GFX942-NEXT:    v_pk_max_f16 v1, v1, v5
-; GFX942-NEXT:    v_cmp_o_f16_e32 vcc, v6, v5
-; GFX942-NEXT:    v_perm_b32 v2, v0, v8, s0
-; GFX942-NEXT:    v_pk_max_f16 v2, v2, v4
+; GFX942-NEXT:    v_cmp_o_f16_e32 vcc, v9, v5
+; GFX942-NEXT:    s_nop 1
 ; GFX942-NEXT:    v_cndmask_b32_e32 v1, v7, v1, vcc
 ; GFX942-NEXT:    v_cmp_o_f16_e32 vcc, v8, v4
 ; GFX942-NEXT:    s_nop 1
@@ -2500,40 +2500,40 @@ define <3 x half> @v_fmaximum3_v3f16__fabs_all(<3 x half> %a, <3 x half> %b, <3
 ; GFX942-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX942-NEXT:    v_and_b32_e32 v7, 0x7fff7fff, v1
 ; GFX942-NEXT:    v_and_b32_e32 v9, 0x7fff7fff, v3
+; GFX942-NEXT:    v_pk_max_f16 v7, v7, v9
 ; GFX942-NEXT:    v_and_b32_e32 v6, 0x7fff7fff, v0
 ; GFX942-NEXT:    v_and_b32_e32 v8, 0x7fff7fff, v2
-; GFX942-NEXT:    v_pk_max_f16 v7, v7, v9
-; GFX942-NEXT:    v_mov_b32_e32 v12, 0x7e00
 ; GFX942-NEXT:    v_lshrrev_b32_e32 v9, 16, v7
+; GFX942-NEXT:    v_mov_b32_e32 v12, 0x7e00
 ; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, |v1|, |v3| src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX942-NEXT:    v_pk_max_f16 v6, v6, v8
 ; GFX942-NEXT:    s_mov_b32 s0, 0x5040100
 ; GFX942-NEXT:    v_cndmask_b32_e32 v9, v12, v9, vcc
-; GFX942-NEXT:    v_lshrrev_b32_e32 v8, 16, v6
-; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, |v0|, |v2| src0_sel:WORD_1 src1_sel:WORD_1
+; GFX942-NEXT:    v_cmp_o_f16_e64 vcc, |v1|, |v3|
 ; GFX942-NEXT:    v_and_b32_e32 v11, 0x7fff7fff, v4
 ; GFX942-NEXT:    v_and_b32_e32 v10, 0x7fff7fff, v5
-; GFX942-NEXT:    v_cndmask_b32_e32 v8, v12, v8, vcc
-; GFX942-NEXT:    v_cmp_o_f16_e64 vcc, |v1|, |v3|
-; GFX942-NEXT:    s_nop 1
 ; GFX942-NEXT:    v_cndmask_b32_e32 v1, v12, v7, vcc
+; GFX942-NEXT:    v_lshrrev_b32_e32 v7, 16, v6
+; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, |v0|, |v2| src0_sel:WORD_1 src1_sel:WORD_1
+; GFX942-NEXT:    v_perm_b32 v3, v9, v1, s0
+; GFX942-NEXT:    v_pk_max_f16 v3, v3, v10
+; GFX942-NEXT:    v_cndmask_b32_e32 v7, v12, v7, vcc
 ; GFX942-NEXT:    v_cmp_o_f16_e64 vcc, |v0|, |v2|
 ; GFX942-NEXT:    s_nop 1
 ; GFX942-NEXT:    v_cndmask_b32_e32 v0, v12, v6, vcc
-; GFX942-NEXT:    v_perm_b32 v2, v8, v0, s0
+; GFX942-NEXT:    v_perm_b32 v2, v7, v0, s0
 ; GFX942-NEXT:    v_pk_max_f16 v2, v2, v11
-; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, v8, |v4| src0_sel:DWORD src1_sel:WORD_1
-; GFX942-NEXT:    v_lshrrev_b32_e32 v3, 16, v2
-; GFX942-NEXT:    v_perm_b32 v6, v9, v1, s0
-; GFX942-NEXT:    v_cndmask_b32_e32 v3, v12, v3, vcc
-; GFX942-NEXT:    v_pk_max_f16 v6, v6, v10
+; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, v7, |v4| src0_sel:DWORD src1_sel:WORD_1
+; GFX942-NEXT:    v_lshrrev_b32_e32 v6, 16, v2
+; GFX942-NEXT:    s_nop 0
+; GFX942-NEXT:    v_cndmask_b32_e32 v6, v12, v6, vcc
 ; GFX942-NEXT:    v_cmp_o_f16_e64 vcc, v1, |v5|
 ; GFX942-NEXT:    s_nop 1
-; GFX942-NEXT:    v_cndmask_b32_e32 v1, v12, v6, vcc
+; GFX942-NEXT:    v_cndmask_b32_e32 v1, v12, v3, vcc
 ; GFX942-NEXT:    v_cmp_o_f16_e64 vcc, v0, |v4|
 ; GFX942-NEXT:    s_nop 1
 ; GFX942-NEXT:    v_cndmask_b32_e32 v0, v12, v2, vcc
-; GFX942-NEXT:    v_perm_b32 v0, v3, v0, s0
+; GFX942-NEXT:    v_perm_b32 v0, v6, v0, s0
 ; GFX942-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX950-LABEL: v_fmaximum3_v3f16__fabs_all:
@@ -2582,21 +2582,21 @@ define <3 x half> @v_fmaximum3_v3f16__fneg_all(<3 x half> %a, <3 x half> %b, <3
 ; GFX942-NEXT:    v_cndmask_b32_e32 v8, v7, v6, vcc
 ; GFX942-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
 ; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, -v0, -v2 src0_sel:WORD_1 src1_sel:WORD_1
-; GFX942-NEXT:    v_pk_max_f16 v2, v1, v3 neg_lo:[1,1] neg_hi:[1,1]
-; GFX942-NEXT:    s_nop 0
+; GFX942-NEXT:    s_nop 1
 ; GFX942-NEXT:    v_cndmask_b32_e32 v0, v7, v6, vcc
+; GFX942-NEXT:    v_pk_max_f16 v6, v1, v3 neg_lo:[1,1] neg_hi:[1,1]
 ; GFX942-NEXT:    v_cmp_o_f16_e64 vcc, -v1, -v3
-; GFX942-NEXT:    s_nop 1
-; GFX942-NEXT:    v_cndmask_b32_e32 v6, v7, v2, vcc
-; GFX942-NEXT:    v_lshrrev_b32_e32 v2, 16, v2
+; GFX942-NEXT:    v_perm_b32 v2, v0, v8, s0
+; GFX942-NEXT:    v_pk_max_f16 v2, v2, v4 neg_lo:[0,1] neg_hi:[0,1]
+; GFX942-NEXT:    v_cndmask_b32_e32 v9, v7, v6, vcc
+; GFX942-NEXT:    v_lshrrev_b32_e32 v6, 16, v6
 ; GFX942-NEXT:    v_cmp_o_f16_sdwa vcc, -v1, -v3 src0_sel:WORD_1 src1_sel:WORD_1
 ; GFX942-NEXT:    s_nop 1
-; GFX942-NEXT:    v_cndmask_b32_e32 v1, v7, v2, vcc
-; GFX942-NEXT:    v_perm_b32 v1, v1, v6, s0
+; GFX942-NEXT:    v_cndmask_b32_e32 v1, v7, v6, vcc
+; GFX942-NEXT:    v_perm_b32 v1, v1, v9, s0
 ; GFX942-NEXT:    v_pk_max_f16 v1, v1, v5 neg_lo:[0,1] neg_hi:[0,1]
-; GFX942-NEXT:    v_cmp_o_f16_e64 vcc, v6, -v5
-; GFX942-NEXT:    v_perm_b32 v2, v0, v8, s0
-; GFX942-NEXT:    v_pk_max_f16 v2, v2, v4 neg_lo:[0,1] neg_hi:[0,1]
+; GFX942-NEXT:    v_cmp_o_f16_e64 vcc, v9, -v5
+; GFX942-NEXT:    s_nop 1
 ; GFX942-NEXT:    v_cndmask_b32_e32 v1, v7, v1, vcc
 ; GFX942-NEXT:    v_cmp_o_f16_e64 vcc, v8, -v4
 ; GFX942-NEXT:    s_nop 1
@@ -2643,22 +2643,21 @@ define <3 x half> @v_fmaximum3_v3f16__inlineimm1(<3 x half> %a, <3 x half> %c) {
 ; GFX942-NEXT:    v_pk_max_f16 v4, v0, 2.0 op_sel_hi:[1,0]
 ; GFX942-NEXT:    v_mov_b32_e32 v5, 0x7e00
 ; GFX942-NEXT:    v_cndmask_b32_sdwa v6, v5, v4, vcc dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
-; GFX942-NEX...
[truncated]

Contributor

@arsenm arsenm left a comment

This was already upstreamed in e57b327

; GFX942-NEXT: v_cmp_o_f16_e32 vcc, v4, v4
; GFX942-NEXT: v_cndmask_b32_sdwa v1, v5, v4, vcc dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_1
; GFX942-NEXT: v_perm_b32 v3, v1, v7, s0
; GFX942-NEXT: v_pk_min_f16 v3, v3, 4.0 op_sel_hi:[1,0]
Contributor

The test changes are all for gfx942, so something else is going on here.

Collaborator Author

It now exactly matches downstream though.

Collaborator Author

In this version it is not even gfx950-specific.
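
For readers following the GFX942 check lines in this thread: each `v_pk_max_f16` is paired with a `v_cmp_o_f16` and a `v_cndmask_b32` that selects the constant `0x7e00` (the f16 quiet NaN) when the compare fails. A minimal per-lane sketch of that pattern follows. It assumes `float` in place of half, uses a hypothetical helper name `maximumViaSelect` that is not an LLVM API, and does not model signed-zero ordering.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper (not an LLVM API): models one lane of llvm.maximum
// as a plain max plus an "ordered" check, mirroring the
// v_pk_max_f16 / v_cmp_o_f16 / v_cndmask_b32 sequence in the checks.
// Assumption: float stands in for half to keep the sketch portable.
inline float maximumViaSelect(float a, float b) {
  const float quietNaN = std::numeric_limits<float>::quiet_NaN();
  // v_cmp_o: true only when neither operand is NaN.
  const bool ordered = !std::isnan(a) && !std::isnan(b);
  // v_pk_max-like step: the result used when the operands are ordered.
  const float plainMax = std::fmax(a, b);
  // v_cndmask: keep the max if ordered, otherwise pick the NaN constant.
  return ordered ? plainMax : quietNaN;
}
```

With this shape, `maximumViaSelect(1.0f, 2.0f)` yields `2.0f`, and any NaN operand yields NaN.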

    if (Subtarget->hasMinimum3Maximum3F32())
      setOperationAction({ISD::FMAXIMUM, ISD::FMINIMUM}, MVT::f32, Legal);

    if (Subtarget->hasMinimum3Maximum3PKF16()) {
Contributor

The same code is added at L888, which is under the else of if (Subtarget->hasVOP3PInsts()). We should probably remove the code at L888 and L892.

Collaborator Author

Yes, it was misplaced. Having f16 and min3/max3 without VOP3P is simply an impossible combination. I have removed the whole else block and nothing changed.
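
The action selection in the `SIISelLowering.cpp` hunk quoted in the truncated diff can be modeled as a small standalone function. This is only a sketch of that hunk as shown on this page. The merged code may differ, since the author reports removing a misplaced else block. `SubtargetModel`, `Type`, `Action`, and `fmaximumAction` are hypothetical stand-ins, not LLVM types, and `Unchanged` means "not set by this hunk".

```cpp
// Hypothetical stand-ins for illustration; these are not LLVM types.
enum class Action { Unchanged, Legal, Custom };
enum class Type { f16, f32, f64, v2f16, v4f16, v8f16, v16f16, v32f16 };

struct SubtargetModel {
  bool hasIEEEMinMax = false;
  bool hasMinimum3Maximum3F32 = false;
  bool hasMinimum3Maximum3PKF16 = false;
  bool hasMinimum3Maximum3F16 = false;
  bool hasVOP3PInsts = false;
};

// Returns the action the quoted hunk would assign to FMAXIMUM/FMINIMUM
// for the given type, applying the branches in the same order as the diff.
inline Action fmaximumAction(const SubtargetModel &ST, Type Ty) {
  Action A = Action::Unchanged;
  if (ST.hasIEEEMinMax) {
    if (Ty == Type::f16 || Ty == Type::f32 || Ty == Type::f64 ||
        Ty == Type::v2f16)
      A = Action::Legal;
  } else {
    if (ST.hasMinimum3Maximum3F32 && Ty == Type::f32)
      A = Action::Legal;
    if (ST.hasMinimum3Maximum3PKF16) {
      if (Ty == Type::v2f16)
        A = Action::Legal;
      // Only the packed form exists: scalar f16 needs custom widening.
      if (!ST.hasMinimum3Maximum3F16 && Ty == Type::f16)
        A = Action::Custom;
    }
  }
  // Wider f16 vectors are split into v2f16 pieces rather than scalarized.
  if (ST.hasVOP3PInsts &&
      (Ty == Type::v4f16 || Ty == Type::v8f16 || Ty == Type::v16f16 ||
       Ty == Type::v32f16))
    A = Action::Custom;
  return A;
}
```

In this model, a subtarget with only `hasVOP3PInsts` set gets `Custom` for `v4f16` and leaves `f32` untouched, which matches the retitled scope of the change (vector types with VOP3P rather than gfx950 only).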

@rampitec
Collaborator Author

rampitec commented May 8, 2025

> This was already upstreamed in e57b327

Only for f32.

@rampitec rampitec force-pushed the users/rampitec/05-07-_amdgpu_legalize_fminimum_and_fmaximum_for_gfx950 branch from 8bad4c6 to 6d72957 Compare May 8, 2025 19:36
@rampitec rampitec changed the title [AMDGPU] Legalize fminimum and fmaximum for gfx950 [AMDGPU] Legalize vector fminimum and fmaximum with VOP3P May 8, 2025
@rampitec rampitec merged commit d2c5fbe into main May 9, 2025
9 of 11 checks passed
@rampitec rampitec deleted the users/rampitec/05-07-_amdgpu_legalize_fminimum_and_fmaximum_for_gfx950 branch May 9, 2025 05:31