Skip to content

AMDGPU][True16][CodeGen] FP_Round f64 to f16 in true16 #128911

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 8, 2025

Conversation

broxigarchen
Copy link
Contributor

@broxigarchen broxigarchen commented Feb 26, 2025

Update the f64 to f16 lowering for targets which support f16 types.

For unsafe mode, lowered to two FP_ROUND. (This patch https://reviews.llvm.org/D154528 stops from combining these two FP_ROUND back). In safe mode, select LowerF64ToF16 (round-to-nearest-even rounding mode)

@broxigarchen broxigarchen changed the title gisel lower fp64 to fp16 AMDGPU][True16][CodeGen] lower FP_Round in true16 Mar 6, 2025
@broxigarchen broxigarchen changed the title AMDGPU][True16][CodeGen] lower FP_Round in true16 AMDGPU][True16][CodeGen] fptrunc f64 to f16 in true16 Mar 6, 2025
@broxigarchen broxigarchen marked this pull request as ready for review March 6, 2025 16:46
@broxigarchen broxigarchen requested a review from arsenm March 6, 2025 16:46
@llvmbot
Copy link
Member

llvmbot commented Mar 6, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Brox Chen (broxigarchen)

Changes

fptrunc from f64 to f16 in true16 mode


Patch is 62.05 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/128911.diff

5 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp (+10-3)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h (+3)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+11)
  • (modified) llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll (+663-327)
  • (modified) llvm/test/CodeGen/AMDGPU/fptrunc.ll (+51-25)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index ade81f17ecca5..dca2983f94be8 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -3578,15 +3578,22 @@ SDValue AMDGPUTargetLowering::LowerFP_TO_FP16(SDValue Op, SelectionDAG &DAG) con
     return SDValue();
   }
 
-  assert(N0.getSimpleValueType() == MVT::f64);
+  return LowerF64ToF16(N0, Op.getValueType(), DL, DAG);
+}
+
+SDValue AMDGPUTargetLowering::LowerF64ToF16(SDValue Src, EVT ResTy,
+                                            const SDLoc &DL,
+                                            SelectionDAG &DAG) const {
+  assert(Src.getSimpleValueType() == MVT::f64);
 
   // f64 -> f16 conversion using round-to-nearest-even rounding mode.
+  // TODO: We can generate better code for True16.
   const unsigned ExpMask = 0x7ff;
   const unsigned ExpBiasf64 = 1023;
   const unsigned ExpBiasf16 = 15;
   SDValue Zero = DAG.getConstant(0, DL, MVT::i32);
   SDValue One = DAG.getConstant(1, DL, MVT::i32);
-  SDValue U = DAG.getNode(ISD::BITCAST, DL, MVT::i64, N0);
+  SDValue U = DAG.getNode(ISD::BITCAST, DL, MVT::i64, Src);
   SDValue UH = DAG.getNode(ISD::SRL, DL, MVT::i64, U,
                            DAG.getConstant(32, DL, MVT::i64));
   UH = DAG.getZExtOrTrunc(UH, DL, MVT::i32);
@@ -3661,7 +3668,7 @@ SDValue AMDGPUTargetLowering::LowerFP_TO_FP16(SDValue Op, SelectionDAG &DAG) con
                      DAG.getConstant(0x8000, DL, MVT::i32));
 
   V = DAG.getNode(ISD::OR, DL, MVT::i32, Sign, V);
-  return DAG.getZExtOrTrunc(V, DL, Op.getValueType());
+  return DAG.getZExtOrTrunc(V, DL, ResTy);
 }
 
 SDValue AMDGPUTargetLowering::LowerFP_TO_INT(const SDValue Op,
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
index c74dc7942f52c..7b73b7b6a7127 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
@@ -97,6 +97,9 @@ class AMDGPUTargetLowering : public TargetLowering {
   SDValue LowerFP_TO_FP16(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerFP_TO_INT(SDValue Op, SelectionDAG &DAG) const;
 
+  SDValue LowerF64ToF16(SDValue Src, EVT ResTy, const SDLoc &DL,
+                        SelectionDAG &DAG) const;
+
   SDValue LowerSIGN_EXTEND_INREG(SDValue Op, SelectionDAG &DAG) const;
 
 protected:
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index fe095414e5172..1c9c8422e2257 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -6825,6 +6825,17 @@ SDValue SITargetLowering::lowerFP_ROUND(SDValue Op, SelectionDAG &DAG) const {
 
   SDLoc DL(Op);
 
+  if (Subtarget->useRealTrue16Insts()) {
+    if (getTargetMachine().Options.UnsafeFPMath) {
+      SDValue Flags = Op.getOperand(1);
+      SDValue Src32 = DAG.getNode(ISD::FP_ROUND, DL, MVT::f32, Src, Flags);
+      return DAG.getNode(ISD::FP_ROUND, DL, MVT::f16, Src32, Flags);
+    }
+
+    SDValue FpToFp16 = LowerF64ToF16(Src, MVT::i16, DL, DAG);
+    return DAG.getNode(ISD::BITCAST, DL, MVT::f16, FpToFp16);
+  }
+
   SDValue FpToFp16 = DAG.getNode(ISD::FP_TO_FP16, DL, MVT::i32, Src);
   SDValue Trunc = DAG.getNode(ISD::TRUNCATE, DL, MVT::i16, FpToFp16);
   return DAG.getNode(ISD::BITCAST, DL, MVT::f16, Trunc);
diff --git a/llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll b/llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll
index 188b2ada64686..9057c81cbd555 100644
--- a/llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll
+++ b/llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll
@@ -7,8 +7,10 @@
 ; RUN: llc -amdgpu-scalarize-global-loads=false -mtriple=amdgcn -mcpu=gfx900 -global-isel=1 -mattr=-flat-for-global -denormal-fp-math=preserve-sign -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -enable-var-scope -check-prefixes=GFX9-GISEL %s
 ; RUN: llc -amdgpu-scalarize-global-loads=false -mtriple=amdgcn -mcpu=gfx950 -global-isel=0 -mattr=-flat-for-global -denormal-fp-math=preserve-sign -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -enable-var-scope -check-prefixes=GFX950-SDAG %s
 ; RUN: llc -amdgpu-scalarize-global-loads=false -mtriple=amdgcn -mcpu=gfx950 -global-isel=1 -mattr=-flat-for-global -denormal-fp-math=preserve-sign -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -enable-var-scope -check-prefixes=GFX950-GISEL %s
-; RUN: llc -amdgpu-scalarize-global-loads=false -mtriple=amdgcn -mcpu=gfx1100 -global-isel=0 -mattr=-flat-for-global -denormal-fp-math=preserve-sign -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -enable-var-scope -check-prefixes=GFX11-SDAG %s
-; RUN: llc -amdgpu-scalarize-global-loads=false -mtriple=amdgcn -mcpu=gfx1100 -global-isel=1 -mattr=-flat-for-global -denormal-fp-math=preserve-sign -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -enable-var-scope -check-prefixes=GFX11-GISEL %s
+; RUN: llc -amdgpu-scalarize-global-loads=false -mtriple=amdgcn -mcpu=gfx1100 -global-isel=0 -mattr=-flat-for-global,+real-true16 -denormal-fp-math=preserve-sign -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -enable-var-scope -check-prefixes=GFX11-SDAG-TRUE16 %s
+; RUN: llc -amdgpu-scalarize-global-loads=false -mtriple=amdgcn -mcpu=gfx1100 -global-isel=0 -mattr=-flat-for-global,-real-true16 -denormal-fp-math=preserve-sign -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -enable-var-scope -check-prefixes=GFX11-SDAG-FAKE16 %s
+; RUN: llc -amdgpu-scalarize-global-loads=false -mtriple=amdgcn -mcpu=gfx1100 -global-isel=1 -mattr=-flat-for-global,+real-true16 -denormal-fp-math=preserve-sign -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -enable-var-scope -check-prefixes=GFX11-GISEL-TRUE16 %s
+; RUN: llc -amdgpu-scalarize-global-loads=false -mtriple=amdgcn -mcpu=gfx1100 -global-isel=1 -mattr=-flat-for-global,-real-true16 -denormal-fp-math=preserve-sign -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -enable-var-scope -check-prefixes=GFX11-GISEL-FAKE16 %s
 
 define amdgpu_kernel void @fptrunc_f32_to_f16(
 ; SI-SDAG-LABEL: fptrunc_f32_to_f16:
@@ -131,35 +133,65 @@ define amdgpu_kernel void @fptrunc_f32_to_f16(
 ; GFX950-GISEL-NEXT:    buffer_store_short v0, off, s[0:3], 0
 ; GFX950-GISEL-NEXT:    s_endpgm
 ;
-; GFX11-SDAG-LABEL: fptrunc_f32_to_f16:
-; GFX11-SDAG:       ; %bb.0: ; %entry
-; GFX11-SDAG-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
-; GFX11-SDAG-NEXT:    s_mov_b32 s6, -1
-; GFX11-SDAG-NEXT:    s_mov_b32 s7, 0x31016000
-; GFX11-SDAG-NEXT:    s_mov_b32 s10, s6
-; GFX11-SDAG-NEXT:    s_mov_b32 s11, s7
-; GFX11-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-SDAG-NEXT:    s_mov_b32 s8, s2
-; GFX11-SDAG-NEXT:    s_mov_b32 s9, s3
-; GFX11-SDAG-NEXT:    s_mov_b32 s4, s0
-; GFX11-SDAG-NEXT:    buffer_load_b32 v0, off, s[8:11], 0
-; GFX11-SDAG-NEXT:    s_mov_b32 s5, s1
-; GFX11-SDAG-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-SDAG-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GFX11-SDAG-NEXT:    buffer_store_b16 v0, off, s[4:7], 0
-; GFX11-SDAG-NEXT:    s_endpgm
-;
-; GFX11-GISEL-LABEL: fptrunc_f32_to_f16:
-; GFX11-GISEL:       ; %bb.0: ; %entry
-; GFX11-GISEL-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
-; GFX11-GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-GISEL-NEXT:    s_load_b32 s2, s[2:3], 0x0
-; GFX11-GISEL-NEXT:    s_mov_b32 s3, 0x31016000
-; GFX11-GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-GISEL-NEXT:    v_cvt_f16_f32_e32 v0, s2
-; GFX11-GISEL-NEXT:    s_mov_b32 s2, -1
-; GFX11-GISEL-NEXT:    buffer_store_b16 v0, off, s[0:3], 0
-; GFX11-GISEL-NEXT:    s_endpgm
+; GFX11-SDAG-TRUE16-LABEL: fptrunc_f32_to_f16:
+; GFX11-SDAG-TRUE16:       ; %bb.0: ; %entry
+; GFX11-SDAG-TRUE16-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s6, -1
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s7, 0x31016000
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s10, s6
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s11, s7
+; GFX11-SDAG-TRUE16-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s8, s2
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s9, s3
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s4, s0
+; GFX11-SDAG-TRUE16-NEXT:    buffer_load_b32 v0, off, s[8:11], 0
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s5, s1
+; GFX11-SDAG-TRUE16-NEXT:    s_waitcnt vmcnt(0)
+; GFX11-SDAG-TRUE16-NEXT:    v_cvt_f16_f32_e32 v0.l, v0
+; GFX11-SDAG-TRUE16-NEXT:    buffer_store_b16 v0, off, s[4:7], 0
+; GFX11-SDAG-TRUE16-NEXT:    s_endpgm
+;
+; GFX11-SDAG-FAKE16-LABEL: fptrunc_f32_to_f16:
+; GFX11-SDAG-FAKE16:       ; %bb.0: ; %entry
+; GFX11-SDAG-FAKE16-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s6, -1
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s7, 0x31016000
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s10, s6
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s11, s7
+; GFX11-SDAG-FAKE16-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s8, s2
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s9, s3
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s4, s0
+; GFX11-SDAG-FAKE16-NEXT:    buffer_load_b32 v0, off, s[8:11], 0
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s5, s1
+; GFX11-SDAG-FAKE16-NEXT:    s_waitcnt vmcnt(0)
+; GFX11-SDAG-FAKE16-NEXT:    v_cvt_f16_f32_e32 v0, v0
+; GFX11-SDAG-FAKE16-NEXT:    buffer_store_b16 v0, off, s[4:7], 0
+; GFX11-SDAG-FAKE16-NEXT:    s_endpgm
+;
+; GFX11-GISEL-TRUE16-LABEL: fptrunc_f32_to_f16:
+; GFX11-GISEL-TRUE16:       ; %bb.0: ; %entry
+; GFX11-GISEL-TRUE16-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-GISEL-TRUE16-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-GISEL-TRUE16-NEXT:    s_load_b32 s2, s[2:3], 0x0
+; GFX11-GISEL-TRUE16-NEXT:    s_mov_b32 s3, 0x31016000
+; GFX11-GISEL-TRUE16-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-GISEL-TRUE16-NEXT:    v_cvt_f16_f32_e32 v0.l, s2
+; GFX11-GISEL-TRUE16-NEXT:    s_mov_b32 s2, -1
+; GFX11-GISEL-TRUE16-NEXT:    buffer_store_b16 v0, off, s[0:3], 0
+; GFX11-GISEL-TRUE16-NEXT:    s_endpgm
+;
+; GFX11-GISEL-FAKE16-LABEL: fptrunc_f32_to_f16:
+; GFX11-GISEL-FAKE16:       ; %bb.0: ; %entry
+; GFX11-GISEL-FAKE16-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-GISEL-FAKE16-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-GISEL-FAKE16-NEXT:    s_load_b32 s2, s[2:3], 0x0
+; GFX11-GISEL-FAKE16-NEXT:    s_mov_b32 s3, 0x31016000
+; GFX11-GISEL-FAKE16-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-GISEL-FAKE16-NEXT:    v_cvt_f16_f32_e32 v0, s2
+; GFX11-GISEL-FAKE16-NEXT:    s_mov_b32 s2, -1
+; GFX11-GISEL-FAKE16-NEXT:    buffer_store_b16 v0, off, s[0:3], 0
+; GFX11-GISEL-FAKE16-NEXT:    s_endpgm
     ptr addrspace(1) %r,
     ptr addrspace(1) %a) {
 entry:
@@ -298,39 +330,73 @@ define amdgpu_kernel void @fptrunc_f64_to_f16(
 ; GFX950-GISEL-NEXT:    buffer_store_short v0, off, s[0:3], 0
 ; GFX950-GISEL-NEXT:    s_endpgm
 ;
-; GFX11-SDAG-LABEL: fptrunc_f64_to_f16:
-; GFX11-SDAG:       ; %bb.0: ; %entry
-; GFX11-SDAG-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
-; GFX11-SDAG-NEXT:    s_mov_b32 s6, -1
-; GFX11-SDAG-NEXT:    s_mov_b32 s7, 0x31016000
-; GFX11-SDAG-NEXT:    s_mov_b32 s10, s6
-; GFX11-SDAG-NEXT:    s_mov_b32 s11, s7
-; GFX11-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-SDAG-NEXT:    s_mov_b32 s8, s2
-; GFX11-SDAG-NEXT:    s_mov_b32 s9, s3
-; GFX11-SDAG-NEXT:    s_mov_b32 s4, s0
-; GFX11-SDAG-NEXT:    buffer_load_b64 v[0:1], off, s[8:11], 0
-; GFX11-SDAG-NEXT:    s_mov_b32 s5, s1
-; GFX11-SDAG-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-SDAG-NEXT:    v_cvt_f32_f64_e32 v0, v[0:1]
-; GFX11-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX11-SDAG-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GFX11-SDAG-NEXT:    buffer_store_b16 v0, off, s[4:7], 0
-; GFX11-SDAG-NEXT:    s_endpgm
-;
-; GFX11-GISEL-LABEL: fptrunc_f64_to_f16:
-; GFX11-GISEL:       ; %bb.0: ; %entry
-; GFX11-GISEL-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
-; GFX11-GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-GISEL-NEXT:    s_load_b64 s[2:3], s[2:3], 0x0
-; GFX11-GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-GISEL-NEXT:    v_cvt_f32_f64_e32 v0, s[2:3]
-; GFX11-GISEL-NEXT:    s_mov_b32 s2, -1
-; GFX11-GISEL-NEXT:    s_mov_b32 s3, 0x31016000
-; GFX11-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX11-GISEL-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GFX11-GISEL-NEXT:    buffer_store_b16 v0, off, s[0:3], 0
-; GFX11-GISEL-NEXT:    s_endpgm
+; GFX11-SDAG-TRUE16-LABEL: fptrunc_f64_to_f16:
+; GFX11-SDAG-TRUE16:       ; %bb.0: ; %entry
+; GFX11-SDAG-TRUE16-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s6, -1
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s7, 0x31016000
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s10, s6
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s11, s7
+; GFX11-SDAG-TRUE16-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s8, s2
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s9, s3
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s4, s0
+; GFX11-SDAG-TRUE16-NEXT:    buffer_load_b64 v[0:1], off, s[8:11], 0
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s5, s1
+; GFX11-SDAG-TRUE16-NEXT:    s_waitcnt vmcnt(0)
+; GFX11-SDAG-TRUE16-NEXT:    v_cvt_f32_f64_e32 v0, v[0:1]
+; GFX11-SDAG-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-SDAG-TRUE16-NEXT:    v_cvt_f16_f32_e32 v0.l, v0
+; GFX11-SDAG-TRUE16-NEXT:    buffer_store_b16 v0, off, s[4:7], 0
+; GFX11-SDAG-TRUE16-NEXT:    s_endpgm
+;
+; GFX11-SDAG-FAKE16-LABEL: fptrunc_f64_to_f16:
+; GFX11-SDAG-FAKE16:       ; %bb.0: ; %entry
+; GFX11-SDAG-FAKE16-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s6, -1
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s7, 0x31016000
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s10, s6
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s11, s7
+; GFX11-SDAG-FAKE16-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s8, s2
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s9, s3
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s4, s0
+; GFX11-SDAG-FAKE16-NEXT:    buffer_load_b64 v[0:1], off, s[8:11], 0
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s5, s1
+; GFX11-SDAG-FAKE16-NEXT:    s_waitcnt vmcnt(0)
+; GFX11-SDAG-FAKE16-NEXT:    v_cvt_f32_f64_e32 v0, v[0:1]
+; GFX11-SDAG-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-SDAG-FAKE16-NEXT:    v_cvt_f16_f32_e32 v0, v0
+; GFX11-SDAG-FAKE16-NEXT:    buffer_store_b16 v0, off, s[4:7], 0
+; GFX11-SDAG-FAKE16-NEXT:    s_endpgm
+;
+; GFX11-GISEL-TRUE16-LABEL: fptrunc_f64_to_f16:
+; GFX11-GISEL-TRUE16:       ; %bb.0: ; %entry
+; GFX11-GISEL-TRUE16-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-GISEL-TRUE16-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-GISEL-TRUE16-NEXT:    s_load_b64 s[2:3], s[2:3], 0x0
+; GFX11-GISEL-TRUE16-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-GISEL-TRUE16-NEXT:    v_cvt_f32_f64_e32 v0, s[2:3]
+; GFX11-GISEL-TRUE16-NEXT:    s_mov_b32 s2, -1
+; GFX11-GISEL-TRUE16-NEXT:    s_mov_b32 s3, 0x31016000
+; GFX11-GISEL-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-GISEL-TRUE16-NEXT:    v_cvt_f16_f32_e32 v0.l, v0
+; GFX11-GISEL-TRUE16-NEXT:    buffer_store_b16 v0, off, s[0:3], 0
+; GFX11-GISEL-TRUE16-NEXT:    s_endpgm
+;
+; GFX11-GISEL-FAKE16-LABEL: fptrunc_f64_to_f16:
+; GFX11-GISEL-FAKE16:       ; %bb.0: ; %entry
+; GFX11-GISEL-FAKE16-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-GISEL-FAKE16-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-GISEL-FAKE16-NEXT:    s_load_b64 s[2:3], s[2:3], 0x0
+; GFX11-GISEL-FAKE16-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-GISEL-FAKE16-NEXT:    v_cvt_f32_f64_e32 v0, s[2:3]
+; GFX11-GISEL-FAKE16-NEXT:    s_mov_b32 s2, -1
+; GFX11-GISEL-FAKE16-NEXT:    s_mov_b32 s3, 0x31016000
+; GFX11-GISEL-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-GISEL-FAKE16-NEXT:    v_cvt_f16_f32_e32 v0, v0
+; GFX11-GISEL-FAKE16-NEXT:    buffer_store_b16 v0, off, s[0:3], 0
+; GFX11-GISEL-FAKE16-NEXT:    s_endpgm
     ptr addrspace(1) %r,
     ptr addrspace(1) %a) {
 entry:
@@ -477,41 +543,77 @@ define amdgpu_kernel void @fptrunc_v2f32_to_v2f16(
 ; GFX950-GISEL-NEXT:    buffer_store_dword v0, off, s[0:3], 0
 ; GFX950-GISEL-NEXT:    s_endpgm
 ;
-; GFX11-SDAG-LABEL: fptrunc_v2f32_to_v2f16:
-; GFX11-SDAG:       ; %bb.0: ; %entry
-; GFX11-SDAG-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
-; GFX11-SDAG-NEXT:    s_mov_b32 s6, -1
-; GFX11-SDAG-NEXT:    s_mov_b32 s7, 0x31016000
-; GFX11-SDAG-NEXT:    s_mov_b32 s10, s6
-; GFX11-SDAG-NEXT:    s_mov_b32 s11, s7
-; GFX11-SDAG-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-SDAG-NEXT:    s_mov_b32 s8, s2
-; GFX11-SDAG-NEXT:    s_mov_b32 s9, s3
-; GFX11-SDAG-NEXT:    s_mov_b32 s4, s0
-; GFX11-SDAG-NEXT:    buffer_load_b64 v[0:1], off, s[8:11], 0
-; GFX11-SDAG-NEXT:    s_mov_b32 s5, s1
-; GFX11-SDAG-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-SDAG-NEXT:    v_cvt_f16_f32_e32 v1, v1
-; GFX11-SDAG-NEXT:    v_cvt_f16_f32_e32 v0, v0
-; GFX11-SDAG-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX11-SDAG-NEXT:    v_pack_b32_f16 v0, v0, v1
-; GFX11-SDAG-NEXT:    buffer_store_b32 v0, off, s[4:7], 0
-; GFX11-SDAG-NEXT:    s_endpgm
-;
-; GFX11-GISEL-LABEL: fptrunc_v2f32_to_v2f16:
-; GFX11-GISEL:       ; %bb.0: ; %entry
-; GFX11-GISEL-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
-; GFX11-GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-GISEL-NEXT:    s_load_b64 s[2:3], s[2:3], 0x0
-; GFX11-GISEL-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-GISEL-NEXT:    v_cvt_f16_f32_e32 v0, s2
-; GFX11-GISEL-NEXT:    v_cvt_f16_f32_e32 v1, s3
-; GFX11-GISEL-NEXT:    s_mov_b32 s2, -1
-; GFX11-GISEL-NEXT:    s_mov_b32 s3, 0x31016000
-; GFX11-GISEL-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX11-GISEL-NEXT:    v_pack_b32_f16 v0, v0, v1
-; GFX11-GISEL-NEXT:    buffer_store_b32 v0, off, s[0:3], 0
-; GFX11-GISEL-NEXT:    s_endpgm
+; GFX11-SDAG-TRUE16-LABEL: fptrunc_v2f32_to_v2f16:
+; GFX11-SDAG-TRUE16:       ; %bb.0: ; %entry
+; GFX11-SDAG-TRUE16-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s6, -1
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s7, 0x31016000
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s10, s6
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s11, s7
+; GFX11-SDAG-TRUE16-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s8, s2
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s9, s3
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s4, s0
+; GFX11-SDAG-TRUE16-NEXT:    buffer_load_b64 v[1:2], off, s[8:11], 0
+; GFX11-SDAG-TRUE16-NEXT:    s_mov_b32 s5, s1
+; GFX11-SDAG-TRUE16-NEXT:    s_waitcnt vmcnt(0)
+; GFX11-SDAG-TRUE16-NEXT:    v_cvt_f16_f32_e32 v0.l, v2
+; GFX11-SDAG-TRUE16-NEXT:    v_cvt_f16_f32_e32 v0.h, v1
+; GFX11-SDAG-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-SDAG-TRUE16-NEXT:    v_pack_b32_f16 v0, v0.h, v0.l
+; GFX11-SDAG-TRUE16-NEXT:    buffer_store_b32 v0, off, s[4:7], 0
+; GFX11-SDAG-TRUE16-NEXT:    s_endpgm
+;
+; GFX11-SDAG-FAKE16-LABEL: fptrunc_v2f32_to_v2f16:
+; GFX11-SDAG-FAKE16:       ; %bb.0: ; %entry
+; GFX11-SDAG-FAKE16-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s6, -1
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s7, 0x31016000
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s10, s6
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s11, s7
+; GFX11-SDAG-FAKE16-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s8, s2
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s9, s3
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s4, s0
+; GFX11-SDAG-FAKE16-NEXT:    buffer_load_b64 v[0:1], off, s[8:11], 0
+; GFX11-SDAG-FAKE16-NEXT:    s_mov_b32 s5, s1
+; GFX11-SDAG-FAKE16-NEXT:    s_waitcnt vmcnt(0)
+; GFX11-SDAG-FAKE16-NEXT:    v_cvt_f16_f32_e32 v1, v1
+; GFX11-SDAG-FAKE16-NEXT:    v_cvt_f16_f32_e32 v0, v0
+; GFX11-SDAG-FAKE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-SDAG-FAKE16-NEXT:    v_pack_b32_f16 v0, v0, v1
+; GFX11-SDAG-FAKE16-NEXT:    buffer_store_b32 v0, off, s[4:7], 0
+; GFX11-SDAG-FAKE16-NEXT:    s_endpgm
+;
+; GFX11-GISEL-TRUE16-LABEL: fptrunc_v2f32_to_v2f16:
+; GFX11-GISEL-TRUE16:       ; %bb.0: ; %entry
+; GFX11-GISEL-TRUE16-NEXT:    s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-GISEL-TRUE16-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-GISEL-TRUE16-NEXT:    s_load_b64 s[2:3], s[2:3], 0x0
+; GFX11-GISEL-TRUE16-NEXT:    s_waitcnt lgkmcnt(0)
+; GFX11-GISEL-TRUE16-NEXT:    v_cvt_f16_f32_e32 v0.l, s2
+; GFX11-GISEL-TRUE16-NEXT:    v_cvt_f16_f32_e32 v0.h, s3
+; GFX11-GISEL-TRUE16-NEXT:    s_mov_b32 s2, -1
+; GFX11-GISEL-TRUE16-NEXT:    s_mov_b32 s3, 0x31016000
+; GFX11-GISEL-TRUE16-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX11-GISEL-TRUE16-NEXT:    v_pack_b32_f16 v0, v0.l, v0.h
+; GFX11-GISEL-TRUE16-NEXT:    buffer_store_b32 v0, off, s[0:3], 0
+; GFX11-GISEL-TRUE16-NEXT:    s_endpgm
+;
+; GFX11-GISEL-FAKE16-LABEL: fptrunc_v2f32_to_v2f16:
+; GFX11-GISEL-FAKE16:       ; %bb.0: ...
[truncated]

@broxigarchen broxigarchen changed the title AMDGPU][True16][CodeGen] fptrunc f64 to f16 in true16 AMDGPU][True16][CodeGen] FP_Round f64 to f16 in true16 Mar 6, 2025
Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this change. You shouldn't have to make any of this code conditional on true16. Are you just avoiding the FP_TO_FP16? We can and should do that independently of true16

@broxigarchen
Copy link
Contributor Author

broxigarchen commented Mar 25, 2025

I don't understand this change. You shouldn't have to make any of this code conditional on true16. Are you just avoiding the FP_TO_FP16? We can and should do that independently of true16

Hi Matt. Sorry for the delay. Just checked this PR (I am not the original author). We basically do FP64 lowering differently between true16 and fake16 mode.

The unsafe mode "F64 to F16" is lowered differently between true16/fake16 flow. Fake16 mode lowered to FP_TO_FP16 while in true16 mode it get lowered to two FP_ROUND. (This patch https://reviews.llvm.org/D154528 stops from combining these two FP_ROUND back).

The safe mode lowering is actually the same between true16/fake16. They both eventually go to LowerF64ToF16. But the post lowering operation is changed as in true16 mode we don't need to do fptrunc but just need a bitcast.

@broxigarchen broxigarchen requested a review from arsenm March 26, 2025 14:28
@broxigarchen
Copy link
Contributor Author

Hi @kosarev @jayfoad could you also help to review this? Thanks!

@jayfoad
Copy link
Contributor

jayfoad commented Mar 27, 2025

I don't understand this change. You shouldn't have to make any of this code conditional on true16. Are you just avoiding the FP_TO_FP16? We can and should do that independently of true16

I thought that FP_TO_FP16 should only be used on targets that put fp16 values in 32-bit registers, which AMDGPU traditionally does but True16 does not.

@broxigarchen
Copy link
Contributor Author

broxigarchen commented Mar 31, 2025

I don't understand this change. You shouldn't have to make any of this code conditional on true16. Are you just avoiding the FP_TO_FP16? We can and should do that independently of true16

I thought that FP_TO_FP16 should only be used on targets that put fp16 values in 32-bit registers, which AMDGPU traditionally does but True16 does not.

Hi @arsenm and @jayfoad. Do you think it's good to continue with this implementation? If it's not, please let me know and I am happy to change it here. Thanks for the review!

@jayfoad
Copy link
Contributor

jayfoad commented Apr 1, 2025

I don't understand this change. You shouldn't have to make any of this code conditional on true16. Are you just avoiding the FP_TO_FP16? We can and should do that independently of true16

@arsenm so when should we use FP_TO_FP16? Is it only for targets where f16 is not legal? That would include AMDGPU subtargets where has16BitInsts() is false.

@broxigarchen I think we need to sort this out first, and then revisit the current patch.

@arsenm
Copy link
Contributor

arsenm commented Apr 1, 2025

@arsenm so when should we use FP_TO_FP16? Is it only for targets where f16 is not legal? That would include AMDGPU subtargets where has16BitInsts() is false.

Yes, it's a hack to avoid referring to refer to f16 values as i32. None of this should depend on true16

@broxigarchen
Copy link
Contributor Author

broxigarchen commented Apr 10, 2025

I don't understand this change. You shouldn't have to make any of this code conditional on true16. Are you just avoiding the FP_TO_FP16? We can and should do that independently of true16

@arsenm so when should we use FP_TO_FP16? Is it only for targets where f16 is not legal? That would include AMDGPU subtargets where has16BitInsts() is false.

@broxigarchen I think we need to sort this out first, and then revisit the current patch.

Hi Matt and Jay. Thanks for the explaination! Tried to update this patch according to the comments and please let me know if this still does not sound right

@broxigarchen broxigarchen requested a review from jayfoad April 10, 2025 14:47
@broxigarchen broxigarchen requested a review from kosarev April 11, 2025 16:32
@broxigarchen
Copy link
Contributor Author

gentle ping!

@broxigarchen
Copy link
Contributor Author

Gentle ping again!

Copy link

github-actions bot commented Apr 25, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@broxigarchen
Copy link
Contributor Author

rebased and resolved conflicts

Comment on lines +6902 to +6910
if (!Subtarget->has16BitInsts()) {
SDValue FpToFp16 = DAG.getNode(ISD::FP_TO_FP16, DL, MVT::i32, Src);
SDValue Trunc = DAG.getNode(ISD::TRUNCATE, DL, MVT::i16, FpToFp16);
return DAG.getNode(ISD::BITCAST, DL, MVT::f16, Trunc);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused by this. FP_TO_FP16 does the right thing for an f64 source? Should the unsafe path be before this?

Copy link
Contributor Author

@broxigarchen broxigarchen Apr 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Matt. Before this patch, FP_TO_FP16 is used for all targets for FP_ROUND f64 source so I think it suppose to do the right thing. However, I took a quick look at lit tests and we seems don't have any lit test that covers FP_ROUND f64 source in the old targets that do not support f16 types.

So I would say the FP_TO_FP16 here keeps the same code path for old targets. I think it's better to merge as it and If we want to add test coverage or any fixfor old targets we can do it in another patch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Matt. Do you have any comment on this? Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we seems don't have any lit test that covers FP_ROUND f64 source in the old targets that do not support f16 types.

This is covered, fptrunc.ll has it. fptrunc.f16.ll should also cover it, but that one brokenly only tests with -enable-unsafe-fp-math

It seems to work because we have explicit custom lowering f64 source. Seems like an unnecessary complication over the default expansion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I just realized SI is gfx6. Will create another patch to address these

@broxigarchen broxigarchen requested a review from arsenm April 29, 2025 13:58
@broxigarchen
Copy link
Contributor Author

rebased and squash commits

Comment on lines +6902 to +6910
if (!Subtarget->has16BitInsts()) {
SDValue FpToFp16 = DAG.getNode(ISD::FP_TO_FP16, DL, MVT::i32, Src);
SDValue Trunc = DAG.getNode(ISD::TRUNCATE, DL, MVT::i16, FpToFp16);
return DAG.getNode(ISD::BITCAST, DL, MVT::f16, Trunc);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we seems don't have any lit test that covers FP_ROUND f64 source in the old targets that do not support f16 types.

This is covered, fptrunc.ll has it. fptrunc.f16.ll should also cover it, but that one brokenly only tests with -enable-unsafe-fp-math

It seems to work because we have explicit custom lowering f64 source. Seems like an unnecessary complication over the default expansion

SDValue Src32 = DAG.getNode(ISD::FP_ROUND, DL, MVT::f32, Src, Flags);
return DAG.getNode(ISD::FP_ROUND, DL, MVT::f16, Src32, Flags);
}
SDValue FpToFp16 = LowerF64ToF16Safe(Src, DL, DAG);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still think this is an awkward split

SDValue Trunc = DAG.getNode(ISD::TRUNCATE, DL, MVT::i16, FpToFp16);
return DAG.getNode(ISD::BITCAST, DL, MVT::f16, Trunc);
}
if (getTargetMachine().Options.UnsafeFPMath) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in a follow up should move to use the flags from the node

@broxigarchen broxigarchen merged commit 9d907a2 into llvm:main May 8, 2025
11 checks passed
@llvm-ci
Copy link
Collaborator

llvm-ci commented May 9, 2025

LLVM Buildbot has detected a new failure on builder clang-x86_64-debian-fast running on gribozavr4 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/56/builds/25371

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'Clangd :: crash-parse.test' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
Content-Length: 3401

{
  "id": 0,
  "jsonrpc": "2.0",
  "result": {
    "capabilities": {
      "astProvider": true,
      "callHierarchyProvider": true,
      "clangdInlayHintsProvider": true,
      "codeActionProvider": true,
      "compilationDatabase": {
        "automaticReload": true
      },
      "completionProvider": {
        "resolveProvider": false,
        "triggerCharacters": [
          ".",
          "<",
          ">",
          ":",
          "\"",
          "/",
          "*"
        ]
      },
      "declarationProvider": true,
      "definitionProvider": true,
      "documentFormattingProvider": true,
      "documentHighlightProvider": true,
      "documentLinkProvider": {
        "resolveProvider": false
      },
      "documentOnTypeFormattingProvider": {
        "firstTriggerCharacter": "\n",
        "moreTriggerCharacter": []
      },
      "documentRangeFormattingProvider": true,
      "documentSymbolProvider": true,
      "executeCommandProvider": {
        "commands": [
          "clangd.applyFix",
          "clangd.applyRename",
          "clangd.applyTweak"
        ]
...

lenary added a commit to lenary/llvm-project that referenced this pull request May 9, 2025
* main: (420 commits)
  [AArch64] Merge scaled and unscaled narrow zero stores (llvm#136705)
  [RISCV] One last migration to getInsertSubvector [nfc]
  [flang][OpenMP] Update `do concurrent` mapping pass to use `fir.do_concurrent` op (llvm#138489)
  [MLIR][LLVM] Fix llvm.mlir.global mismatching print and parser order (llvm#138986)
  [lld][NFC] Fix minor typo in docs (llvm#138898)
  [RISCV] Migrate getConstant indexed insert/extract subvector to new API (llvm#139111)
  GlobalISel: Translate minimumnum and maximumnum (llvm#139106)
  [MemProf] Simplify unittest save and restore of options (llvm#139117)
  [BOLT][AArch64] Patch functions targeted by optional relocs (llvm#138750)
  [Coverage] Support -fprofile-list for cold function coverage (llvm#136333)
  Remove unused forward decl (llvm#139108)
  [AMDGPU][NFC] Get rid of OPW constants. (llvm#139074)
  [CIR] Upstream extract op for VectorType (llvm#138413)
  [mlir][xegpu] Handle scalar uniform ops in SIMT distribution.  (llvm#138593)
  [GlobalISel][AMDGPU] Fix handling of v2i128 type for AND, OR, XOR (llvm#138574)
  AMDGPU][True16][CodeGen] FP_Round f64 to f16 in true16 (llvm#128911)
  Reland [Clang] Deprecate `__is_trivially_relocatable` (llvm#139061)
  [HLSL][NFC] Stricter Overload Tests (clamp,max,min,pow) (llvm#138993)
  [MLIR] Fixing the memref linearization size computation for non-packed memref (llvm#138922)
  [TableGen][NFC] Use early exit to simplify large block in emitAction. (llvm#138220)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants