[AMDGPU][True16][CodeGen] FP_Round f64 to f16 in true16 #128911
@llvm/pr-subscribers-backend-amdgpu

Author: Brox Chen (broxigarchen)

Changes: fptrunc from f64 to f16 in true16 mode

Patch is 62.05 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/128911.diff

5 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index ade81f17ecca5..dca2983f94be8 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -3578,15 +3578,22 @@ SDValue AMDGPUTargetLowering::LowerFP_TO_FP16(SDValue Op, SelectionDAG &DAG) con
return SDValue();
}
- assert(N0.getSimpleValueType() == MVT::f64);
+ return LowerF64ToF16(N0, Op.getValueType(), DL, DAG);
+}
+
+SDValue AMDGPUTargetLowering::LowerF64ToF16(SDValue Src, EVT ResTy,
+ const SDLoc &DL,
+ SelectionDAG &DAG) const {
+ assert(Src.getSimpleValueType() == MVT::f64);
// f64 -> f16 conversion using round-to-nearest-even rounding mode.
+ // TODO: We can generate better code for True16.
const unsigned ExpMask = 0x7ff;
const unsigned ExpBiasf64 = 1023;
const unsigned ExpBiasf16 = 15;
SDValue Zero = DAG.getConstant(0, DL, MVT::i32);
SDValue One = DAG.getConstant(1, DL, MVT::i32);
- SDValue U = DAG.getNode(ISD::BITCAST, DL, MVT::i64, N0);
+ SDValue U = DAG.getNode(ISD::BITCAST, DL, MVT::i64, Src);
SDValue UH = DAG.getNode(ISD::SRL, DL, MVT::i64, U,
DAG.getConstant(32, DL, MVT::i64));
UH = DAG.getZExtOrTrunc(UH, DL, MVT::i32);
@@ -3661,7 +3668,7 @@ SDValue AMDGPUTargetLowering::LowerFP_TO_FP16(SDValue Op, SelectionDAG &DAG) con
DAG.getConstant(0x8000, DL, MVT::i32));
V = DAG.getNode(ISD::OR, DL, MVT::i32, Sign, V);
- return DAG.getZExtOrTrunc(V, DL, Op.getValueType());
+ return DAG.getZExtOrTrunc(V, DL, ResTy);
}
SDValue AMDGPUTargetLowering::LowerFP_TO_INT(const SDValue Op,
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
index c74dc7942f52c..7b73b7b6a7127 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h
@@ -97,6 +97,9 @@ class AMDGPUTargetLowering : public TargetLowering {
SDValue LowerFP_TO_FP16(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerFP_TO_INT(SDValue Op, SelectionDAG &DAG) const;
+ SDValue LowerF64ToF16(SDValue Src, EVT ResTy, const SDLoc &DL,
+ SelectionDAG &DAG) const;
+
SDValue LowerSIGN_EXTEND_INREG(SDValue Op, SelectionDAG &DAG) const;
protected:
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index fe095414e5172..1c9c8422e2257 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -6825,6 +6825,17 @@ SDValue SITargetLowering::lowerFP_ROUND(SDValue Op, SelectionDAG &DAG) const {
SDLoc DL(Op);
+ if (Subtarget->useRealTrue16Insts()) {
+ if (getTargetMachine().Options.UnsafeFPMath) {
+ SDValue Flags = Op.getOperand(1);
+ SDValue Src32 = DAG.getNode(ISD::FP_ROUND, DL, MVT::f32, Src, Flags);
+ return DAG.getNode(ISD::FP_ROUND, DL, MVT::f16, Src32, Flags);
+ }
+
+ SDValue FpToFp16 = LowerF64ToF16(Src, MVT::i16, DL, DAG);
+ return DAG.getNode(ISD::BITCAST, DL, MVT::f16, FpToFp16);
+ }
+
SDValue FpToFp16 = DAG.getNode(ISD::FP_TO_FP16, DL, MVT::i32, Src);
SDValue Trunc = DAG.getNode(ISD::TRUNCATE, DL, MVT::i16, FpToFp16);
return DAG.getNode(ISD::BITCAST, DL, MVT::f16, Trunc);
diff --git a/llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll b/llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll
index 188b2ada64686..9057c81cbd555 100644
--- a/llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll
+++ b/llvm/test/CodeGen/AMDGPU/fptrunc.f16.ll
@@ -7,8 +7,10 @@
; RUN: llc -amdgpu-scalarize-global-loads=false -mtriple=amdgcn -mcpu=gfx900 -global-isel=1 -mattr=-flat-for-global -denormal-fp-math=preserve-sign -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -enable-var-scope -check-prefixes=GFX9-GISEL %s
; RUN: llc -amdgpu-scalarize-global-loads=false -mtriple=amdgcn -mcpu=gfx950 -global-isel=0 -mattr=-flat-for-global -denormal-fp-math=preserve-sign -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -enable-var-scope -check-prefixes=GFX950-SDAG %s
; RUN: llc -amdgpu-scalarize-global-loads=false -mtriple=amdgcn -mcpu=gfx950 -global-isel=1 -mattr=-flat-for-global -denormal-fp-math=preserve-sign -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -enable-var-scope -check-prefixes=GFX950-GISEL %s
-; RUN: llc -amdgpu-scalarize-global-loads=false -mtriple=amdgcn -mcpu=gfx1100 -global-isel=0 -mattr=-flat-for-global -denormal-fp-math=preserve-sign -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -enable-var-scope -check-prefixes=GFX11-SDAG %s
-; RUN: llc -amdgpu-scalarize-global-loads=false -mtriple=amdgcn -mcpu=gfx1100 -global-isel=1 -mattr=-flat-for-global -denormal-fp-math=preserve-sign -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -enable-var-scope -check-prefixes=GFX11-GISEL %s
+; RUN: llc -amdgpu-scalarize-global-loads=false -mtriple=amdgcn -mcpu=gfx1100 -global-isel=0 -mattr=-flat-for-global,+real-true16 -denormal-fp-math=preserve-sign -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -enable-var-scope -check-prefixes=GFX11-SDAG-TRUE16 %s
+; RUN: llc -amdgpu-scalarize-global-loads=false -mtriple=amdgcn -mcpu=gfx1100 -global-isel=0 -mattr=-flat-for-global,-real-true16 -denormal-fp-math=preserve-sign -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -enable-var-scope -check-prefixes=GFX11-SDAG-FAKE16 %s
+; RUN: llc -amdgpu-scalarize-global-loads=false -mtriple=amdgcn -mcpu=gfx1100 -global-isel=1 -mattr=-flat-for-global,+real-true16 -denormal-fp-math=preserve-sign -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -enable-var-scope -check-prefixes=GFX11-GISEL-TRUE16 %s
+; RUN: llc -amdgpu-scalarize-global-loads=false -mtriple=amdgcn -mcpu=gfx1100 -global-isel=1 -mattr=-flat-for-global,-real-true16 -denormal-fp-math=preserve-sign -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -enable-var-scope -check-prefixes=GFX11-GISEL-FAKE16 %s
define amdgpu_kernel void @fptrunc_f32_to_f16(
; SI-SDAG-LABEL: fptrunc_f32_to_f16:
@@ -131,35 +133,65 @@ define amdgpu_kernel void @fptrunc_f32_to_f16(
; GFX950-GISEL-NEXT: buffer_store_short v0, off, s[0:3], 0
; GFX950-GISEL-NEXT: s_endpgm
;
-; GFX11-SDAG-LABEL: fptrunc_f32_to_f16:
-; GFX11-SDAG: ; %bb.0: ; %entry
-; GFX11-SDAG-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
-; GFX11-SDAG-NEXT: s_mov_b32 s6, -1
-; GFX11-SDAG-NEXT: s_mov_b32 s7, 0x31016000
-; GFX11-SDAG-NEXT: s_mov_b32 s10, s6
-; GFX11-SDAG-NEXT: s_mov_b32 s11, s7
-; GFX11-SDAG-NEXT: s_waitcnt lgkmcnt(0)
-; GFX11-SDAG-NEXT: s_mov_b32 s8, s2
-; GFX11-SDAG-NEXT: s_mov_b32 s9, s3
-; GFX11-SDAG-NEXT: s_mov_b32 s4, s0
-; GFX11-SDAG-NEXT: buffer_load_b32 v0, off, s[8:11], 0
-; GFX11-SDAG-NEXT: s_mov_b32 s5, s1
-; GFX11-SDAG-NEXT: s_waitcnt vmcnt(0)
-; GFX11-SDAG-NEXT: v_cvt_f16_f32_e32 v0, v0
-; GFX11-SDAG-NEXT: buffer_store_b16 v0, off, s[4:7], 0
-; GFX11-SDAG-NEXT: s_endpgm
-;
-; GFX11-GISEL-LABEL: fptrunc_f32_to_f16:
-; GFX11-GISEL: ; %bb.0: ; %entry
-; GFX11-GISEL-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
-; GFX11-GISEL-NEXT: s_waitcnt lgkmcnt(0)
-; GFX11-GISEL-NEXT: s_load_b32 s2, s[2:3], 0x0
-; GFX11-GISEL-NEXT: s_mov_b32 s3, 0x31016000
-; GFX11-GISEL-NEXT: s_waitcnt lgkmcnt(0)
-; GFX11-GISEL-NEXT: v_cvt_f16_f32_e32 v0, s2
-; GFX11-GISEL-NEXT: s_mov_b32 s2, -1
-; GFX11-GISEL-NEXT: buffer_store_b16 v0, off, s[0:3], 0
-; GFX11-GISEL-NEXT: s_endpgm
+; GFX11-SDAG-TRUE16-LABEL: fptrunc_f32_to_f16:
+; GFX11-SDAG-TRUE16: ; %bb.0: ; %entry
+; GFX11-SDAG-TRUE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s6, -1
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s7, 0x31016000
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s10, s6
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s11, s7
+; GFX11-SDAG-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s8, s2
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s9, s3
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s4, s0
+; GFX11-SDAG-TRUE16-NEXT: buffer_load_b32 v0, off, s[8:11], 0
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s5, s1
+; GFX11-SDAG-TRUE16-NEXT: s_waitcnt vmcnt(0)
+; GFX11-SDAG-TRUE16-NEXT: v_cvt_f16_f32_e32 v0.l, v0
+; GFX11-SDAG-TRUE16-NEXT: buffer_store_b16 v0, off, s[4:7], 0
+; GFX11-SDAG-TRUE16-NEXT: s_endpgm
+;
+; GFX11-SDAG-FAKE16-LABEL: fptrunc_f32_to_f16:
+; GFX11-SDAG-FAKE16: ; %bb.0: ; %entry
+; GFX11-SDAG-FAKE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s6, -1
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s7, 0x31016000
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s10, s6
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s11, s7
+; GFX11-SDAG-FAKE16-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s8, s2
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s9, s3
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s4, s0
+; GFX11-SDAG-FAKE16-NEXT: buffer_load_b32 v0, off, s[8:11], 0
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s5, s1
+; GFX11-SDAG-FAKE16-NEXT: s_waitcnt vmcnt(0)
+; GFX11-SDAG-FAKE16-NEXT: v_cvt_f16_f32_e32 v0, v0
+; GFX11-SDAG-FAKE16-NEXT: buffer_store_b16 v0, off, s[4:7], 0
+; GFX11-SDAG-FAKE16-NEXT: s_endpgm
+;
+; GFX11-GISEL-TRUE16-LABEL: fptrunc_f32_to_f16:
+; GFX11-GISEL-TRUE16: ; %bb.0: ; %entry
+; GFX11-GISEL-TRUE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-GISEL-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-GISEL-TRUE16-NEXT: s_load_b32 s2, s[2:3], 0x0
+; GFX11-GISEL-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
+; GFX11-GISEL-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-GISEL-TRUE16-NEXT: v_cvt_f16_f32_e32 v0.l, s2
+; GFX11-GISEL-TRUE16-NEXT: s_mov_b32 s2, -1
+; GFX11-GISEL-TRUE16-NEXT: buffer_store_b16 v0, off, s[0:3], 0
+; GFX11-GISEL-TRUE16-NEXT: s_endpgm
+;
+; GFX11-GISEL-FAKE16-LABEL: fptrunc_f32_to_f16:
+; GFX11-GISEL-FAKE16: ; %bb.0: ; %entry
+; GFX11-GISEL-FAKE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-GISEL-FAKE16-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-GISEL-FAKE16-NEXT: s_load_b32 s2, s[2:3], 0x0
+; GFX11-GISEL-FAKE16-NEXT: s_mov_b32 s3, 0x31016000
+; GFX11-GISEL-FAKE16-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-GISEL-FAKE16-NEXT: v_cvt_f16_f32_e32 v0, s2
+; GFX11-GISEL-FAKE16-NEXT: s_mov_b32 s2, -1
+; GFX11-GISEL-FAKE16-NEXT: buffer_store_b16 v0, off, s[0:3], 0
+; GFX11-GISEL-FAKE16-NEXT: s_endpgm
ptr addrspace(1) %r,
ptr addrspace(1) %a) {
entry:
@@ -298,39 +330,73 @@ define amdgpu_kernel void @fptrunc_f64_to_f16(
; GFX950-GISEL-NEXT: buffer_store_short v0, off, s[0:3], 0
; GFX950-GISEL-NEXT: s_endpgm
;
-; GFX11-SDAG-LABEL: fptrunc_f64_to_f16:
-; GFX11-SDAG: ; %bb.0: ; %entry
-; GFX11-SDAG-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
-; GFX11-SDAG-NEXT: s_mov_b32 s6, -1
-; GFX11-SDAG-NEXT: s_mov_b32 s7, 0x31016000
-; GFX11-SDAG-NEXT: s_mov_b32 s10, s6
-; GFX11-SDAG-NEXT: s_mov_b32 s11, s7
-; GFX11-SDAG-NEXT: s_waitcnt lgkmcnt(0)
-; GFX11-SDAG-NEXT: s_mov_b32 s8, s2
-; GFX11-SDAG-NEXT: s_mov_b32 s9, s3
-; GFX11-SDAG-NEXT: s_mov_b32 s4, s0
-; GFX11-SDAG-NEXT: buffer_load_b64 v[0:1], off, s[8:11], 0
-; GFX11-SDAG-NEXT: s_mov_b32 s5, s1
-; GFX11-SDAG-NEXT: s_waitcnt vmcnt(0)
-; GFX11-SDAG-NEXT: v_cvt_f32_f64_e32 v0, v[0:1]
-; GFX11-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-SDAG-NEXT: v_cvt_f16_f32_e32 v0, v0
-; GFX11-SDAG-NEXT: buffer_store_b16 v0, off, s[4:7], 0
-; GFX11-SDAG-NEXT: s_endpgm
-;
-; GFX11-GISEL-LABEL: fptrunc_f64_to_f16:
-; GFX11-GISEL: ; %bb.0: ; %entry
-; GFX11-GISEL-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
-; GFX11-GISEL-NEXT: s_waitcnt lgkmcnt(0)
-; GFX11-GISEL-NEXT: s_load_b64 s[2:3], s[2:3], 0x0
-; GFX11-GISEL-NEXT: s_waitcnt lgkmcnt(0)
-; GFX11-GISEL-NEXT: v_cvt_f32_f64_e32 v0, s[2:3]
-; GFX11-GISEL-NEXT: s_mov_b32 s2, -1
-; GFX11-GISEL-NEXT: s_mov_b32 s3, 0x31016000
-; GFX11-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-GISEL-NEXT: v_cvt_f16_f32_e32 v0, v0
-; GFX11-GISEL-NEXT: buffer_store_b16 v0, off, s[0:3], 0
-; GFX11-GISEL-NEXT: s_endpgm
+; GFX11-SDAG-TRUE16-LABEL: fptrunc_f64_to_f16:
+; GFX11-SDAG-TRUE16: ; %bb.0: ; %entry
+; GFX11-SDAG-TRUE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s6, -1
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s7, 0x31016000
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s10, s6
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s11, s7
+; GFX11-SDAG-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s8, s2
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s9, s3
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s4, s0
+; GFX11-SDAG-TRUE16-NEXT: buffer_load_b64 v[0:1], off, s[8:11], 0
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s5, s1
+; GFX11-SDAG-TRUE16-NEXT: s_waitcnt vmcnt(0)
+; GFX11-SDAG-TRUE16-NEXT: v_cvt_f32_f64_e32 v0, v[0:1]
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-SDAG-TRUE16-NEXT: v_cvt_f16_f32_e32 v0.l, v0
+; GFX11-SDAG-TRUE16-NEXT: buffer_store_b16 v0, off, s[4:7], 0
+; GFX11-SDAG-TRUE16-NEXT: s_endpgm
+;
+; GFX11-SDAG-FAKE16-LABEL: fptrunc_f64_to_f16:
+; GFX11-SDAG-FAKE16: ; %bb.0: ; %entry
+; GFX11-SDAG-FAKE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s6, -1
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s7, 0x31016000
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s10, s6
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s11, s7
+; GFX11-SDAG-FAKE16-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s8, s2
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s9, s3
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s4, s0
+; GFX11-SDAG-FAKE16-NEXT: buffer_load_b64 v[0:1], off, s[8:11], 0
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s5, s1
+; GFX11-SDAG-FAKE16-NEXT: s_waitcnt vmcnt(0)
+; GFX11-SDAG-FAKE16-NEXT: v_cvt_f32_f64_e32 v0, v[0:1]
+; GFX11-SDAG-FAKE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-SDAG-FAKE16-NEXT: v_cvt_f16_f32_e32 v0, v0
+; GFX11-SDAG-FAKE16-NEXT: buffer_store_b16 v0, off, s[4:7], 0
+; GFX11-SDAG-FAKE16-NEXT: s_endpgm
+;
+; GFX11-GISEL-TRUE16-LABEL: fptrunc_f64_to_f16:
+; GFX11-GISEL-TRUE16: ; %bb.0: ; %entry
+; GFX11-GISEL-TRUE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-GISEL-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-GISEL-TRUE16-NEXT: s_load_b64 s[2:3], s[2:3], 0x0
+; GFX11-GISEL-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-GISEL-TRUE16-NEXT: v_cvt_f32_f64_e32 v0, s[2:3]
+; GFX11-GISEL-TRUE16-NEXT: s_mov_b32 s2, -1
+; GFX11-GISEL-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
+; GFX11-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-GISEL-TRUE16-NEXT: v_cvt_f16_f32_e32 v0.l, v0
+; GFX11-GISEL-TRUE16-NEXT: buffer_store_b16 v0, off, s[0:3], 0
+; GFX11-GISEL-TRUE16-NEXT: s_endpgm
+;
+; GFX11-GISEL-FAKE16-LABEL: fptrunc_f64_to_f16:
+; GFX11-GISEL-FAKE16: ; %bb.0: ; %entry
+; GFX11-GISEL-FAKE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-GISEL-FAKE16-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-GISEL-FAKE16-NEXT: s_load_b64 s[2:3], s[2:3], 0x0
+; GFX11-GISEL-FAKE16-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-GISEL-FAKE16-NEXT: v_cvt_f32_f64_e32 v0, s[2:3]
+; GFX11-GISEL-FAKE16-NEXT: s_mov_b32 s2, -1
+; GFX11-GISEL-FAKE16-NEXT: s_mov_b32 s3, 0x31016000
+; GFX11-GISEL-FAKE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-GISEL-FAKE16-NEXT: v_cvt_f16_f32_e32 v0, v0
+; GFX11-GISEL-FAKE16-NEXT: buffer_store_b16 v0, off, s[0:3], 0
+; GFX11-GISEL-FAKE16-NEXT: s_endpgm
ptr addrspace(1) %r,
ptr addrspace(1) %a) {
entry:
@@ -477,41 +543,77 @@ define amdgpu_kernel void @fptrunc_v2f32_to_v2f16(
; GFX950-GISEL-NEXT: buffer_store_dword v0, off, s[0:3], 0
; GFX950-GISEL-NEXT: s_endpgm
;
-; GFX11-SDAG-LABEL: fptrunc_v2f32_to_v2f16:
-; GFX11-SDAG: ; %bb.0: ; %entry
-; GFX11-SDAG-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
-; GFX11-SDAG-NEXT: s_mov_b32 s6, -1
-; GFX11-SDAG-NEXT: s_mov_b32 s7, 0x31016000
-; GFX11-SDAG-NEXT: s_mov_b32 s10, s6
-; GFX11-SDAG-NEXT: s_mov_b32 s11, s7
-; GFX11-SDAG-NEXT: s_waitcnt lgkmcnt(0)
-; GFX11-SDAG-NEXT: s_mov_b32 s8, s2
-; GFX11-SDAG-NEXT: s_mov_b32 s9, s3
-; GFX11-SDAG-NEXT: s_mov_b32 s4, s0
-; GFX11-SDAG-NEXT: buffer_load_b64 v[0:1], off, s[8:11], 0
-; GFX11-SDAG-NEXT: s_mov_b32 s5, s1
-; GFX11-SDAG-NEXT: s_waitcnt vmcnt(0)
-; GFX11-SDAG-NEXT: v_cvt_f16_f32_e32 v1, v1
-; GFX11-SDAG-NEXT: v_cvt_f16_f32_e32 v0, v0
-; GFX11-SDAG-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-SDAG-NEXT: v_pack_b32_f16 v0, v0, v1
-; GFX11-SDAG-NEXT: buffer_store_b32 v0, off, s[4:7], 0
-; GFX11-SDAG-NEXT: s_endpgm
-;
-; GFX11-GISEL-LABEL: fptrunc_v2f32_to_v2f16:
-; GFX11-GISEL: ; %bb.0: ; %entry
-; GFX11-GISEL-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
-; GFX11-GISEL-NEXT: s_waitcnt lgkmcnt(0)
-; GFX11-GISEL-NEXT: s_load_b64 s[2:3], s[2:3], 0x0
-; GFX11-GISEL-NEXT: s_waitcnt lgkmcnt(0)
-; GFX11-GISEL-NEXT: v_cvt_f16_f32_e32 v0, s2
-; GFX11-GISEL-NEXT: v_cvt_f16_f32_e32 v1, s3
-; GFX11-GISEL-NEXT: s_mov_b32 s2, -1
-; GFX11-GISEL-NEXT: s_mov_b32 s3, 0x31016000
-; GFX11-GISEL-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-GISEL-NEXT: v_pack_b32_f16 v0, v0, v1
-; GFX11-GISEL-NEXT: buffer_store_b32 v0, off, s[0:3], 0
-; GFX11-GISEL-NEXT: s_endpgm
+; GFX11-SDAG-TRUE16-LABEL: fptrunc_v2f32_to_v2f16:
+; GFX11-SDAG-TRUE16: ; %bb.0: ; %entry
+; GFX11-SDAG-TRUE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s6, -1
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s7, 0x31016000
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s10, s6
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s11, s7
+; GFX11-SDAG-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s8, s2
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s9, s3
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s4, s0
+; GFX11-SDAG-TRUE16-NEXT: buffer_load_b64 v[1:2], off, s[8:11], 0
+; GFX11-SDAG-TRUE16-NEXT: s_mov_b32 s5, s1
+; GFX11-SDAG-TRUE16-NEXT: s_waitcnt vmcnt(0)
+; GFX11-SDAG-TRUE16-NEXT: v_cvt_f16_f32_e32 v0.l, v2
+; GFX11-SDAG-TRUE16-NEXT: v_cvt_f16_f32_e32 v0.h, v1
+; GFX11-SDAG-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-SDAG-TRUE16-NEXT: v_pack_b32_f16 v0, v0.h, v0.l
+; GFX11-SDAG-TRUE16-NEXT: buffer_store_b32 v0, off, s[4:7], 0
+; GFX11-SDAG-TRUE16-NEXT: s_endpgm
+;
+; GFX11-SDAG-FAKE16-LABEL: fptrunc_v2f32_to_v2f16:
+; GFX11-SDAG-FAKE16: ; %bb.0: ; %entry
+; GFX11-SDAG-FAKE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s6, -1
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s7, 0x31016000
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s10, s6
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s11, s7
+; GFX11-SDAG-FAKE16-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s8, s2
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s9, s3
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s4, s0
+; GFX11-SDAG-FAKE16-NEXT: buffer_load_b64 v[0:1], off, s[8:11], 0
+; GFX11-SDAG-FAKE16-NEXT: s_mov_b32 s5, s1
+; GFX11-SDAG-FAKE16-NEXT: s_waitcnt vmcnt(0)
+; GFX11-SDAG-FAKE16-NEXT: v_cvt_f16_f32_e32 v1, v1
+; GFX11-SDAG-FAKE16-NEXT: v_cvt_f16_f32_e32 v0, v0
+; GFX11-SDAG-FAKE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-SDAG-FAKE16-NEXT: v_pack_b32_f16 v0, v0, v1
+; GFX11-SDAG-FAKE16-NEXT: buffer_store_b32 v0, off, s[4:7], 0
+; GFX11-SDAG-FAKE16-NEXT: s_endpgm
+;
+; GFX11-GISEL-TRUE16-LABEL: fptrunc_v2f32_to_v2f16:
+; GFX11-GISEL-TRUE16: ; %bb.0: ; %entry
+; GFX11-GISEL-TRUE16-NEXT: s_load_b128 s[0:3], s[4:5], 0x24
+; GFX11-GISEL-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-GISEL-TRUE16-NEXT: s_load_b64 s[2:3], s[2:3], 0x0
+; GFX11-GISEL-TRUE16-NEXT: s_waitcnt lgkmcnt(0)
+; GFX11-GISEL-TRUE16-NEXT: v_cvt_f16_f32_e32 v0.l, s2
+; GFX11-GISEL-TRUE16-NEXT: v_cvt_f16_f32_e32 v0.h, s3
+; GFX11-GISEL-TRUE16-NEXT: s_mov_b32 s2, -1
+; GFX11-GISEL-TRUE16-NEXT: s_mov_b32 s3, 0x31016000
+; GFX11-GISEL-TRUE16-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX11-GISEL-TRUE16-NEXT: v_pack_b32_f16 v0, v0.l, v0.h
+; GFX11-GISEL-TRUE16-NEXT: buffer_store_b32 v0, off, s[0:3], 0
+; GFX11-GISEL-TRUE16-NEXT: s_endpgm
+;
+; GFX11-GISEL-FAKE16-LABEL: fptrunc_v2f32_to_v2f16:
+; GFX11-GISEL-FAKE16: ; %bb.0: ...
[truncated]
I don't understand this change. You shouldn't have to make any of this code conditional on true16. Are you just avoiding the FP_TO_FP16? We can and should do that independently of true16
Hi Matt. Sorry for the delay. Just checked this PR (I am not the original author). We basically do f64 lowering differently between true16 and fake16 mode. The unsafe-mode f64-to-f16 path is lowered differently between the true16 and fake16 flows: fake16 mode lowers to FP_TO_FP16, while true16 mode lowers to two FP_ROUNDs (this patch https://reviews.llvm.org/D154528 stops these two FP_ROUNDs from being combined back). The safe-mode lowering is actually the same between true16 and fake16: both eventually go to LowerF64ToF16. But the post-lowering operation changes, since in true16 mode we don't need to do an fptrunc but just a bitcast.
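The discussion above hinges on the fact that rounding f64 -> f32 -> f16 (double rounding) is not equivalent to rounding f64 -> f16 in one step, which is why D154528 keeps the two FP_ROUNDs apart and why the two-step form is gated on unsafe FP math. A minimal stdlib-only Python sketch (helper names are mine) of a value where the two disagree:

```python
import struct

def f16(x: float) -> float:
    """Round a Python float (f64) to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def f32(x: float) -> float:
    """Round a Python float (f64) to IEEE single precision and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# A double just above the midpoint of two adjacent f16 values. The nudge
# (2^-30) is below half an f32 ulp near 1.0, so the f64 -> f32 step loses it.
x = 1.0 + 2.0**-11 + 2.0**-30

print(f16(x))       # 1.0009765625: one-step rounding sees the nudge, rounds up
print(f16(f32(x)))  # 1.0: nudge is lost, then the exact tie rounds to even (down)
```

Either answer is "correct to within one rounding" (hence acceptable under unsafe FP math), but only the one-step result is the correctly rounded f16.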
I thought that FP_TO_FP16 should only be used on targets that put fp16 values in 32-bit registers, which AMDGPU traditionally does but True16 does not.
Hi @arsenm and @jayfoad. Do you think it's good to continue with this implementation? If it's not, please let me know and I am happy to change it here. Thanks for the review!
@arsenm so when should we use FP_TO_FP16? Is it only for targets where f16 is not legal? That would include AMDGPU subtargets where has16BitInsts() is false. @broxigarchen I think we need to sort this out first, and then revisit the current patch.
Yes, it's a hack to avoid referring to f16 values as i32. None of this should depend on true16
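As a concrete model of the "f16 bits carried in a wider integer" hack described here (a sketch with made-up helper names, using Python's half-precision struct codes): FP_TO_FP16 produces an i32 whose low 16 bits are the half-precision bit pattern, so a target that keeps f16 values in 32-bit registers can move them around as integers; the TRUNCATE + BITCAST pair in the patch just peels that payload back out.

```python
import struct

def fp_to_fp16(x: float) -> int:
    """Model ISD::FP_TO_FP16: round f64 to half, zero-extend the bits to i32."""
    bits16 = struct.unpack("<H", struct.pack("<e", x))[0]
    return bits16 & 0xFFFFFFFF  # i32 container; high 16 bits are zero

def truncate_bitcast(v: int) -> float:
    """Model TRUNCATE i32 -> i16 followed by BITCAST i16 -> f16."""
    return struct.unpack("<e", struct.pack("<H", v & 0xFFFF))[0]

v = fp_to_fp16(1.5)
print(hex(v))               # 0x3e00: half-precision 1.5 in the low 16 bits
print(truncate_bitcast(v))  # 1.5
```

(Note: Python's "e" pack raises OverflowError for values outside the half range instead of producing infinity, so this models only the in-range behavior.)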
Hi Matt and Jay. Thanks for the explanation! I tried to update this patch according to the comments; please let me know if this still does not sound right
gentle ping!
Gentle ping again!
✅ With the latest revision this PR passed the C/C++ code formatter.
rebased and resolved conflicts
if (!Subtarget->has16BitInsts()) {
  SDValue FpToFp16 = DAG.getNode(ISD::FP_TO_FP16, DL, MVT::i32, Src);
  SDValue Trunc = DAG.getNode(ISD::TRUNCATE, DL, MVT::i16, FpToFp16);
  return DAG.getNode(ISD::BITCAST, DL, MVT::f16, Trunc);
}
I'm confused by this. FP_TO_FP16 does the right thing for an f64 source? Should the unsafe path be before this?
Hi Matt. Before this patch, FP_TO_FP16 was used on all targets for FP_ROUND with an f64 source, so I think it is supposed to do the right thing. However, I took a quick look at the lit tests and we don't seem to have any lit test that covers FP_ROUND with an f64 source on the old targets that do not support f16 types.
So I would say the FP_TO_FP16 here keeps the same code path for the old targets. I think it's better to merge as is, and if we want to add test coverage or any fix for the old targets we can do it in another patch.
Hi Matt. Do you have any comment on this? Thanks!
we don't seem to have any lit test that covers FP_ROUND with an f64 source on the old targets that do not support f16 types.
This is covered, fptrunc.ll has it. fptrunc.f16.ll should also cover it, but that one brokenly only tests with -enable-unsafe-fp-math
It seems to work because we have explicit custom lowering for the f64 source. Seems like an unnecessary complication over the default expansion
I see. I just realized SI is gfx6. Will create another patch to address these
rebased and squashed commits
  SDValue Src32 = DAG.getNode(ISD::FP_ROUND, DL, MVT::f32, Src, Flags);
  return DAG.getNode(ISD::FP_ROUND, DL, MVT::f16, Src32, Flags);
}
SDValue FpToFp16 = LowerF64ToF16Safe(Src, DL, DAG);
Still think this is an awkward split
  SDValue Trunc = DAG.getNode(ISD::TRUNCATE, DL, MVT::i16, FpToFp16);
  return DAG.getNode(ISD::BITCAST, DL, MVT::f16, Trunc);
}
if (getTargetMachine().Options.UnsafeFPMath) {
In a follow-up we should move to using the flags from the node
LLVM Buildbot has detected a new failure on builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/56/builds/25371
* main: (420 commits)
  [AArch64] Merge scaled and unscaled narrow zero stores (llvm#136705)
  [RISCV] One last migration to getInsertSubvector [nfc]
  [flang][OpenMP] Update `do concurrent` mapping pass to use `fir.do_concurrent` op (llvm#138489)
  [MLIR][LLVM] Fix llvm.mlir.global mismatching print and parser order (llvm#138986)
  [lld][NFC] Fix minor typo in docs (llvm#138898)
  [RISCV] Migrate getConstant indexed insert/extract subvector to new API (llvm#139111)
  GlobalISel: Translate minimumnum and maximumnum (llvm#139106)
  [MemProf] Simplify unittest save and restore of options (llvm#139117)
  [BOLT][AArch64] Patch functions targeted by optional relocs (llvm#138750)
  [Coverage] Support -fprofile-list for cold function coverage (llvm#136333)
  Remove unused forward decl (llvm#139108)
  [AMDGPU][NFC] Get rid of OPW constants. (llvm#139074)
  [CIR] Upstream extract op for VectorType (llvm#138413)
  [mlir][xegpu] Handle scalar uniform ops in SIMT distribution. (llvm#138593)
  [GlobalISel][AMDGPU] Fix handling of v2i128 type for AND, OR, XOR (llvm#138574)
  AMDGPU][True16][CodeGen] FP_Round f64 to f16 in true16 (llvm#128911)
  Reland [Clang] Deprecate `__is_trivially_relocatable` (llvm#139061)
  [HLSL][NFC] Stricter Overload Tests (clamp,max,min,pow) (llvm#138993)
  [MLIR] Fixing the memref linearization size computation for non-packed memref (llvm#138922)
  [TableGen][NFC] Use early exit to simplify large block in emitAction. (llvm#138220)
  ...
Update the f64-to-f16 lowering for targets that support f16 types.
In unsafe mode, this is lowered to two FP_ROUNDs (https://reviews.llvm.org/D154528 stops these two FP_ROUNDs from being combined back). In safe mode, it selects LowerF64ToF16, which uses round-to-nearest-even rounding.
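The safe path's round-to-nearest-even conversion can be illustrated with an integer-only Python sketch. This mirrors the shape of the expansion (bitcast, exponent rebias, mantissa rounding with a sticky remainder), not the exact DAG node sequence of LowerF64ToF16; the function name is mine:

```python
import struct

def f64_to_f16_bits(x: float) -> int:
    """Convert an f64 to f16 bits using round-to-nearest-even."""
    u = struct.unpack("<Q", struct.pack("<d", x))[0]  # bitcast f64 -> i64
    sign = (u >> 48) & 0x8000
    exp = (u >> 52) & 0x7FF                           # biased f64 exponent
    frac = u & ((1 << 52) - 1)

    if exp == 0x7FF:                                  # Inf or NaN
        return sign | 0x7C00 | (0x0200 if frac else 0)  # canonical quiet NaN
    e = exp - 1023 + 15                               # rebias f64 -> f16
    if e >= 0x1F:                                     # too large: +/- infinity
        return sign | 0x7C00
    if e >= 1:                                        # normal f16 result
        m, rem, half = frac >> 42, frac & ((1 << 42) - 1), 1 << 41
        if rem > half or (rem == half and (m & 1)):   # nearest, ties to even
            m += 1
        return sign | ((e << 10) + m)                 # mantissa carry bumps exponent
    if e < -10:                                       # below half the min subnormal
        return sign                                   # flush to signed zero
    sig = frac | (1 << 52)                            # restore the implicit bit
    s = 43 - e                                        # shift into subnormal position
    m, rem, half = sig >> s, sig & ((1 << s) - 1), 1 << (s - 1)
    if rem > half or (rem == half and (m & 1)):
        m += 1                                        # may round up to min normal
    return sign | m
```

For example, 65520.0 sits exactly halfway between the f16 maximum (65504) and the next representable step (which overflows), and ties-to-even rounds it up to infinity (0x7C00).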