Skip to content

[AMDGPU] Optimize LDS DMA soft waitcnt #138802

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 13 additions & 0 deletions llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
Original file line number Diff line number Diff line change
@@ -1278,6 +1278,19 @@ bool WaitcntGeneratorPreGFX12::applyPreexistingWaitcnt(
if (Opcode == AMDGPU::S_WAITCNT) {
unsigned IEnc = II.getOperand(0).getImm();
AMDGPU::Waitcnt OldWait = AMDGPU::decodeWaitcnt(IV, IEnc);

// These pseudo waitcnt instructions are only needed to synchronize DS
// operations with direct LDS loads that use vmcnt. We can safely relax
// them when no outstanding direct LDS loads exist, even if other vmcnt
// events are pending.
if (II.getOpcode() == AMDGPU::S_WAITCNT_VMCNT_LDS_DMA_soft) {
unsigned RegNo = SQ_MAX_PGM_VGPRS + EXTRA_VGPR_LDS;
AMDGPU::Waitcnt LDSDMAWait;
ScoreBrackets.determineWait(LOAD_CNT, RegNo, LDSDMAWait);
if (LDSDMAWait.LoadCnt == ~0u)
OldWait.LoadCnt = ~0u;
}

if (TrySimplify)
ScoreBrackets.simplifyWaitcnt(OldWait);
Wait = Wait.combined(OldWait);
1 change: 1 addition & 0 deletions llvm/lib/Target/AMDGPU/SIInstrInfo.h
Original file line number Diff line number Diff line change
@@ -1010,6 +1010,7 @@ class SIInstrInfo final : public AMDGPUGenInstrInfo {
static unsigned getNonSoftWaitcntOpcode(unsigned Opcode) {
switch (Opcode) {
case AMDGPU::S_WAITCNT_soft:
case AMDGPU::S_WAITCNT_VMCNT_LDS_DMA_soft:
return AMDGPU::S_WAITCNT;
case AMDGPU::S_WAITCNT_VSCNT_soft:
return AMDGPU::S_WAITCNT_VSCNT;
1 change: 1 addition & 0 deletions llvm/lib/Target/AMDGPU/SOPInstructions.td
Original file line number Diff line number Diff line change
@@ -1608,6 +1608,7 @@ let OtherPredicates = [HasImageInsts] in {
def S_WAIT_DSCNT_soft : SOPP_Pseudo <"s_soft_wait_dscnt", (ins s16imm:$simm16), "$simm16">;
def S_WAIT_KMCNT_soft : SOPP_Pseudo <"s_soft_wait_kmcnt", (ins s16imm:$simm16), "$simm16">;
}
def S_WAITCNT_VMCNT_LDS_DMA_soft : SOPP_Pseudo <"s_soft_waitcnt" , (ins SWaitCnt:$simm16), "$simm16">;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes please! What is it? How does it interact with S_WAITCNT_VMCNT? Why does SIMemoryLegalizer want to start emitting it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we avoid baking "DMA" into the name? I don't think that terminology has been used since GFX9. Later documents just talk about "load to LDS".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to see the follow-up patch in which SIMemoryLegalizer emits those before this lands, can you create a stacked PR with the MemoryLegalizer changes @kerbowa ?
I don't understand why we need these now, and why it's needed for correctness


def S_SETHALT : SOPP_Pseudo <"s_sethalt" , (ins i32imm:$simm16), "$simm16",
[(int_amdgcn_s_sethalt timm:$simm16)]>;
170 changes: 170 additions & 0 deletions llvm/test/CodeGen/AMDGPU/lds-dma-waitcnt.mir
Original file line number Diff line number Diff line change
@@ -117,3 +117,173 @@ body: |
S_ENDPGM 0

...

# Soft waitcnt should be honored here.
# GCN-LABEL: name: buffer_load_dword_lds_ds_read_soft_wait
# GCN: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: S_WAITCNT 3952
# vmcnt(0)
# GCN-NEXT: S_BARRIER
---
name: buffer_load_dword_lds_ds_read_soft_wait
body: |
bb.0:
$m0 = S_MOV_B32 0
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4), (store (s32) into `ptr addrspace(3) poison` + 4)
S_WAITCNT_VMCNT_LDS_DMA_soft 3952
S_BARRIER
$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
S_ENDPGM 0

...

# No need for waitcnt.
# GCN-LABEL: name: buffer_store_lds_dword_ds_read_soft_wait
# GCN: BUFFER_STORE_LDS_DWORD
# GCN-NEXT: S_BARRIER
---
name: buffer_store_lds_dword_ds_read_soft_wait
body: |
bb.0:
$m0 = S_MOV_B32 0
BUFFER_STORE_LDS_DWORD $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(3) poison` + 4), (store (s32) into `ptr addrspace(1) poison` + 4)
S_WAITCNT_VMCNT_LDS_DMA_soft 3952
S_BARRIER
$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
S_ENDPGM 0

...

# Soft waitcnt should mean vmcnt(1) before the barrier and vmcnt(0) after.
# GCN-LABEL: name: series_of_buffer_load_dword_lds_ds_read_soft_wait
# GCN: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: S_WAITCNT 3953
# vmcnt(1)
# GCN-NEXT: S_BARRIER
# GCN-NEXT: S_WAITCNT 3952
# vmcnt(0)
# GCN-NEXT: DS_READ_B32_gfx9
---
name: series_of_buffer_load_dword_lds_ds_read_soft_wait
body: |
bb.0:
$m0 = S_MOV_B32 0
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 0, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison`), (store (s32) into `ptr addrspace(3) poison`)
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4), (store (s32) into `ptr addrspace(3) poison` + 4)
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 8, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 8), (store (s32) into `ptr addrspace(3) poison` + 8)
S_WAITCNT_VMCNT_LDS_DMA_soft 3953
S_BARRIER
$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
S_ENDPGM 0

...

# No waitcnt before the barrier because counter is too high
# GCN-LABEL: name: buffer_load_dword_lds_ds_read_soft_wait_redundant
# GCN: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: S_BARRIER
# GCN-NEXT: S_WAITCNT 3952
# vmcnt(0)
# GCN-NEXT: DS_READ_B32_gfx9
---
name: buffer_load_dword_lds_ds_read_soft_wait_redundant
body: |
bb.0:
$m0 = S_MOV_B32 0
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4), (store (s32) into `ptr addrspace(3) poison` + 4)
S_WAITCNT_VMCNT_LDS_DMA_soft 3953
S_BARRIER
$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
S_ENDPGM 0

...

# Combine waitcnt.
# GCN-LABEL: name: series_of_buffer_load_dword_lds_ds_read_soft_wait_repeat
# GCN: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: S_WAITCNT 3953
# vmcnt(1)
# GCN-NEXT: S_BARRIER
# GCN-NEXT: S_WAITCNT 3952
# vmcnt(0)
# GCN-NEXT: DS_READ_B32_gfx9
---
name: series_of_buffer_load_dword_lds_ds_read_soft_wait_repeat
body: |
bb.0:
$m0 = S_MOV_B32 0
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 0, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison`), (store (s32) into `ptr addrspace(3) poison`)
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4), (store (s32) into `ptr addrspace(3) poison` + 4)
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 8, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 8), (store (s32) into `ptr addrspace(3) poison` + 8)
S_WAITCNT_VMCNT_LDS_DMA_soft 3953
S_WAITCNT_VMCNT_LDS_DMA_soft 3953
S_BARRIER
S_WAITCNT_VMCNT_LDS_DMA_soft 3953
S_WAITCNT_VMCNT_LDS_DMA_soft 3953
$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
S_ENDPGM 0

...

# Merge waitcnt.
# GCN-LABEL: name: series_of_buffer_load_dword_lds_ds_read_soft_wait_merge
# GCN: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: S_WAITCNT 3953
# vmcnt(1)
# GCN-NEXT: S_BARRIER
# GCN-NEXT: S_WAITCNT 3952
# vmcnt(0)
# GCN-NEXT: DS_READ_B32_gfx9
---
name: series_of_buffer_load_dword_lds_ds_read_soft_wait_merge
body: |
bb.0:
$m0 = S_MOV_B32 0
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 0, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison`), (store (s32) into `ptr addrspace(3) poison`)
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4), (store (s32) into `ptr addrspace(3) poison` + 4)
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 8, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 8), (store (s32) into `ptr addrspace(3) poison` + 8)
S_WAITCNT_VMCNT_LDS_DMA_soft 3954
S_WAITCNT_VMCNT_LDS_DMA_soft 3953
S_BARRIER
S_WAITCNT_VMCNT_LDS_DMA_soft 3952
S_WAITCNT_VMCNT_LDS_DMA_soft 3952
$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
S_ENDPGM 0

...


# Handle the preexisting waitcnt.
# GCN-LABEL: name: series_of_buffer_load_dword_lds_ds_read_soft_wait_preexisting
# GCN: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: S_WAITCNT 0
# GCN-NEXT: BUFFER_LOAD_DWORD_LDS_IDXEN
# GCN-NEXT: S_BARRIER
# GCN-NEXT: S_WAITCNT 3952
# vmcnt(0)
# GCN-NEXT: DS_READ_B32_gfx9
---
name: series_of_buffer_load_dword_lds_ds_read_soft_wait_preexisting
body: |
bb.0:
$m0 = S_MOV_B32 0
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 0, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison`), (store (s32) into `ptr addrspace(3) poison`)
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 4, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 4), (store (s32) into `ptr addrspace(3) poison` + 4)
S_WAITCNT 0
BUFFER_LOAD_DWORD_LDS_IDXEN $vgpr0, $sgpr0_sgpr1_sgpr2_sgpr3, $sgpr4, 8, 0, 0, implicit $exec, implicit $m0 :: (load (s32) from `ptr addrspace(1) poison` + 8), (store (s32) into `ptr addrspace(3) poison` + 8)
S_WAITCNT_VMCNT_LDS_DMA_soft 3953
S_WAITCNT_VMCNT_LDS_DMA_soft 3953
S_BARRIER
S_WAITCNT_VMCNT_LDS_DMA_soft 3953
S_WAITCNT_VMCNT_LDS_DMA_soft 3953
$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32) from `ptr addrspace(3) poison`)
$vgpr0 = DS_READ_B32_gfx9 $vgpr1, 0, 0, implicit $m0, implicit $exec :: (load (s32), addrspace 3)

Should fix all the MMOs to have address space directly and drop the filler poison values

S_ENDPGM 0

...