
AMDGPU misses optimization on check-all-workitem-ids are 0 pattern #136727

Open
arsenm opened this issue Apr 22, 2025 · 13 comments

@arsenm
Contributor

arsenm commented Apr 22, 2025

The device libraries include this pattern to check if all workitem IDs are 0.

// RUN: clang -target amdgcn-amd-amdhsa -S -O3 -mcpu=gfx900 -nogpulib < %s
bool
choose_one_workgroup_workitem(void)
{
    return (__builtin_amdgcn_workitem_id_x() | __builtin_amdgcn_workitem_id_y() | __builtin_amdgcn_workitem_id_z()) == 0;
}

https://github.com/ROCm/llvm-project/blob/662bae8d56ae5ba900a81b468936f47769b0fc2d/amd/device-libs/ockl/src/cg.cl#L46

This is equivalent to checking x == 0 && y == 0 && z == 0. If we codegen this, we see:

	v_and_b32_e32 v0, 0x3ff, v31
	v_bfe_u32 v1, v31, 20, 10
	v_bfe_u32 v2, v31, 10, 10
	v_or3_b32 v0, v0, v2, v1
	v_cmp_eq_u32_e32 vcc, 0, v0
	v_cndmask_b32_e64 v0, 0, 1, vcc
	s_setpc_b64 s[30:31]

In the function ABI, the work item IDs are packed into v31. We should be able to just check v31 == 0, so this would shrink to

	v_cmp_eq_u32_e32 vcc, 0, v31
	v_cndmask_b32_e64 v0, 0, 1, vcc
	s_setpc_b64 s[30:31]
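
(For reference, here is a minimal C sketch of the layout the asm above is unpacking; the field offsets and widths are taken from the v_and_b32/v_bfe_u32 operands, and the helper names are illustrative, not from the device libs.)

#include <stdbool.h>
#include <stdint.h>

/* v31 packs the workitem IDs as {pad[1:0], Z[9:0], Y[9:0], X[9:0]}. */
static inline uint32_t id_x(uint32_t v31) { return v31 & 0x3ff; }         /* v_and_b32 0x3ff  */
static inline uint32_t id_y(uint32_t v31) { return (v31 >> 10) & 0x3ff; } /* v_bfe_u32 10, 10 */
static inline uint32_t id_z(uint32_t v31) { return (v31 >> 20) & 0x3ff; } /* v_bfe_u32 20, 10 */

/* (x | y | z) == 0 collapses to v31 == 0 once the two pad bits are
   known to be zero; otherwise it needs (v31 & 0x3fffffff) == 0. */
static inline bool all_ids_zero(uint32_t v31) { return v31 == 0; }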

@Pierre-vh Pierre-vh self-assigned this May 1, 2025
@Pierre-vh
Contributor

I want to look at this one, but I have a few questions:

  • Should we optimize this in ISel or later in some other pass like SIFoldOperands? I feel like ISel fits better.
  • Do we have any utils that can look through shift/and to find the underlying value + bits extracted?
  • There are three blocks of 10 bits in the register; I assume the remaining bits are known to be zero? Otherwise we may still end up with a mask of the lower 30 bits.

@arsenm
Contributor Author

arsenm commented May 1, 2025

  • Should we optimize this in ISel or later in some other pass like SIFoldOperands? I feel like ISel fits better.

Yes, isel. We could also potentially do this in one of the backend IR passes, e.g. AMDGPUCodeGenPrepare.

  • Do we have any utils that can look through shift/and to find the underlying value + bits extracted?

It's probably easiest to do this before the workitem IDs are lowered. We could also do something similar against mbcnt.

  • There are three blocks of 10 bits in the register; I assume the remaining bits are known to be zero? Otherwise we may still end up with a mask of the lower 30 bits.

We probably should guarantee this in the ABI. In practice I don't see how we would end up with undefined high bits, so we should assume they are 0.

@Pierre-vh
Contributor

We probably should guarantee this in the ABI. In practice I don't see how we would end up with undefined high bits, so we should assume they are 0.

Where is that documented? For this optimization to work we must guarantee that the top two bits are zero; otherwise it needs a mask of the top two bits.

@jayfoad
Contributor

jayfoad commented May 7, 2025

Where is that documented? For this optimization to work we must guarantee that the top two bits are zero; otherwise it needs a mask of the top two bits.

See for example https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna4-instruction-set-architecture.pdf, section 3.5.4, "VGPR Initialization":

Stage: CS
VGPR0: {2'b0, Z[9:0], Y[9:0], X[9:0]}
X, Y, and Z: work-item position within the workgroup. Work-item 0 is at (0,0,0).

This makes it pretty clear that the top two bits are guaranteed to be zero, at least for GFX12.

However, even without that guarantee, we can incrementally improve the codegen for a sequence that extracts two adjacent bitfields from the same source, ORs them together and compares with zero. You can pattern match that and optimize it to a single extract of the combined field.
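
(For illustration, a minimal C sketch of that rewrite on two adjacent 10-bit fields, with the extracts written in the usual shift-and-mask form; the constants come from the workitem-ID layout above, but the transform itself is generic.)

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Two adjacent bitfield extracts, ORed together and compared with zero... */
static bool split_extracts(uint32_t v) {
    uint32_t y = (v >> 10) & 0x3ff; /* bits [19:10] */
    uint32_t z = (v >> 20) & 0x3ff; /* bits [29:20] */
    return (y | z) == 0;
}

/* ...is equivalent to a single extract of the combined field [29:10]. */
static bool combined_extract(uint32_t v) {
    return ((v >> 10) & 0xfffff) == 0;
}

int main(void) {
    for (uint64_t v = 0; v <= UINT32_MAX; v += 65537) /* sparse sweep */
        assert(split_extracts((uint32_t)v) == combined_extract((uint32_t)v));
    return 0;
}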

@cdevadas
Collaborator

cdevadas commented May 7, 2025

However, even without that guarantee, we can incrementally improve the codegen for a sequence that extracts two adjacent bitfields from the same source, ORs them together and compares with zero. You can pattern match that and optimize it to a single extract of the combined field.

This seems like a better alternative, since it accommodates more cases than just our workitem IDs. Can't this be written as a generic optimization rather than one restricted to the AMDGPU target?

@Pierre-vh
Contributor

However, even without that guarantee, we can incrementally improve the codegen for a sequence that extracts two adjacent bitfields from the same source,

Yeah, that's ideally what I want to do. I'd prefer not to special-case this to the workitem ID intrinsics but to make it a general pattern instead; but then there's the challenge of looking through the shifts/ands (that extract the bits) after the lowering of the workitem ID intrinsics.

@jayfoad
Contributor

jayfoad commented May 7, 2025

but then there's the challenge of looking through the shifts/ands (that extract the bits) after the lowering of the workitem ID intrinsics

GlobalISel has G_SBFX and G_UBFX. Maybe SelectionDAG should use similar high-level bitfield extract ops.
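
(For reference, the semantics of an unsigned bitfield extract like G_UBFX, expressed as a C sketch; G_SBFX is the same but sign-extends the extracted field.)

#include <stdint.h>

/* Extract `width` bits starting at bit `offset`, zero-extended. */
static inline uint32_t ubfx32(uint32_t src, unsigned offset, unsigned width) {
    uint32_t mask = (width < 32) ? ((1u << width) - 1u) : ~0u;
    return (src >> offset) & mask;
}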

@shiltian
Contributor

shiltian commented May 7, 2025

This seems like a better alternative, since it accommodates more cases than just our workitem IDs. Can't this be written as a generic optimization rather than one restricted to the AMDGPU target?

You'd need to know that the bits can form a complete integer, rather than leaving some bits unfilled or unused. That needs some a priori knowledge.

@Pierre-vh
Contributor

There is also this pattern below which is even more annoying to optimize:

bool
choose_one_grid_workitem(void)
{
    return (__builtin_amdgcn_workitem_id_x() | __builtin_amdgcn_workgroup_id_x() |
            __builtin_amdgcn_workitem_id_y() | __builtin_amdgcn_workgroup_id_y() |
            __builtin_amdgcn_workitem_id_z() | __builtin_amdgcn_workgroup_id_z()) == 0;
}

I don't really like the idea of optimizing this directly on the intrinsic because I think it doesn't generalize well enough; it feels like the wrong place to fix it.

If I add a simple combine to get rid of the AssertZext when the source is a suitable AND, we get this DAG for the original case:

SelectionDAG has 18 nodes:
  t0: ch,glue = EntryToken
              t30: i32 = srl # D:1 t24, Constant:i32<10>
            t35: i32 = or # D:1 t24, t30
            t26: i32 = srl # D:1 t24, Constant:i32<20>
          t37: i32 = or # D:1 t35, t26
        t38: i32 = and # D:1 t37, Constant:i32<1023>
      t15: i1 = setcc # D:1 t38, Constant:i32<0>, seteq:ch
    t16: i32 = zero_extend # D:1 t15
  t19: ch,glue = CopyToReg # D:1 t0, Register:i32 $vgpr0, t16
  t24: i32,ch = CopyFromReg # D:1 t0, Register:i32 %8
  t20: ch = RET_GLUE t19, Register:i32 $vgpr0, t19:1

I'd like to find a general transform that applies here. I think we could turn (x | x >> 10) into (x - 1023) (1023 = the 10 rightmost bits set), but I don't think having a sub (or an add) here makes it any easier.
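
(For concreteness, a sketch of the equivalence the DAG above encodes; the shift amounts and the 1023 mask are taken from the dump.)

#include <stdbool.h>
#include <stdint.h>

/* ((x | x >> 10 | x >> 20) & 0x3ff) == 0 tests bits [9:0], [19:10], and
   [29:20] of x at once, so it is equivalent to testing the combined mask:
   (x & (0x3ff | 0x3ff << 10 | 0x3ff << 20)) == 0. */
static bool dag_form(uint32_t x) { return ((x | x >> 10 | x >> 20) & 0x3ff) == 0; }
static bool merged(uint32_t x)   { return (x & 0x3fffffffu) == 0; }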

Another option that crossed my mind is to simply add a new builtin that checks whether we're in workitem/workgroup (0,0,0), and just use that in the device libs. Or, more generically, add a builtin that can check any coordinate, plus a combine for the (0,0,0) case.

The device libs could then simply do something like

return __builtin_amdgcn_check_workitem_id(0, 0, 0);

That'd provide an optimal way to check for specific work-item IDs, and if we know more about the intent we can optimize more freely. When all the coordinates are constants, we can precalculate the expected value of the register, for example.

@arsenm
Contributor Author

arsenm commented May 8, 2025

This should not be a special builtin. That's worse than just special-casing the intrinsic in the optimization, and it also requires source changes.

@jayfoad
Contributor

jayfoad commented May 8, 2025

So to handle this as a DAG combine you'd need to match patterns like:

((x << C1 | x << C2 | ...) & C) == 0  -->  (x & (C >> C1 | C >> C2 | ...)) == 0

(And == could be != instead, and the shifts could be in either direction, and one of the shift amounts might be zero.)

Note that this is still beneficial even if there is only one shift, and we don't seem to have a DAG combine even for this case:

((x << C1) & C) == 0  -->  (x & (C >> C1)) == 0
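
(A quick brute-force sanity check of the identity in C; the particular C1, C2, and mask below are arbitrary picks for the test, not values from the issue.)

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

int main(void) {
    const unsigned C1 = 10, C2 = 20;  /* arbitrary shift amounts */
    const uint32_t C = 0x3ff00000u;   /* arbitrary mask          */
    for (uint64_t i = 0; i <= UINT32_MAX; i += 4099) { /* sparse sweep */
        uint32_t x = (uint32_t)i;
        bool lhs = (((x << C1) | (x << C2)) & C) == 0;
        bool rhs = (x & ((C >> C1) | (C >> C2))) == 0;
        assert(lhs == rhs);
    }
    return 0;
}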

@jayfoad
Contributor

jayfoad commented May 9, 2025

So to handle this as a DAG combine you'd need to match patterns like:

((x << C1 | x << C2 | ...) & C) == 0  -->  (x & (C >> C1 | C >> C2 | ...)) == 0

See also #139165, which matches a vaguely similar pattern.
