
AMDGPU misses optimization on check-all-workitem-ids are 0 pattern #136727

Open
arsenm opened this issue Apr 22, 2025 · 13 comments

@arsenm
Contributor

arsenm commented Apr 22, 2025

The device libraries include this pattern to check if all workitem IDs are 0.

// RUN: clang -target amdgcn-amd-amdhsa -S -O3 -mcpu=gfx900 -nogpulib < %s
bool
choose_one_workgroup_workitem(void)
{
    return (__builtin_amdgcn_workitem_id_x() | __builtin_amdgcn_workitem_id_y() | __builtin_amdgcn_workitem_id_z()) == 0;
}

https://github.com/ROCm/llvm-project/blob/662bae8d56ae5ba900a81b468936f47769b0fc2d/amd/device-libs/ockl/src/cg.cl#L46

This is equivalent to checking x == 0 && y == 0 && z == 0. If we codegen this, we see:

	v_and_b32_e32 v0, 0x3ff, v31
	v_bfe_u32 v1, v31, 20, 10
	v_bfe_u32 v2, v31, 10, 10
	v_or3_b32 v0, v0, v2, v1
	v_cmp_eq_u32_e32 vcc, 0, v0
	v_cndmask_b32_e64 v0, 0, 1, vcc
	s_setpc_b64 s[30:31]

In the function ABI, the work item IDs are packed into v31. We should be able to just check v31 == 0, so this would shrink to

	v_cmp_eq_u32_e32 vcc, 0, v31
	v_cndmask_b32_e64 v0, 0, 1, vcc
	s_setpc_b64 s[30:31]
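
(For reference, here is a minimal C sketch of the layout the asm above is unpacking; the field offsets and widths are taken from the v_and_b32/v_bfe_u32 operands, and the helper names are illustrative, not from the device libs.)

#include <stdbool.h>
#include <stdint.h>

/* v31 packs the workitem IDs as {pad[1:0], Z[9:0], Y[9:0], X[9:0]}. */
static inline uint32_t id_x(uint32_t v31) { return v31 & 0x3ff; }         /* v_and_b32 0x3ff  */
static inline uint32_t id_y(uint32_t v31) { return (v31 >> 10) & 0x3ff; } /* v_bfe_u32 10, 10 */
static inline uint32_t id_z(uint32_t v31) { return (v31 >> 20) & 0x3ff; } /* v_bfe_u32 20, 10 */

/* (x | y | z) == 0 collapses to v31 == 0 once the two pad bits are
   known to be zero; otherwise it needs (v31 & 0x3fffffff) == 0. */
static inline bool all_ids_zero(uint32_t v31) { return v31 == 0; }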

@Pierre-vh Pierre-vh self-assigned this May 1, 2025
@Pierre-vh
Contributor

I want to look at this one, but I have a few questions:

  • Should we optimize this in ISel or later in some other pass like SIFoldOperands? I feel like ISel fits better.
  • Do we have any utils that can look through shift/and to find the underlying value + bits extracted?
  • There are three blocks of 10 bits in the register; I assume the remaining bits are known to be zero? Otherwise we may still end up with a mask of the lower 30 bits.

@arsenm
Contributor Author

arsenm commented May 1, 2025

  • Should we optimize this in ISel or later in some other pass like SIFoldOperands? I feel like ISel fits better.

Yes, isel. We could also potentially do this in one of the backend IR passes, e.g. AMDGPUCodeGenPrepare.

  • Do we have any utils that can look through shift/and to find the underlying value + bits extracted?

It's probably easiest to do this before the workitem IDs are lowered. We could also do something similar against mbcnt.

  • There are three blocks of 10 bits in the register; I assume the remaining bits are known to be zero? Otherwise we may still end up with a mask of the lower 30 bits.

We probably should guarantee this in the ABI. In practice I don't see how we would end up with undefined high bits, so we should assume they are 0.

@Pierre-vh
Contributor

We probably should guarantee this in the ABI. In practice I don't see how we would end up with undefined high bits, so we should assume they are 0.

Where is that documented? For this optimization to work we must guarantee that the top two bits are zero; otherwise it needs a mask of the top two bits.

@jayfoad
Contributor

jayfoad commented May 7, 2025

Where is that documented? For this optimization to work we must guarantee that the top two bits are zero; otherwise it needs a mask of the top two bits.

See for example https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna4-instruction-set-architecture.pdf, section 3.5.4, "VGPR Initialization":

Stage: CS
VGPR0: {2'b0, Z[9:0], Y[9:0], X[9:0]}
X, Y, and Z: work-item position within the workgroup. Work-item 0 is at (0,0,0).

This makes it pretty clear that the top two bits are guaranteed to be zero, at least for GFX12.

However, even without that guarantee, we can incrementally improve the codegen for a sequence that extracts two adjacent bitfields from the same source, ORs them together and compares with zero. You can pattern match that and optimize it to a single extract of the combined field.
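
(For illustration, a minimal C sketch of that rewrite on two adjacent 10-bit fields, with the extracts written in the usual shift-and-mask form; the constants come from the workitem-ID layout above, but the transform itself is generic.)

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Two adjacent bitfield extracts, ORed together and compared with zero... */
static bool split_extracts(uint32_t v) {
    uint32_t y = (v >> 10) & 0x3ff; /* bits [19:10] */
    uint32_t z = (v >> 20) & 0x3ff; /* bits [29:20] */
    return (y | z) == 0;
}

/* ...is equivalent to a single extract of the combined field [29:10]. */
static bool combined_extract(uint32_t v) {
    return ((v >> 10) & 0xfffff) == 0;
}

int main(void) {
    for (uint64_t v = 0; v <= UINT32_MAX; v += 65537) /* sparse sweep */
        assert(split_extracts((uint32_t)v) == combined_extract((uint32_t)v));
    return 0;
}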

@cdevadas
Collaborator

cdevadas commented May 7, 2025

However, even without that guarantee, we can incrementally improve the codegen for a sequence that extracts two adjacent bitfields from the same source, ORs them together and compares with zero. You can pattern match that and optimize it to a single extract of the combined field.

This seems like a better alternative, since it accommodates more cases than just our workitem IDs. Can't this be written as a generic optimization rather than one restricted to the AMDGPU target?

@Pierre-vh
Contributor

However, even without that guarantee, we can incrementally improve the codegen for a sequence that extracts two adjacent bitfields from the same source,

Yeah, that's ideally what I want to do. I'd prefer not to special-case this to the workitem ID intrinsics but to make it a general pattern instead; but then there's the challenge of looking through the shifts/ands (that extract the bits) after the lowering of the workitem ID intrinsics.

@jayfoad
Contributor

jayfoad commented May 7, 2025

but then there's the challenge of looking through the shifts/ands (that extract the bits) after the lowering of the workitem ID intrinsics

GlobalISel has G_SBFX and G_UBFX. Maybe SelectionDAG should use similar high-level bitfield extract ops.
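
(For reference, the semantics of an unsigned bitfield extract like G_UBFX, expressed as a C sketch; G_SBFX is the same but sign-extends the extracted field.)

#include <stdint.h>

/* Extract `width` bits starting at bit `offset`, zero-extended. */
static inline uint32_t ubfx32(uint32_t src, unsigned offset, unsigned width) {
    uint32_t mask = (width < 32) ? ((1u << width) - 1u) : ~0u;
    return (src >> offset) & mask;
}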

@shiltian
Contributor

shiltian commented May 7, 2025

This seems like a better alternative, since it accommodates more cases than just our workitem IDs. Can't this be written as a generic optimization rather than one restricted to the AMDGPU target?

You'd need to know that the bits can form a complete integer, rather than leaving some bits unfilled or unused. That needs some a priori knowledge.

@Pierre-vh
Contributor

There is also this pattern below which is even more annoying to optimize:

bool
choose_one_grid_workitem(void)
{
    return (__builtin_amdgcn_workitem_id_x() | __builtin_amdgcn_workgroup_id_x() |
            __builtin_amdgcn_workitem_id_y() | __builtin_amdgcn_workgroup_id_y() |
            __builtin_amdgcn_workitem_id_z() | __builtin_amdgcn_workgroup_id_z()) == 0;
}

I don't really like the idea of optimizing this directly on the intrinsic because I think it doesn't generalize well enough; it feels like the wrong place to fix it.

If I add a simple combine to get rid of the AssertZext when the source is a suitable AND, we get this DAG for the original case:

SelectionDAG has 18 nodes:
  t0: ch,glue = EntryToken
              t30: i32 = srl # D:1 t24, Constant:i32<10>
            t35: i32 = or # D:1 t24, t30
            t26: i32 = srl # D:1 t24, Constant:i32<20>
          t37: i32 = or # D:1 t35, t26
        t38: i32 = and # D:1 t37, Constant:i32<1023>
      t15: i1 = setcc # D:1 t38, Constant:i32<0>, seteq:ch
    t16: i32 = zero_extend # D:1 t15
  t19: ch,glue = CopyToReg # D:1 t0, Register:i32 $vgpr0, t16
  t24: i32,ch = CopyFromReg # D:1 t0, Register:i32 %8
  t20: ch = RET_GLUE t19, Register:i32 $vgpr0, t19:1

I'd like to find a general transform that applies here. I think we could turn (x | x >> 10) into (x - 1023) (1023 = the 10 rightmost bits set), but I don't think having a sub (or an add) here makes it any easier.
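
(For concreteness, a sketch of the equivalence the DAG above encodes; the shift amounts and the 1023 mask are taken from the dump.)

#include <stdbool.h>
#include <stdint.h>

/* ((x | x >> 10 | x >> 20) & 0x3ff) == 0 tests bits [9:0], [19:10], and
   [29:20] of x at once, so it is equivalent to testing the combined mask:
   (x & (0x3ff | 0x3ff << 10 | 0x3ff << 20)) == 0. */
static bool dag_form(uint32_t x) { return ((x | x >> 10 | x >> 20) & 0x3ff) == 0; }
static bool merged(uint32_t x)   { return (x & 0x3fffffffu) == 0; }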

Another option that crossed my mind is to simply add a new builtin that checks whether we're in workitem/workgroup (0,0,0), and just use that in the device libs. Or, more generically, add a builtin that can check any coordinate, plus a combine for the (0,0,0) case.

The device libs could then simply do something like

return __builtin_amdgcn_check_workitem_id(0, 0, 0);

That'd provide an optimal way to check for specific work-item IDs, and if we know more about the intent we can optimize more freely. When all the coordinates are constants, we can precalculate the expected value of the register, for example.

@arsenm
Contributor Author

arsenm commented May 8, 2025

This should not be a special builtin. That's worse than just special-casing the intrinsic in the optimization, and it also requires source changes.

@jayfoad
Contributor

jayfoad commented May 8, 2025

So to handle this as a DAG combine you'd need to match patterns like:

((x << C1 | x << C2 | ...) & C) == 0  -->  (x & (C >> C1 | C >> C2 | ...)) == 0

(And == could be != instead, and the shifts could be in either direction, and one of the shift amounts might be zero.)

Note that this is still beneficial even if there is only one shift, and we don't seem to have a DAG combine even for this case:

((x << C1) & C) == 0  -->  (x & (C >> C1)) == 0
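
(A quick brute-force sanity check of the identity in C; the particular C1, C2, and mask below are arbitrary picks for the test, not values from the issue.)

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

int main(void) {
    const unsigned C1 = 10, C2 = 20;  /* arbitrary shift amounts */
    const uint32_t C = 0x3ff00000u;   /* arbitrary mask          */
    for (uint64_t i = 0; i <= UINT32_MAX; i += 4099) { /* sparse sweep */
        uint32_t x = (uint32_t)i;
        bool lhs = (((x << C1) | (x << C2)) & C) == 0;
        bool rhs = (x & ((C >> C1) | (C >> C2))) == 0;
        assert(lhs == rhs);
    }
    return 0;
}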

@jayfoad
Contributor

jayfoad commented May 9, 2025

So to handle this as a DAG combine you'd need to match patterns like:

((x << C1 | x << C2 | ...) & C) == 0  -->  (x & (C >> C1 | C >> C2 | ...)) == 0

See also #139165, which matches a vaguely similar pattern.
