AMDGPU misses optimization on check-all-workitem-ids are 0 pattern #136727
Comments
@llvm/issue-subscribers-backend-amdgpu Author: Matt Arsenault (arsenm)
The device libraries include this pattern to check if all workitem IDs are 0.
This is equivalent to checking x == 0 && y == 0 && z == 0. If we codegen this, we see:
In the function ABI, the work item IDs are packed into v31. We should be able to just check v31 == 0, so this would shrink to
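The claimed equivalence can be sanity-checked on the host. The sketch below is a plain C++ model, not the device-library source or the generated ISA. It assumes the packed layout is X in bits 0-9, Y in bits 10-19 and Z in bits 20-29, with bits 30-31 zero; all function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout of the packed work-item ID register (v31 in the function ABI):
// X in bits [9:0], Y in bits [19:10], Z in bits [29:20], bits [31:30] zero.
constexpr uint32_t kIdMask = 0x3FFu;

constexpr uint32_t pack_ids(uint32_t x, uint32_t y, uint32_t z) {
  return (x & kIdMask) | ((y & kIdMask) << 10) | ((z & kIdMask) << 20);
}

// What the source pattern computes: extract each ID, OR them, compare with 0.
constexpr bool all_ids_zero_unpacked(uint32_t packed) {
  uint32_t x = packed & kIdMask;
  uint32_t y = (packed >> 10) & kIdMask;
  uint32_t z = (packed >> 20) & kIdMask;
  return (x | y | z) == 0;
}

// The proposed form: a single compare of the whole register.
constexpr bool all_ids_zero_packed(uint32_t packed) { return packed == 0; }

// Compare both forms over a sample of ID triples.
inline void check_forms_agree() {
  const uint32_t samples[] = {0, 1, 2, 511, 512, 1023};
  for (uint32_t x : samples)
    for (uint32_t y : samples)
      for (uint32_t z : samples) {
        uint32_t p = pack_ids(x, y, z);
        assert(all_ids_zero_unpacked(p) == all_ids_zero_packed(p));
      }
}
```

If bits 30-31 could ever be non-zero, the single compare would first need a mask of `0x3FFFFFFF`, which is why the guarantee on the top two bits matters.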
I want to look at this one, but I have a few questions:
Yes, isel. We could also potentially do this in one of the backend IR passes, e.g. AMDGPUCodeGenPrepare.
It's probably easiest to do this before the workitem IDs are lowered. We could also do something similar for mbcnt. We should probably guarantee this in the ABI. In practice I don't see how we would end up with undefined high bits, so we should assume they are 0.
Where is that documented? For this optimization to work, we must guarantee that the top two bits are zero.
See for example https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna4-instruction-set-architecture.pdf section 3.5.4. "VGPR Initialization":
This seems pretty clear that the top two bits are guaranteed to be zero, at least for GFX12. However, even without that guarantee, we can incrementally improve the codegen for a sequence that extracts two adjacent bitfields from the same source, ORs them together and compares with zero. You can pattern match that and optimize it to a single extract of the combined field.
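As a rough illustration of that incremental step, the following host-side C++ sketch models the two shapes. The helper `ubfx` and the function names are made up for this example and are not LLVM APIs. It only shows that the rewrite preserves the comparison result.

```cpp
#include <cassert>
#include <cstdint>

// Unsigned bitfield extract: `width` bits of `v` starting at bit `offset`.
// Requires width > 0 and offset + width <= 32.
constexpr uint32_t ubfx(uint32_t v, unsigned offset, unsigned width) {
  uint32_t mask = width == 32 ? ~0u : ((1u << width) - 1u);
  return (v >> offset) & mask;
}

// Original shape: two adjacent fields extracted separately, ORed, compared with 0.
constexpr bool two_extracts_zero(uint32_t v, unsigned off, unsigned w0, unsigned w1) {
  return (ubfx(v, off, w0) | ubfx(v, off + w0, w1)) == 0;
}

// Combined shape: one extract covering both fields.
constexpr bool merged_extract_zero(uint32_t v, unsigned off, unsigned w0, unsigned w1) {
  return ubfx(v, off, w0 + w1) == 0;
}

// Compare both shapes on values that set bits inside, between and outside the fields.
inline void check_merge() {
  const uint32_t values[] = {0u, 1u, 0x200u, 0x400u, 0x80000u,
                             0x100000u, 0xC0000000u, 0xFFFFFFFFu};
  for (uint32_t v : values) {
    assert(two_extracts_zero(v, 0, 10, 10) == merged_extract_zero(v, 0, 10, 10));
    assert(two_extracts_zero(v, 10, 10, 10) == merged_extract_zero(v, 10, 10, 10));
  }
}
```

Applying the merge twice to the three 10-bit ID fields gives a single 30-bit extract, which works even if bits 30-31 are not known to be zero.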
This seems like a better alternative, since it covers more cases than just our workitem IDs. Can't this be written as a common optimization rather than restricting it to the AMDGPU target?
Yeah, that's what I'd ideally want to do. I'd prefer not to special-case the workgroup intrinsic and instead make this a general pattern. The challenge is then looking through the shifts/ands that extract the bits after the workgroup intrinsic has been lowered.
GlobalISel has G_SBFX and G_UBFX. Maybe SelectionDAG should use similar high-level bitfield extract ops.
You'd need to know that the bits can form a complete integer, rather than leaving some bits unfilled or unused. That needs some a priori knowledge.
There is also this pattern below which is even more annoying to optimize:
I don't really like the idea of optimizing this directly on the intrinsic because I think it doesn't generalize well enough; it feels like the wrong place to fix it. If I add a simple combine to get rid of
I'd like to find a general transform that applies here. I think we could do
Another option that crossed my mind is whether we could simply add a new builtin that checks if we're in the workitem/workgroup
Then device lib could simply do something like
That'd provide an optimal way to check for specific work-item IDs, and if we know more about the intent we can optimize this more freely. For example, when all the coords are constants we can precalculate the expected value of the register.
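To make the constant case concrete, here is a minimal host-side sketch of what precalculating the expected register value could look like. It is not a proposed builtin or any existing API: the names are hypothetical, and it assumes the same packed layout (X in bits 0-9, Y in bits 10-19, Z in bits 20-29) with bits 30-31 zero.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kFieldMask = 0x3FFu;

// Assumed layout: X in bits [9:0], Y in bits [19:10], Z in bits [29:20].
// Each coordinate must be below 1024.
constexpr uint32_t expected_packed(uint32_t cx, uint32_t cy, uint32_t cz) {
  return cx | (cy << 10) | (cz << 20);
}

// Straightforward form: three extracts and three compares.
constexpr bool is_workitem_unpacked(uint32_t packed, uint32_t cx, uint32_t cy,
                                    uint32_t cz) {
  return (packed & kFieldMask) == cx && ((packed >> 10) & kFieldMask) == cy &&
         ((packed >> 20) & kFieldMask) == cz;
}

// Precalculated form: one compare against a constant that folds at compile time.
// Only valid under the assumption that bits [31:30] of `packed` are zero.
constexpr bool is_workitem_packed(uint32_t packed, uint32_t cx, uint32_t cy,
                                  uint32_t cz) {
  return packed == expected_packed(cx, cy, cz);
}

// Compare both forms over a sample of packed ID values.
inline void check_constant_compare() {
  const uint32_t ids[] = {0, 1, 5, 1023};
  for (uint32_t x : ids)
    for (uint32_t y : ids)
      for (uint32_t z : ids) {
        uint32_t p = expected_packed(x, y, z);
        assert(is_workitem_unpacked(p, 1, 5, 0) == is_workitem_packed(p, 1, 5, 0));
        assert(is_workitem_packed(p, x, y, z));
      }
}
```

The all-zero check from the original issue is the special case where the expected constant is 0.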
This should not be a special builtin. That's worse than just special casing the intrinsic in the optimization and then requires source changes.
So to handle this as a DAG combine you'd need to match patterns like:
(And
Note that this is still beneficial even if there is only one shift, and we don't seem to have a DAG combine even for this case:
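The exact patterns from this comment did not survive, so the following is only an assumed shape for the single-shift case: `((x >> c) & m) == 0` rewritten to `(x & (m << c)) == 0`, which drops the shift. The C++ sketch below models both forms on the host with illustrative names. It is not SelectionDAG code.

```cpp
#include <cassert>
#include <cstdint>

// Shape before the combine: logical shift right, mask, compare with zero.
// Requires c < 32.
constexpr bool shifted_mask_is_zero(uint32_t x, unsigned c, uint32_t m) {
  return ((x >> c) & m) == 0;
}

// Shape after the combine: the shift is folded into the mask constant.
// Mask bits that m << c pushes past bit 31 only ever lined up with the zeros
// shifted in by x >> c, so dropping them does not change the result.
constexpr bool folded_mask_is_zero(uint32_t x, unsigned c, uint32_t m) {
  return (x & (m << c)) == 0;
}

// Compare both shapes over a small grid of values, shift amounts and masks.
inline void check_fold() {
  const uint32_t xs[] = {0u, 1u, 0x3FFu, 0x400u, 0x100000u, 0x80000000u, 0xFFFFFFFFu};
  const unsigned shifts[] = {0, 1, 10, 20, 31};
  const uint32_t masks[] = {0x1u, 0x3FFu, 0x80000000u, 0xFFFFFFFFu};
  for (uint32_t x : xs)
    for (unsigned c : shifts)
      for (uint32_t m : masks)
        assert(shifted_mask_is_zero(x, c, m) == folded_mask_is_zero(x, c, m));
}
```

With constant `c` and `m`, the folded mask is itself a constant, so the rewritten form needs only an AND and a compare.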
See also #139165 which matches a vaguely similar pattern.
The pattern comes from the device libraries:
https://github.com/ROCm/llvm-project/blob/662bae8d56ae5ba900a81b468936f47769b0fc2d/amd/device-libs/ockl/src/cg.cl#L46