-
-
Notifications
You must be signed in to change notification settings - Fork 22.8k
Metal: Use image atomic operations on supported Apple hardware #108028
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
8f8de15
to
d262baa
Compare
d262baa
to
c8ebe5a
Compare
Fixed missing constant binding, so it should pass CI/CD now |
thirdparty/README.md
Outdated
@@ -990,6 +990,10 @@ Files extracted from upstream source: | |||
Versions of this SDK do not have to match the `vulkan` section, as this SDK is required | |||
to generate Metal source from Vulkan SPIR-V. | |||
|
|||
Patches: | |||
|
|||
- `0001-fix-argument-buffer-access.patch` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- `0001-fix-argument-buffer-access.patch` | |
- `0001-fix-argument-buffer-access.patch` (GH-108028) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed the patch and updated SPIRV-Cross, as it has been fixed
c8ebe5a
to
6d8947b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested locally, it works as expected. Code looks good to me.
Testing project: test-atomic-performance.zip
However, I can't reproduce any performance improvement on the testing project linked above (a very simple scene with 4,096 FogVolumes using the fallback material). In fact, I get less FPS now:
Metal master |
MoltenVK master |
Metal PR |
---|---|---|
95 FPS | 91 FPS | 80 FPS |
For reference, the scene with volumetric fog disabled (and no FogVolumes) gives me 768 FPS.
Mac specifications
- MacBook Pro 16 2024
- SoC: M4 Max 16-core CPU, 40-core GPU
- RAM: 48 GB
- SSD: 1 TB
- OS: macOS 15.5
@Calinou I have been seeing the same thing; and initially I thought it was a difference between 4.4 and 4.5 in general, but I'm also seeing the difference between main and this branch. I added an environment variable so that I could change between buffers and textures on the fly and I'm seeing 78fps with atomic operations on textures and 92fps with atomic operations on buffers. π€·π» I enabled the Vulkan memory model in the Fog shaders, to remove the
Whilst it did remove the Taking a step back, I'm now not to sure why the storage buffer version would be slower, given it also uses atomic operations? It seems that there may be different behaviour regarding coherency between textures and buffers that may be the difference. |
More data to support the difference. This image is a GPU capture using buffer atomic operations. The The following image is a GPU capture using image atomic operations. The popup shows the breakdown of instructions for the line that performs the |
@clayjohn / @Calinou Why is the atomic fetches needed at all? All the read / write operations use All the dispatch calls are to a single command buffer, so they can't execute in parallel either. Instead of: imageAtomicAdd(density_only_map, pos, final_density); Couldn't it be: imageStore(density_only_map, pos, imageLoad(density_only_map, pos) + final_density); |
@stuartcarnie on AMD all the dispatches are processed in parellel. Using load/store only works if you put a barrier between each dispatch I'm surprised to hear that the dispatches aren't executed in parallel though. On Apple silicon they definitely are for vertex and fragment workloads |
Thanks! You sent me down a long research rabbit hold which eventually lead me to finding that the default behaviour of a Metal Given Godot's graphics API is built on Vulkan with those assumptions of memory model and execution, I can safely switch to the concurrent dispatch type. In doing so, my FPS changed as follows:
We can see the effects in the GPU profile too. The following is an image of the timeline of the volumetric fog shader dispatch calls. Note all the gaps: Now look at the following with concurrent execution enabled: |
6d8947b
to
a435646
Compare
if (OS::get_singleton()->get_environment("GODOT_MTL_DISABLE_IMAGE_ATOMICS") == "1") { | ||
features.supports_image_atomic_32_bit = false; | ||
features.supports_image_atomic_64_bit = false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allow image atomics to be disabled, which is useful for comparing performance.
Note
You will need to delete the Volumetric shaders from the cache, as they require regeneration
drivers/metal/metal_objects.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two changes to the MDCommandBuffer
:
- Like uniform state, push constants are captured and only bound to the pipeline state when the render or compute shader is executed.
- Compute shaders now capture uniform state, like render shaders, so that they only bind changed uniforms sets, which reduces the number of API calls to Metal by 1,000s of calls, especially for something like the Fog stress tests, where compute dispatch and uniform sets are bound for every fog volume.
- Further, compute shaders are only created if dispatched, which removes some "empty encoder" warnings from Metal Graphics Debugger in Xcode
@@ -2417,6 +2445,7 @@ bool isArrayTexture(MTLTextureType p_type) { | |||
|
|||
MTLComputePipelineDescriptor *desc = [MTLComputePipelineDescriptor new]; | |||
desc.computeFunction = function; | |||
desc.label = conv::to_nsstring(shader->name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really handy for debugging, as now we can see all the shader names in the Metal Graphics Debugger in Xcode
RD::get_singleton()->set_resource_name(emissive_map, "Fog emissive map"); | ||
} else { | ||
tf.format = RD::DATA_FORMAT_R32_UINT; | ||
tf.usage_bits = RD::TEXTURE_USAGE_STORAGE_BIT | RD::TEXTURE_USAGE_CAN_COPY_TO_BIT | RD::TEXTURE_USAGE_STORAGE_ATOMIC_BIT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We added TEXTURE_USAGE_STORAGE_ATOMIC_BIT
as the engine uses this to validate the texture format supports image atomics and can inform drivers, like Metal, that the texture will be used for atomic image operations
#pragma use_vulkan_memory_model | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This informs SPIRV-Cross to use the vulkan memory model when generating SPIRV, which includes more fine-grained control of coherency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't find much about this online. Do you have a link to somewhere with a clear explanation of what behaviour changes as a result of enabling this? Discussions on gpuweb/gpuweb#2377 seem to indicate that the changes made to the SPIRV as a result of this pragma are basically ineffective with Vulkan without enabling some Vulkan memory model stuff on the API side.
Does this benefit SPIRV-Cross/our MSL transpilation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is what I found too. It only adds some directives inside the SPIR-V, and for the Vulkan Driver, the Vulkan memory model is already assumed. So this informs SPIR-V translation that the GLSL is using the Vulkan Memory Model.
a435646
to
283cacc
Compare
@Calinou for some reason your test-atomic-performance.zip returns a 404. I would run my test, as it ensures the fog volume has density, to force some atomic usage. EditWell, when I pasted it into this comment, it worked π |
I tested performance again on the latest revision of this PR with the MRP from #108028 (review):
However, interestingly, I can't exit Godot in the MRP anymore with this PR even though I can in
darksylinc encountered the same issue in #108127 (comment). I'm not sure what's causing it. |
Which MRP are you using? I ran my test project on my M4 and M1 MacBook Pros and was able to quit using the traffic lights and CTRL+C π€ |
That is a pretty big jump! |
@clayjohn (or anyone else) can you reproduce @Calinou's issue:
I've tried it on two different machines and I am unable to reproduce and can exit fine either closing the window or pressing CTRL+C:
With |
283cacc
to
5455d64
Compare
UpdateFalse alarm, it was my mistake as I forgot to include the change to It meant that the semaphore was not signalled when the command buffer was |
5455d64
to
e5b8cc2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works great on my end now π
Thanks! Appreciate your patience too! |
6f678c0
to
59a2f43
Compare
Co-authored-by: A Thousand Ships <[email protected]>
59a2f43
to
5230f6c
Compare
Thanks @AThousandShips β all your changes are incorporated! |
Thanks! |
Metal is capable of atomic image operations on supported hardware, so this PR enables it at runtime when the support is available.
ποΈ Changes
RenderingDevice
π Performance
4.5 / main with M4 MacBook Pro Max
Using the Fog Stress project with
--fog-volumes 5000
The following chart shows FPS percentiles and GPU time percentiles. This PR is almost 2x!
The following chart is GPU frame time, over time, and also shows a very consistent 2x improvement.
Finally, note that atomic operations on images or buffers appears to be similar.
Godot Reflection (Tesseract) Benchmark
Also worth noting that other benchmarks show a GPU time improvement too.
4.5 / main with M1 MacBook Pro Max
Using the Fog Stress project with
--fog-volumes 5000
The following chart shows FPS percentiles and GPU time percentiles. The difference is more significant.
The following chart is GPU frame time, over time, and also shows a very consistent 7x improvement.
Why is M1 faster in these tests?
My M4 is running multiple displays, which is affecting FPS, the M1 is only running built-in display.
π₯Ό Testing
fog_stress.zip
Note
To disable image atomics, you can specify the following environment variable when launching Godot:
Volumetric Fog from Godot Demo Projects
Important
This would supersede @akien-mga's update of SPIRV-Cross in #107773, as I had to patch it to fix the MSL generation. A
.patch
file is included and an issue opened for the SPIRV-Cross project.