Skip to content

Metal: Use image atomic operations on supported Apple hardware #108028

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 7, 2025

Conversation

stuartcarnie
Copy link
Contributor

@stuartcarnie stuartcarnie commented Jun 27, 2025

Metal is capable of atomic image operations on supported hardware, so this PR enables it at runtime when the support is available.

πŸ—’οΈ Changes

  • Added image atomics feature to RenderingDevice
  • πŸš€ Changed default compute encoder to execute dispatch calls concurrently
  • Updated SPIRV-Cross to include atomic fixes
  • Reduce CPU overhead in Metal driver by avoiding unnecessary API calls
    • Reduces API warnings when running Godot under Xcode Metal Debugger for unnecessary state changes

πŸš€ Performance

4.5 / main with M4 MacBook Pro Max

Using the Fog Stress project with --fog-volumes 5000

  • Atomics PR (enabled): this PR with image atomics enabled for fog volumes (default)
  • Atomics PR (disabled): this PR with image atomics disabled via environment variable
  • Beta 1

The following chart shows FPS percentiles and GPU time percentiles. This PR is almost 2x!

percentile_comparison

The following chart is GPU frame time, over time, and also shows a very consistent 2x improvement.

time_series_gpu_time_ms

Finally, note that atomic operations on images or buffers appears to be similar.

Godot Reflection (Tesseract) Benchmark

Also worth noting that other benchmarks show a GPU time improvement too.

percentile_comparison

4.5 / main with M1 MacBook Pro Max

Using the Fog Stress project with --fog-volumes 5000

The following chart shows FPS percentiles and GPU time percentiles. The difference is more significant.

percentile_comparison

The following chart is GPU frame time, over time, and also shows a very consistent 7x improvement.

time_series_gpu_time_ms

Why is M1 faster in these tests?

My M4 is running multiple displays, which is affecting FPS, the M1 is only running built-in display.

πŸ₯Ό Testing

fog_stress.zip

Note

To disable image atomics, you can specify the following environment variable when launching Godot:

GODOT_METAL_DISABLE_IMAGE_ATOMICS=1

Volumetric Fog from Godot Demo Projects

  • βœ… MacBook Pro M4
  • βœ… iPhone 12 Pro (Apple6 GPU)

Important

This would supersede @akien-mga's update of SPIRV-Cross in #107773, as I had to patch it to fix the MSL generation. A .patch file is included and an issue opened for the SPIRV-Cross project.

@stuartcarnie
Copy link
Contributor Author

Fixed missing constant binding, so it should pass CI/CD now

@@ -990,6 +990,10 @@ Files extracted from upstream source:
Versions of this SDK do not have to match the `vulkan` section, as this SDK is required
to generate Metal source from Vulkan SPIR-V.

Patches:

- `0001-fix-argument-buffer-access.patch`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `0001-fix-argument-buffer-access.patch`
- `0001-fix-argument-buffer-access.patch` (GH-108028)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the patch and updated SPIRV-Cross, as it has been fixed

@stuartcarnie stuartcarnie force-pushed the apple_image_atomics branch from c8ebe5a to 6d8947b Compare June 28, 2025 23:26
Copy link
Member

@Calinou Calinou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested locally, it works as expected. Code looks good to me.

Testing project: test-atomic-performance.zip

However, I can't reproduce any performance improvement on the testing project linked above (a very simple scene with 4,096 FogVolumes using the fallback material). In fact, I get less FPS now:

Metal master MoltenVK master Metal PR
95 FPS 91 FPS 80 FPS

For reference, the scene with volumetric fog disabled (and no FogVolumes) gives me 768 FPS.

Mac specifications
  • MacBook Pro 16 2024
  • SoC: M4 Max 16-core CPU, 40-core GPU
  • RAM: 48 GB
  • SSD: 1 TB
  • OS: macOS 15.5

@stuartcarnie
Copy link
Contributor Author

stuartcarnie commented Jul 1, 2025

However, I can't reproduce any performance improvement on the testing project linked above (a very simple scene with 4,096 FogVolumes using the fallback material). In fact, I get less FPS now

@Calinou I have been seeing the same thing; and initially I thought it was a difference between 4.4 and 4.5 in general, but I'm also seeing the difference between main and this branch.

I added an environment variable so that I could change between buffers and textures on the fly and I'm seeing 78fps with atomic operations on textures and 92fps with atomic operations on buffers. 🀷🏻

I enabled the Vulkan memory model in the Fog shaders, to remove the memory_coherency_device qualifier from the texture bindings:

#pragma use_vulkan_memory_model

Whilst it did remove the memory_coherence_device from the texture biding in the shader, it didn't make a difference to FPS.

Taking a step back, I'm now not to sure why the storage buffer version would be slower, given it also uses atomic operations? It seems that there may be different behaviour regarding coherency between textures and buffers that may be the difference.

@stuartcarnie
Copy link
Contributor Author

More data to support the difference.

This image is a GPU capture using buffer atomic operations. The atomic_fetch_add_explicit isn't even showing up.

CleanShot 2025-07-02 at 07 22 12@2x

The following image is a GPU capture using image atomic operations. The popup shows the breakdown of instructions for the line that performs the atomic_fetch_and_add on the texture. It costs 17% of the entire function:

CleanShot 2025-07-02 at 07 22 27@2x

@stuartcarnie
Copy link
Contributor Author

@clayjohn / @Calinou Why is the atomic fetches needed at all? All the read / write operations use pos, which is the thread's position for the dispatch call, and pos isn't mutated in any way, which means that it reads / writes to the location defined by gl_GlobalInvocationID.

All the dispatch calls are to a single command buffer, so they can't execute in parallel either.

Instead of:

imageAtomicAdd(density_only_map, pos, final_density);

Couldn't it be:

imageStore(density_only_map, pos, imageLoad(density_only_map, pos) + final_density);

@clayjohn
Copy link
Member

clayjohn commented Jul 2, 2025

@stuartcarnie on AMD all the dispatches are processed in parellel. Using load/store only works if you put a barrier between each dispatch

I'm surprised to hear that the dispatches aren't executed in parallel though. On Apple silicon they definitely are for vertex and fragment workloads

@stuartcarnie
Copy link
Contributor Author

on AMD all the dispatches are processed in parellel. Using load/store only works if you put a barrier between each dispatch

Thanks! You sent me down a long research rabbit hold which eventually lead me to finding that the default behaviour of a Metal MTLComputeCommandEncoder is serial execution of dispatch calls; however, you can construct a command encoder with alternative dispatch types, so that dispatch commands are executed concurrently:

https://developer.apple.com/documentation/metal/mtlcommandbuffer/makecomputecommandencoder(dispatchtype:)?language=objc

Given Godot's graphics API is built on Vulkan with those assumptions of memory model and execution, I can safely switch to the concurrent dispatch type. In doing so, my FPS changed as follows:

shader mode serial dispatch (default) concurrent dispatch
image atomics 79 118
buffer atomics 91 118

We can see the effects in the GPU profile too. The following is an image of the timeline of the volumetric fog shader dispatch calls. Note all the gaps:

CleanShot 2025-07-02 at 18 04 53@2x

Now look at the following with concurrent execution enabled:

CleanShot 2025-07-02 at 18 05 50@2x

@stuartcarnie stuartcarnie force-pushed the apple_image_atomics branch from 6d8947b to a435646 Compare July 2, 2025 19:36
Comment on lines +126 to +129
if (OS::get_singleton()->get_environment("GODOT_MTL_DISABLE_IMAGE_ATOMICS") == "1") {
features.supports_image_atomic_32_bit = false;
features.supports_image_atomic_64_bit = false;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allow image atomics to be disabled, which is useful for comparing performance.

Note

You will need to delete the Volumetric shaders from the cache, as they require regeneration

Copy link
Contributor Author

@stuartcarnie stuartcarnie Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two changes to the MDCommandBuffer:

  • Like uniform state, push constants are captured and only bound to the pipeline state when the render or compute shader is executed.
  • Compute shaders now capture uniform state, like render shaders, so that they only bind changed uniforms sets, which reduces the number of API calls to Metal by 1,000s of calls, especially for something like the Fog stress tests, where compute dispatch and uniform sets are bound for every fog volume.
  • Further, compute shaders are only created if dispatched, which removes some "empty encoder" warnings from Metal Graphics Debugger in Xcode

@@ -2417,6 +2445,7 @@ bool isArrayTexture(MTLTextureType p_type) {

MTLComputePipelineDescriptor *desc = [MTLComputePipelineDescriptor new];
desc.computeFunction = function;
desc.label = conv::to_nsstring(shader->name);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really handy for debugging, as now we can see all the shader names in the Metal Graphics Debugger in Xcode

RD::get_singleton()->set_resource_name(emissive_map, "Fog emissive map");
} else {
tf.format = RD::DATA_FORMAT_R32_UINT;
tf.usage_bits = RD::TEXTURE_USAGE_STORAGE_BIT | RD::TEXTURE_USAGE_CAN_COPY_TO_BIT | RD::TEXTURE_USAGE_STORAGE_ATOMIC_BIT;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We added TEXTURE_USAGE_STORAGE_ATOMIC_BIT as the engine uses this to validate the texture format supports image atomics and can inform drivers, like Metal, that the texture will be used for atomic image operations

Comment on lines +5 to +6
#pragma use_vulkan_memory_model

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This informs SPIRV-Cross to use the vulkan memory model when generating SPIRV, which includes more fine-grained control of coherency.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't find much about this online. Do you have a link to somewhere with a clear explanation of what behaviour changes as a result of enabling this? Discussions on gpuweb/gpuweb#2377 seem to indicate that the changes made to the SPIRV as a result of this pragma are basically ineffective with Vulkan without enabling some Vulkan memory model stuff on the API side.

Does this benefit SPIRV-Cross/our MSL transpilation?

Copy link
Contributor Author

@stuartcarnie stuartcarnie Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is what I found too. It only adds some directives inside the SPIR-V, and for the Vulkan Driver, the Vulkan memory model is already assumed. So this informs SPIR-V translation that the GLSL is using the Vulkan Memory Model.

@stuartcarnie stuartcarnie force-pushed the apple_image_atomics branch from a435646 to 283cacc Compare July 2, 2025 19:47
@clayjohn clayjohn modified the milestones: 4.x, 4.6 Jul 2, 2025
@stuartcarnie
Copy link
Contributor Author

stuartcarnie commented Jul 2, 2025

@Calinou for some reason your test-atomic-performance.zip returns a 404.

I would run my test, as it ensures the fog volume has density, to force some atomic usage.

Edit

Well, when I pasted it into this comment, it worked πŸ™„

@Calinou
Copy link
Member

Calinou commented Jul 3, 2025

I tested performance again on the latest revision of this PR with the MRP from #108028 (review):

  • master: 98 FPS
  • This PR: 570 FPS (fluctuates a fair bit though, it's between 300 and 570 FPS).

However, interestingly, I can't exit Godot in the MRP anymore with this PR even though I can in master (Ctrl + C also doesn't work). I need to force quit it instead.

@Calinou for some reason your test-atomic-performance.zip returns a 404.

darksylinc encountered the same issue in #108127 (comment). I'm not sure what's causing it.

@stuartcarnie
Copy link
Contributor Author

However, interestingly, I can't exit Godot in the MRP anymore with this PR even though I can in master (Ctrl + C also doesn't work). I need to force quit it instead.

Which MRP are you using?

I ran my test project on my M4 and M1 MacBook Pros and was able to quit using the traffic lights and CTRL+C πŸ€”

@stuartcarnie
Copy link
Contributor Author

  • master: 98 FPS
  • This PR: 570 FPS (fluctuates a fair bit though, it's between 300 and 570 FPS).

That is a pretty big jump!

@stuartcarnie
Copy link
Contributor Author

stuartcarnie commented Jul 3, 2025

@clayjohn (or anyone else) can you reproduce @Calinou's issue:

However, interestingly, I can't exit Godot in the MRP anymore with this PR even though I can in master (Ctrl + C also doesn't work). I need to force quit it instead.

I've tried it on two different machines and I am unable to reproduce and can exit fine either closing the window or pressing CTRL+C:

With fog_stress, you should launch with --fog-volumes 5000 to add 5000 volumes to the scene.

@stuartcarnie stuartcarnie force-pushed the apple_image_atomics branch from 283cacc to 5455d64 Compare July 3, 2025 22:09
@stuartcarnie
Copy link
Contributor Author

Update

False alarm, it was my mistake as I forgot to include the change to rendering_device_driver_metal.mm in this diff:

https://github.com/godotengine/godot/compare/283cacc39f8e9ba12138ea8b59e03cf9fb3c6ebf..5455d643311e2113ae3f993d5effb9c74abacf38

It meant that the semaphore was not signalled when the command buffer was nil, and so the process would hang.

@stuartcarnie stuartcarnie force-pushed the apple_image_atomics branch from 5455d64 to e5b8cc2 Compare July 3, 2025 23:48
Copy link
Member

@Calinou Calinou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works great on my end now πŸ™‚

@stuartcarnie
Copy link
Contributor Author

Works great on my end now πŸ™‚

Thanks! Appreciate your patience too!

@clayjohn clayjohn modified the milestones: 4.6, 4.5 Jul 4, 2025
@stuartcarnie stuartcarnie force-pushed the apple_image_atomics branch from 6f678c0 to 59a2f43 Compare July 4, 2025 19:58
@stuartcarnie stuartcarnie force-pushed the apple_image_atomics branch from 59a2f43 to 5230f6c Compare July 4, 2025 19:59
@stuartcarnie
Copy link
Contributor Author

Thanks @AThousandShips – all your changes are incorporated!

@Repiteo Repiteo merged commit db1b6b9 into godotengine:master Jul 7, 2025
20 checks passed
@Repiteo
Copy link
Contributor

Repiteo commented Jul 7, 2025

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants