Metal: Use image atomic operations on supported Apple hardware #108028

stuartcarnie · 2025-06-27T00:00:34Z

Metal is capable of atomic image operations on supported hardware, so this PR enables it at runtime when the support is available.

🗒️ Changes

Added image atomics feature to RenderingDevice
🚀 Changed default compute encoder to execute dispatch calls concurrently
Updated SPIRV-Cross to include atomic fixes
Reduce CPU overhead in Metal driver by avoiding unnecessary API calls
- Reduces API warnings when running Godot under Xcode Metal Debugger for unnecessary state changes

🚀 Performance

4.5 / main with M4 MacBook Pro Max

Using the Fog Stress project with --fog-volumes 5000

Atomics PR (enabled): this PR with image atomics enabled for fog volumes (default)
Atomics PR (disabled): this PR with image atomics disabled via environment variable
Beta 1

The following chart shows FPS percentiles and GPU time percentiles. This PR is almost 2x!

The following chart is GPU frame time, over time, and also shows a very consistent 2x improvement.

Finally, note that atomic operations on images or buffers appears to be similar.

Godot Reflection (Tesseract) Benchmark

Also worth noting that other benchmarks show a GPU time improvement too.

4.5 / main with M1 MacBook Pro Max

Using the Fog Stress project with --fog-volumes 5000

The following chart shows FPS percentiles and GPU time percentiles. The difference is more significant.

The following chart is GPU frame time, over time, and also shows a very consistent 7x improvement.

Why is M1 faster in these tests?

My M4 is running multiple displays, which is affecting FPS, the M1 is only running built-in display.

🥼 Testing

fog_stress.zip

Note

To disable image atomics, you can specify the following environment variable when launching Godot:

GODOT_METAL_DISABLE_IMAGE_ATOMICS=1

Volumetric Fog from Godot Demo Projects

✅ MacBook Pro M4
✅ iPhone 12 Pro (Apple6 GPU)

Important

This would supersede @akien-mga's update of SPIRV-Cross in #107773, as I had to patch it to fix the MSL generation. A .patch file is included and an issue opened for the SPIRV-Cross project.

stuartcarnie · 2025-06-27T08:18:42Z

Fixed missing constant binding, so it should pass CI/CD now

akien-mga · 2025-06-27T08:30:56Z

thirdparty/README.md

@@ -990,6 +990,10 @@ Files extracted from upstream source:
 Versions of this SDK do not have to match the `vulkan` section, as this SDK is required
 to generate Metal source from Vulkan SPIR-V.

+Patches:
+
+- `0001-fix-argument-buffer-access.patch`


Suggested change

- `0001-fix-argument-buffer-access.patch`

- `0001-fix-argument-buffer-access.patch` (GH-108028)

I've removed the patch and updated SPIRV-Cross, as it has been fixed

Calinou

Tested locally, it works as expected. Code looks good to me.

Testing project: test-atomic-performance.zip

However, I can't reproduce any performance improvement on the testing project linked above (a very simple scene with 4,096 FogVolumes using the fallback material). In fact, I get less FPS now:

Metal `master`	MoltenVK `master`	Metal PR
95 FPS	91 FPS	80 FPS

For reference, the scene with volumetric fog disabled (and no FogVolumes) gives me 768 FPS.

Mac specifications

MacBook Pro 16 2024
SoC: M4 Max 16-core CPU, 40-core GPU
RAM: 48 GB
SSD: 1 TB
OS: macOS 15.5

stuartcarnie · 2025-07-01T20:18:54Z

However, I can't reproduce any performance improvement on the testing project linked above (a very simple scene with 4,096 FogVolumes using the fallback material). In fact, I get less FPS now

@Calinou I have been seeing the same thing; and initially I thought it was a difference between 4.4 and 4.5 in general, but I'm also seeing the difference between main and this branch.

I added an environment variable so that I could change between buffers and textures on the fly and I'm seeing 78fps with atomic operations on textures and 92fps with atomic operations on buffers. 🤷🏻

I enabled the Vulkan memory model in the Fog shaders, to remove the memory_coherency_device qualifier from the texture bindings:

#pragma use_vulkan_memory_model

Whilst it did remove the memory_coherence_device from the texture biding in the shader, it didn't make a difference to FPS.

Taking a step back, I'm now not to sure why the storage buffer version would be slower, given it also uses atomic operations? It seems that there may be different behaviour regarding coherency between textures and buffers that may be the difference.

stuartcarnie · 2025-07-01T21:55:13Z

More data to support the difference.

This image is a GPU capture using buffer atomic operations. The atomic_fetch_add_explicit isn't even showing up.

The following image is a GPU capture using image atomic operations. The popup shows the breakdown of instructions for the line that performs the atomic_fetch_and_add on the texture. It costs 17% of the entire function:

stuartcarnie · 2025-07-02T02:28:25Z

@clayjohn / @Calinou Why is the atomic fetches needed at all? All the read / write operations use pos, which is the thread's position for the dispatch call, and pos isn't mutated in any way, which means that it reads / writes to the location defined by gl_GlobalInvocationID.

All the dispatch calls are to a single command buffer, so they can't execute in parallel either.

Instead of:

imageAtomicAdd(density_only_map, pos, final_density);

Couldn't it be:

imageStore(density_only_map, pos, imageLoad(density_only_map, pos) + final_density);

clayjohn · 2025-07-02T02:44:52Z

@stuartcarnie on AMD all the dispatches are processed in parellel. Using load/store only works if you put a barrier between each dispatch

I'm surprised to hear that the dispatches aren't executed in parallel though. On Apple silicon they definitely are for vertex and fragment workloads

stuartcarnie · 2025-07-02T08:06:24Z

on AMD all the dispatches are processed in parellel. Using load/store only works if you put a barrier between each dispatch

Thanks! You sent me down a long research rabbit hold which eventually lead me to finding that the default behaviour of a Metal MTLComputeCommandEncoder is serial execution of dispatch calls; however, you can construct a command encoder with alternative dispatch types, so that dispatch commands are executed concurrently:

https://developer.apple.com/documentation/metal/mtlcommandbuffer/makecomputecommandencoder(dispatchtype:)?language=objc

Given Godot's graphics API is built on Vulkan with those assumptions of memory model and execution, I can safely switch to the concurrent dispatch type. In doing so, my FPS changed as follows:

shader mode	serial dispatch (default)	concurrent dispatch
image atomics	79	118
buffer atomics	91	118

We can see the effects in the GPU profile too. The following is an image of the timeline of the volumetric fog shader dispatch calls. Note all the gaps:

Now look at the following with concurrent execution enabled:

stuartcarnie · 2025-07-02T19:39:27Z

drivers/metal/metal_device_properties.mm

+	if (OS::get_singleton()->get_environment("GODOT_MTL_DISABLE_IMAGE_ATOMICS") == "1") {
+		features.supports_image_atomic_32_bit = false;
+		features.supports_image_atomic_64_bit = false;
+	}


Allow image atomics to be disabled, which is useful for comparing performance.

Note

You will need to delete the Volumetric shaders from the cache, as they require regeneration

stuartcarnie · 2025-07-02T19:42:19Z

drivers/metal/metal_objects.h

Two changes to the MDCommandBuffer:

Like uniform state, push constants are captured and only bound to the pipeline state when the render or compute shader is executed.

Compute shaders now capture uniform state, like render shaders, so that they only bind changed uniforms sets, which reduces the number of API calls to Metal by 1,000s of calls, especially for something like the Fog stress tests, where compute dispatch and uniform sets are bound for every fog volume.

Further, compute shaders are only created if dispatched, which removes some "empty encoder" warnings from Metal Graphics Debugger in Xcode

stuartcarnie · 2025-07-02T19:43:45Z

drivers/metal/rendering_device_driver_metal.mm

@@ -2417,6 +2445,7 @@ bool isArrayTexture(MTLTextureType p_type) {

 	MTLComputePipelineDescriptor *desc = [MTLComputePipelineDescriptor new];
 	desc.computeFunction = function;
+	desc.label = conv::to_nsstring(shader->name);


This is really handy for debugging, as now we can see all the shader names in the Metal Graphics Debugger in Xcode

stuartcarnie · 2025-07-02T19:45:02Z

servers/rendering/renderer_rd/environment/fog.cpp

+		RD::get_singleton()->set_resource_name(emissive_map, "Fog emissive map");
+	} else {
+		tf.format = RD::DATA_FORMAT_R32_UINT;
+		tf.usage_bits = RD::TEXTURE_USAGE_STORAGE_BIT | RD::TEXTURE_USAGE_CAN_COPY_TO_BIT | RD::TEXTURE_USAGE_STORAGE_ATOMIC_BIT;


We added TEXTURE_USAGE_STORAGE_ATOMIC_BIT as the engine uses this to validate the texture format supports image atomics and can inform drivers, like Metal, that the texture will be used for atomic image operations

stuartcarnie · 2025-07-02T19:46:05Z

servers/rendering/renderer_rd/shaders/environment/volumetric_fog.glsl

+#pragma use_vulkan_memory_model
+


This informs SPIRV-Cross to use the vulkan memory model when generating SPIRV, which includes more fine-grained control of coherency.

I can't find much about this online. Do you have a link to somewhere with a clear explanation of what behaviour changes as a result of enabling this? Discussions on gpuweb/gpuweb#2377 seem to indicate that the changes made to the SPIRV as a result of this pragma are basically ineffective with Vulkan without enabling some Vulkan memory model stuff on the API side.

Does this benefit SPIRV-Cross/our MSL transpilation?

Yes, that is what I found too. It only adds some directives inside the SPIR-V, and for the Vulkan Driver, the Vulkan memory model is already assumed. So this informs SPIR-V translation that the GLSL is using the Vulkan Memory Model.

stuartcarnie · 2025-07-02T22:43:40Z

@Calinou for some reason your test-atomic-performance.zip returns a 404.

I would run my test, as it ensures the fog volume has density, to force some atomic usage.

Edit

Well, when I pasted it into this comment, it worked 🙄

Calinou · 2025-07-03T16:44:52Z

I tested performance again on the latest revision of this PR with the MRP from #108028 (review):

master: 98 FPS
This PR: 570 FPS (fluctuates a fair bit though, it's between 300 and 570 FPS).

However, interestingly, I can't exit Godot in the MRP anymore with this PR even though I can in master (Ctrl + C also doesn't work). I need to force quit it instead.

@Calinou for some reason your test-atomic-performance.zip returns a 404.

darksylinc encountered the same issue in #108127 (comment). I'm not sure what's causing it.

stuartcarnie · 2025-07-03T19:48:11Z

However, interestingly, I can't exit Godot in the MRP anymore with this PR even though I can in master (Ctrl + C also doesn't work). I need to force quit it instead.

Which MRP are you using?

I ran my test project on my M4 and M1 MacBook Pros and was able to quit using the traffic lights and CTRL+C 🤔

stuartcarnie · 2025-07-03T19:48:34Z

master: 98 FPS

This PR: 570 FPS (fluctuates a fair bit though, it's between 300 and 570 FPS).

That is a pretty big jump!

stuartcarnie · 2025-07-03T20:41:26Z

@clayjohn (or anyone else) can you reproduce @Calinou's issue:

However, interestingly, I can't exit Godot in the MRP anymore with this PR even though I can in master (Ctrl + C also doesn't work). I need to force quit it instead.

I've tried it on two different machines and I am unable to reproduce and can exit fine either closing the window or pressing CTRL+C:

With fog_stress, you should launch with --fog-volumes 5000 to add 5000 volumes to the scene.

stuartcarnie · 2025-07-03T22:12:46Z

Update

False alarm, it was my mistake as I forgot to include the change to rendering_device_driver_metal.mm in this diff:

https://github.com/godotengine/godot/compare/283cacc39f8e9ba12138ea8b59e03cf9fb3c6ebf..5455d643311e2113ae3f993d5effb9c74abacf38

It meant that the semaphore was not signalled when the command buffer was nil, and so the process would hang.

Calinou

Works great on my end now 🙂

stuartcarnie · 2025-07-04T00:13:54Z

Works great on my end now 🙂

Thanks! Appreciate your patience too!

drivers/metal/rendering_device_driver_metal.mm

Co-authored-by: A Thousand Ships <[email protected]>

stuartcarnie · 2025-07-04T19:59:45Z

Thanks @AThousandShips – all your changes are incorporated!

Repiteo · 2025-07-07T15:14:47Z

Thanks!

stuartcarnie requested review from a team as code owners June 27, 2025 00:00

stuartcarnie force-pushed the apple_image_atomics branch from 8f8de15 to d262baa Compare June 27, 2025 01:11

AThousandShips added enhancement platform:ios platform:macos topic:rendering topic:porting labels Jun 27, 2025

AThousandShips added this to the 4.x milestone Jun 27, 2025

stuartcarnie force-pushed the apple_image_atomics branch from d262baa to c8ebe5a Compare June 27, 2025 08:18

akien-mga reviewed Jun 27, 2025

View reviewed changes

stuartcarnie force-pushed the apple_image_atomics branch from c8ebe5a to 6d8947b Compare June 28, 2025 23:26

Calinou reviewed Jun 30, 2025

View reviewed changes

stuartcarnie force-pushed the apple_image_atomics branch from 6d8947b to a435646 Compare July 2, 2025 19:36

stuartcarnie commented Jul 2, 2025

View reviewed changes

stuartcarnie force-pushed the apple_image_atomics branch from a435646 to 283cacc Compare July 2, 2025 19:47

clayjohn approved these changes Jul 2, 2025

View reviewed changes

clayjohn modified the milestones: 4.x, 4.6 Jul 2, 2025

stuartcarnie force-pushed the apple_image_atomics branch from 283cacc to 5455d64 Compare July 3, 2025 22:09

stuartcarnie force-pushed the apple_image_atomics branch from 5455d64 to e5b8cc2 Compare July 3, 2025 23:48

Calinou approved these changes Jul 3, 2025

View reviewed changes

clayjohn modified the milestones: 4.6, 4.5 Jul 4, 2025

AThousandShips approved these changes Jul 4, 2025

View reviewed changes

drivers/metal/rendering_device_driver_metal.mm Outdated Show resolved Hide resolved

drivers/metal/rendering_device_driver_metal.mm Outdated Show resolved Hide resolved

drivers/metal/rendering_device_driver_metal.mm Outdated Show resolved Hide resolved

stuartcarnie force-pushed the apple_image_atomics branch from 6f678c0 to 59a2f43 Compare July 4, 2025 19:58

Apple: Use image atomic operations on supported Apple hardware

5230f6c

Co-authored-by: A Thousand Ships <[email protected]>

stuartcarnie force-pushed the apple_image_atomics branch from 59a2f43 to 5230f6c Compare July 4, 2025 19:59

Repiteo merged commit db1b6b9 into godotengine:master Jul 7, 2025
20 checks passed

stuartcarnie deleted the apple_image_atomics branch July 8, 2025 22:27

goncalo mentioned this pull request Jul 9, 2025

Importing textures with VRAM Compressed mode fails on Apple M1. #108445

Open

bruvzg mentioned this pull request Jul 9, 2025

[macOS] Selectively bake specific shader variants for MoltenVK. #107794

Open

	- `0001-fix-argument-buffer-access.patch`
	- `0001-fix-argument-buffer-access.patch` (GH-108028)

Uh oh!

Metal: Use image atomic operations on supported Apple hardware #108028

Metal: Use image atomic operations on supported Apple hardware #108028

Conversation

stuartcarnie commented Jun 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🗒️ Changes

🚀 Performance

4.5 / main with M4 MacBook Pro Max

Godot Reflection (Tesseract) Benchmark

4.5 / main with M1 MacBook Pro Max

Why is M1 faster in these tests?

🥼 Testing

Uh oh!

stuartcarnie commented Jun 27, 2025

Uh oh!

akien-mga Jun 27, 2025

Choose a reason for hiding this comment

Uh oh!

stuartcarnie Jun 28, 2025

Choose a reason for hiding this comment

Uh oh!

Calinou left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stuartcarnie commented Jul 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stuartcarnie commented Jul 1, 2025

Uh oh!

stuartcarnie commented Jul 2, 2025

Uh oh!

clayjohn commented Jul 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stuartcarnie commented Jul 2, 2025

Uh oh!

stuartcarnie Jul 2, 2025

Choose a reason for hiding this comment

Uh oh!

stuartcarnie Jul 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stuartcarnie Jul 2, 2025

Choose a reason for hiding this comment

Uh oh!

stuartcarnie Jul 2, 2025

Choose a reason for hiding this comment

Uh oh!

stuartcarnie Jul 2, 2025

Choose a reason for hiding this comment

Uh oh!

clayjohn Jul 2, 2025

Choose a reason for hiding this comment

Uh oh!

stuartcarnie Jul 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stuartcarnie commented Jul 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Edit

Uh oh!

Calinou commented Jul 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stuartcarnie commented Jul 3, 2025

Uh oh!

stuartcarnie commented Jul 3, 2025

Uh oh!

stuartcarnie commented Jul 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stuartcarnie commented Jul 3, 2025

Update

Uh oh!

Calinou left a comment

Choose a reason for hiding this comment

stuartcarnie commented Jun 27, 2025 •

edited

Loading

Calinou left a comment •

edited

Loading

stuartcarnie commented Jul 1, 2025 •

edited

Loading

clayjohn commented Jul 2, 2025 •

edited

Loading

stuartcarnie Jul 2, 2025 •

edited

Loading

stuartcarnie Jul 2, 2025 •

edited

Loading

stuartcarnie commented Jul 2, 2025 •

edited

Loading

Calinou commented Jul 3, 2025 •

edited

Loading

stuartcarnie commented Jul 3, 2025 •

edited

Loading