Modernize JPEG Color Converters #2917

JimBobSquarePants · 2025-05-05T05:55:33Z

Prerequisites

I have written a descriptive pull-request title
I have verified that there are no overlapping pull-requests open
I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
I have provided test coverage for my change (where applicable)

Description

With .NET 8 we can take advantage of the VectorXXX<T> types to provide cross-platform hardware intrinsics with great support on all platforms. This allows us to simplify our code by normalizing the approach and additionally take advantage of performance improvements due to excellent Vector128 support on Mono.

This PR replaces the existing CPU-specific implementation with implementations for Vector128, Vector256, and Vector512.

Benchmarks look good overall (exceeding or meeting the main branch). Vector512 is slightly disappointing and only shows marginal speedup on my machine. I believe this is due to two primary factors:

Throttling - Intel CPUs aggressively downclock when AVX-512 instructions are executed to manage thermals and power draw. This frequency reduction applies across all cores, which can negate the expected throughput gains from using 512-bit wide vectors.
Memory utilization issues - AVX-512 loads and stores perform best when data is aligned to 64-byte boundaries. Without guaranteed alignment, memory operations may fall back to unaligned access paths that split operations internally, causing increased micro-op pressure and reduced bandwidth efficiency. Additionally, when working with larger vectors, cache line pressure and prefetcher behavior become more critical; misaligned or poorly laid-out memory can further exacerbate performance bottlenecks. This highlights the importance of using properly aligned memory when utilizing Vector512<T> in hot paths.

Handling memory alignment is something we've discussed before and the APIs to do so exist now that would make it a lot easier. I do wonder though whether we should simply default to pinning with alignment given almost all our code goes through some sort of SIMD operation?

@beeradmoore I'm wondering if you'd be able to benchmark this branch using your approach from #2762 to see if we've finally reached decent performance on Android? Only if you have time of course, I wouldn't want to put you out.

@tannergooding I'm tagging you here if that's OK as we've discussed these changes previously and you input is always super valuable.

Copilot

Pull Request Overview

This PR modernizes the JPEG color conversion implementations by replacing the old CPU-specific code with new implementations that leverage .NET 8’s VectorXXX types for improved cross‐platform performance and simplified code. The key changes include new vectorized implementations for 512-, 256-, and 128‑bit conversions for RGB, Grayscale, and CMYK channels; renaming of classes and methods for consistency; and removal of legacy vector and scalar implementations.

Reviewed Changes

Copilot reviewed 44 out of 44 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.RgbVector512.cs	Introduces a new 512-bit vectorized implementation for RGB conversion.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.RgbVector256.cs	Updates 256-bit RGB conversion and renames class/methods to align with modern naming.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.RgbVector128.cs	Revises the 128-bit RGB conversion to use operator overloads for clarity.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.RgbVector.cs	Legacy vectorized RGB conversion removed.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.RgbScalar.cs	Updates scalar RGB conversion with improved naming consistency.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.GrayScaleVector512.cs	Introduces a new 512-bit vectorized implementation for grayscale conversion.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.GrayScaleVector256.cs	Updates 256-bit grayscale conversion with corrected comments and naming.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.GrayScaleVector128.cs	Revises 128-bit grayscale conversion in line with updated naming and operator usage.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.GrayScaleVector.cs	Legacy vectorized grayscale conversion removed.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.GrayScaleScalar.cs	Updates scalar grayscale conversion for consistency.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.CmykVector512.cs	Introduces a new 512-bit vectorized CMYK conversion implementation.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.CmykVector256.cs	Updates 256-bit CMYK conversion with modern API usage and renaming.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.CmykVector128.cs	Revises 128-bit CMYK conversion and improves vector operations.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.CmykVector.cs	Legacy vectorized CMYK conversion removed.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.CmykScalar.cs	Updates scalar CMYK conversion with improved method naming.
src/ImageSharp/Common/Helpers/Vector[128/256/512]Utilities.cs, SimdUtils.cs, SimdUtils.HwIntrinsics.cs	Modernizes SIMD helper functions to support new language patterns and operator overloads.

Comments suppressed due to low confidence (2)

src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs:652

The TODO comment indicates an order discrepancy in the FMA intrinsic usage; clarify in the documentation how the argument order relates to the intrinsic behavior and update the comment once standardized.

return Sse.Add(Sse.Multiply(vm0, vm1), va);

src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.CmykVector256.cs:81

[nitpick] Verify that using 'Equals' to create the mask for handling division-by-zero scenarios is optimal across all target architectures; consider benchmarking against alternative intrinsic calls if available.

Vector256<float> kMask = Vector256.Equals(ktmp, scale);

beeradmoore · 2025-05-05T06:06:53Z

@JimBobSquarePants, sure thing. Would also be nice to update my results to .NET 9 as well.

By chance do any of them GitHub builds put out a nuget package or am I building it myself for the tests?

JimBobSquarePants · 2025-05-05T06:08:15Z

@JimBobSquarePants, sure thing. Would also be nice to update my results to .NET 9 as well.

By chance do any of them GitHub builds put out a nuget package or am I building it myself for the tests?

Legend!

They only produce a package once merged so you would have to build against this particular branch. Building should work out of the box.

JimBobSquarePants · 2025-05-05T06:13:16Z

src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs

-            }
-
-            return Sse.Add(Sse.Multiply(vm0, vm1), va);
+            return va + (vm0 * vm1);


I actually intend on removing this copy of the method in future PRs. Doing so in this one would have touched too many unrelated files.

JimBobSquarePants · 2025-05-05T06:14:02Z

src/ImageSharp/Common/Helpers/SimdUtils.cs

    /// </summary>
    /// <param name="v">The vector</param>
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static Vector<float> FastRound(this Vector<float> v)
    {
-        if (Avx2.IsSupported)
+        if (Avx2.IsSupported && Vector<float>.Count == Vector256<float>.Count)


I don't know whether this is ever false. But I wanted the check to be a little stricter given for the future potential of 512 supported Vector<T>

It can be false even when Avx2.IsSupported as someone can set DOTNET_PreferredVectorBitWidth=128, so having an explicit check is goodness.

Notably you have Sse41.RoundToNearestInteger that can be used for 128-bit, Avx512F.RoundScale that can be used for 512-bit, and AdvSimd.RoundToNearest that can be used for Arm64.

.NET 9+ defines Vector.Round which can be used to handle the xplat consideration without having to manually code it all (and same method exists on Vector128/256/512)

tannergooding · 2025-05-05T14:21:38Z

src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs

@@ -616,44 +616,6 @@ public static Vector256<float> MultiplyAdd(
                return Fma.MultiplyAdd(vm1, vm0, va);
            }

-            if (Avx.IsSupported)


nit: You can have a similar path for Arm64 (note the parameter order is different from Fma on x64)

if (AdvSimd.IsSupported) { return AdvSimd.FusedMultiplyAdd(va, vm1, vm0); }

When you're able to move to .NET 9+, you can switch all this to just Vector128.MultiplyAddEstimate(x, y, addend) which handles this optimization for you.

tannergooding · 2025-05-05T14:37:59Z

src/ImageSharp/Common/Helpers/Vector128Utilities.cs

@@ -193,13 +193,60 @@ public static Vector128<int> ConvertToInt32RoundToEven(Vector128<float> vector)
            return AdvSimd.ConvertToInt32RoundToEven(vector);
        }

-        Vector128<float> sign = vector & Vector128.Create(-0.0f);
-        Vector128<float> val_2p23_f32 = sign | Vector128.Create(8388608.0f);
+        Vector128<float> sign = vector & Vector128.Create(-0F);


No general helper exists for .NET 9+ here...

It's something we could probably pattern recognize as ConvertToInt32(Round(vector)) though. If it's a pattern you'd like to see us optimize, logging a suggestion on dotnet/runtime would be beneficial. Alternatively some ConvertToInt32(vector, MidpointRounding) API could be provided as well and an API suggestion could be opened.

I'll have a look at creating an issue. Thanks.

tannergooding · 2025-05-05T14:38:58Z

src/ImageSharp/Common/Helpers/Vector256Utilities.cs

+        if (Avx.IsSupported)
+        {
+            return Avx.Add(Avx.Multiply(vm0, vm1), va);
+        }


This is unnecessary, it does the same thing as the va + (vm0 * vm1) logic just below

tannergooding · 2025-05-05T14:39:48Z

src/ImageSharp/Common/Helpers/Vector128Utilities.cs

+        if (Fma.IsSupported)
+        {
+            return Fma.MultiplyAdd(vm1, vm0, va);
+        }
+
+        return va + (vm0 * vm1);


Similar comment as on the SimdUtils.HwIntrinsics.cs copy of the method, where you could add an AdvSimd path for Arm64 and eventually switch to MultiplyAddEstimate when you target .NET 9+

tannergooding · 2025-05-05T14:46:20Z

src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.CmykVector128.cs

+                Vector128<float> ctmp = scale - Unsafe.Add(ref srcR, i);
+                Vector128<float> mtmp = scale - Unsafe.Add(ref srcG, i);
+                Vector128<float> ytmp = scale - Unsafe.Add(ref srcB, i);
+                Vector128<float> ktmp = Vector128.Min(ctmp, Vector128.Min(mtmp, ytmp));


Just as a note, on .NET 9+ the behavior of Min has changed for x86/x64.

On .NET 8 and prior, Vector128.Min differed in behavior from Math.Min and it basically just emitted whatever single instruction was fastest. This lead to non-determinism across x64, Arm64, and other platforms when dealing with -0 and NaN.

On .NET 9 and later, Vector128.Min now behaves identically to Math.Min and is deterministic across all platforms. This can show as a performance regression on x64 for some older hardware as additional instructions must be executed to make this guarantee (newer hardware has dedicated instructions that behave correctly).

If (on .NET 9+) you don't have any concerns around how -0 is sorted compared to +0 and you don't have any concerns around NaN, then you can use Vector128.MinNative to continue getting the "fast" but potentially non-deterministic behavior.

tannergooding · 2025-05-05T14:50:45Z

src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.CmykVector128.cs

+                Vector128<float> kMask = Vector128.Equals(ktmp, scale);
+                ctmp = Vector128.AndNot((ctmp - ktmp) / (scale - ktmp), kMask);
+                mtmp = Vector128.AndNot((mtmp - ktmp) / (scale - ktmp), kMask);
+                ytmp = Vector128.AndNot((ytmp - ktmp) / (scale - ktmp), kMask);


nit: This can be simplified a bit to the following, which should give the JIT more flexibility around register selection:

Vector128<float> kMask = ~Vector128.Equals(ktmp, scale); Vector128<float> divisor = csale - ktmp; ctmp = ((ctmp - ktmp) / divisor) & kMask; mtmp = ((mtmp - ktmp) / divisor) & kMask; ytmp = ((ytmp - ktmp) / divisor) & kMask;

Namely this is just explicitly hoisting the scale - ktmp for reuse and inverting the Equals comparison up front, which will cause a not equals comparison to be emitted instead and then allows the masking to do a bitwise-and, which is commutative.

tannergooding · 2025-05-05T14:54:59Z

src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverterVector128.cs

            : base(colorSpace, precision)
        {
        }

-        public static bool IsSupported => AdvSimd.IsSupported;
+        public static bool IsSupported => Vector128.IsHardwareAccelerated && Vector128<float>.IsSupported;


nit: You don't need to check Vector128<float>.IsSupported, this is statically true.

The Vector128<T>.IsSupported is namely for generic contexts where you don't know concretely what the T is.

In general you can presume that byte, sbyte, short, ushort, int, uint, long, ulong, float, and double are always supported.

You can presume that nint and nuint are supported on .NET 7+

We will likely add support for Half at some point in the future, at which point a Vector128.AsHalf() method will exist to "document" this guarantee of support.

tannergooding

LGTM. Just left a couple comments around things you could do for Arm64 support or things to keep in mind for .NET 9+

beeradmoore · 2025-05-05T22:58:35Z

So some odd things here. It seems between 3.1.8 and this PR there is a regression in Android performance. However, going from 3.1.4 (or that other PR) to even the slowest results in this PR performance has improved. This could be the update to .NET 9, this could be changes in ImageSharp. Below are the numbers I currently have.

This is also with these build configs which is I believe what we found to be optimal from that other issue.

    <PropertyGroup Condition="$([MSBuild]::GetTargetPlatformIdentifier('$(TargetFramework)')) == 'android' AND '$(Configuration)' == 'Release'">
        <EnableLLVM>true</EnableLLVM>
        <RunAOTCompilation>true</RunAOTCompilation>
        <AndroidEnableProfiledAot>false</AndroidEnableProfiledAot>
    </PropertyGroup>

    <PropertyGroup Condition="$([MSBuild]::GetTargetPlatformIdentifier('$(TargetFramework)')) == 'android' AND '$(Configuration)' == 'Debug'">
        <!-- performance improvements for debug mode, will break hot reload. -->
        <UseInterpreter>false</UseInterpreter>
    </PropertyGroup>

Debug (this PR)

Device	JpgLoad	JpgResize	PngLoad	PngResize
iPhone
iOS Simulator
Android	1375.0	1677.0	38.1	47.7
Android Emulator	226.8	284.9	13.6	14.9
macOS
Windows

Debug (3.1.8)

Device	JpgLoad	JpgResize	PngLoad	PngResize
iPhone
iOS Simulator
Android	1080.9	1302.9	37.6	45.4
Android Emulator	189.4	238.6	13.5	14.3
macOS
Windows

Release (this pr)

Device	JpgLoad	JpgResize	PngLoad	PngResize
iPhone
iOS Simulator
Android	339.7	467.7	20.4	27.1
Android Emulator	91.6	116.7	9.7	10.2
macOS
Windows

Release (3.1.8)

Device	JpgLoad	JpgResize	PngLoad	PngResize
iPhone
iOS Simulator
Android	290.2	393.0	19.7	24.0
Android Emulator	80.1	104.7	15.0	16.2
macOS
Windows

Test devices

Device	Hardware	OS
iPhone	iPhone 15 Pro Max	18.4.1
iOS Simulator	iPhone 15	18.4
Android	Pixel 2 XL	Android 15
Android Emulator	Pixel 3a	Android 14
macOS	MacBook Pro M3 Max	Sequoia 15.4.1
Windows	AMD 5950X + RTX 3060	Windows 11 24H2

beeradmoore · 2025-05-05T23:09:47Z

I did 3.1.4 in .NET 9, release mode on Android device and unfortunately it outperforms everything else here. It seems improvements we are seeing is from .NET 9 (or less likely Android 14 which got upgraded to Android 15)

Release - Android physical device

Device	JpgLoad	JpgResize	PngLoad	PngResize
3.1.4 (.NET 8)	515.3	712.2	26.9	35.4
3.1.4 (.NET 9)	285.2	389.6	20.0	24.5
3.1.8 (.NET 9)	290.2	393.0	19.7	24.0
This PR (.NET 9)	339.7	467.7	20.4	27.1

As a mobile dev if I saw these numbers I'd be a little turned off ImageSharp (as a desktop dev I love it in DLSS Swapper). I'd need to check against SkiaSharp and rolling my own native Android implementation (using Android classes in .NET) to see if this is a library performance issue or an Android chugs along issue.

If I were not using jpg I would not be concerned at all.

JimBobSquarePants · 2025-05-06T00:19:26Z

@tannergooding Thanks for the review, I've made all the changes based on your feedback.

@beeradmoore Those numbers are.... disappointing, I'm not sure where the regression is there. I've pushed some changes based on the feedback to MultiplyAdd which should provide a speedup on the YCbCr conversion (and others) but otherwise the conversion code should be solid.

I have, however found potential for some speedup before we hit the color conversion in Block8x8F where we are falling back to scalar rounding on non Avx platforms. I should be able to add SIMD there easily enough. I'll do that in a follow up PR though.

If possible, could you run the benchmark again on this branch one last time before I merge? Regression or not the focus with Mono is heavily orientated on Vector128 so performance can only improve there.

beeradmoore · 2025-05-06T05:50:38Z

@JimBobSquarePants , updated results

Debug (3.1.8)

Device	JpgLoad	JpgResize	PngLoad	PngResize
Android	1084.1	1312.4	37.1	44.4
Android Emulator	189.5	245.1	13.8	14.2

Debug (this updated PR)

Device	JpgLoad	JpgResize	PngLoad	PngResize
Android	1344.77	1586.2	37.5	48.1
Android Emulator	233.3	285.8	13.8	15.6

Release (3.1.8)

Device	JpgLoad	JpgResize	PngLoad	PngResize
Android	285.5	392.9	19.3	26.1
Android Emulator	83.7	96.2	9.4	9.9

Release (this updated PR)

Device	JpgLoad	JpgResize	PngLoad	PngResize
Android	341.0	469.4	20.1	25.8
Android Emulator	99.3	121.1	9.2	10.4

JimBobSquarePants · 2025-05-06T06:16:06Z

Thanks @beeradmoore

I'm wondering now if there isn't some other regression elsewhere in the V4 branch. There is ample opportunity for more SIMD in Block8x8F though so will do that next.

beeradmoore · 2025-05-06T06:20:50Z

Im going to take the work I did and throw it in a new app with benchmark.net so I can also test SkiaSharp as well as native implementations

So numbers here wont be comparable to those

Any other useful tests to add in?

For each image format (jpg, png, webp)
Load
Load + save
Load + resize + save

JimBobSquarePants · 2025-05-06T06:25:38Z

@beeradmoore That would be incredibly useful. Thank you!

JimBobSquarePants added 9 commits May 2, 2025 19:50

Reduce JPEG color conversion to two pipelines

aebe375

Rename and refactor method

cd3aa18

Begin re-implementing converters

959e600

Update MultiplyAdd

597bc0c

Port CMYK converter

3b430e1

Port Grayscale converter

87438a3

Port YccK and use alias

80fee7f

Port RGB converter

57614df

Revert naming

71ca954

JimBobSquarePants added area:performance formats:jpeg platform:mono labels May 5, 2025

JimBobSquarePants added this to the v4.0.0 milestone May 5, 2025

JimBobSquarePants requested review from antonfirsov and Copilot May 5, 2025 05:55

Copilot AI reviewed May 5, 2025

View reviewed changes

Remove unused method

277cf61

JimBobSquarePants commented May 5, 2025

View reviewed changes

tannergooding reviewed May 5, 2025

View reviewed changes

tannergooding approved these changes May 5, 2025

View reviewed changes

Address feedback

5b94795

JimBobSquarePants merged commit 3fbbde9 into main May 6, 2025
8 checks passed

JimBobSquarePants deleted the js/simplify-jpeg-converters branch May 6, 2025 06:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Modernize JPEG Color Converters #2917

Modernize JPEG Color Converters #2917

JimBobSquarePants commented May 5, 2025

Copilot AI left a comment

beeradmoore commented May 5, 2025

JimBobSquarePants commented May 5, 2025

JimBobSquarePants May 5, 2025

JimBobSquarePants May 5, 2025 •

edited

Loading

tannergooding May 5, 2025

tannergooding May 5, 2025 •

edited

Loading

tannergooding May 5, 2025

JimBobSquarePants May 5, 2025

tannergooding May 5, 2025

tannergooding May 5, 2025

tannergooding May 5, 2025

tannergooding May 5, 2025

tannergooding May 5, 2025

tannergooding May 5, 2025

tannergooding left a comment

beeradmoore commented May 5, 2025

beeradmoore commented May 5, 2025 •

edited

Loading

JimBobSquarePants commented May 6, 2025

beeradmoore commented May 6, 2025

JimBobSquarePants commented May 6, 2025

beeradmoore commented May 6, 2025

JimBobSquarePants commented May 6, 2025

Modernize JPEG Color Converters #2917

Modernize JPEG Color Converters #2917

Conversation

JimBobSquarePants commented May 5, 2025

Prerequisites

Description

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

beeradmoore commented May 5, 2025

JimBobSquarePants commented May 5, 2025

Choose a reason for hiding this comment

JimBobSquarePants May 5, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tannergooding May 5, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tannergooding left a comment

Choose a reason for hiding this comment

beeradmoore commented May 5, 2025

Debug (this PR)

Debug (3.1.8)

Release (this pr)

Release (3.1.8)

Test devices

beeradmoore commented May 5, 2025 • edited Loading

Release - Android physical device

JimBobSquarePants commented May 6, 2025

beeradmoore commented May 6, 2025

Debug (3.1.8)

Debug (this updated PR)

Release (3.1.8)

Release (this updated PR)

JimBobSquarePants commented May 6, 2025

beeradmoore commented May 6, 2025

JimBobSquarePants commented May 6, 2025

JimBobSquarePants May 5, 2025 •

edited

Loading

tannergooding May 5, 2025 •

edited

Loading

beeradmoore commented May 5, 2025 •

edited

Loading