-
-
Notifications
You must be signed in to change notification settings - Fork 870
Modernize JPEG Color Converters #2917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR modernizes the JPEG color conversion implementations by replacing the old CPU-specific code with new implementations that leverage .NET 8’s VectorXXX types for improved cross‐platform performance and simplified code. The key changes include new vectorized implementations for 512-, 256-, and 128‑bit conversions for RGB, Grayscale, and CMYK channels; renaming of classes and methods for consistency; and removal of legacy vector and scalar implementations.
Reviewed Changes
Copilot reviewed 44 out of 44 changed files in this pull request and generated no comments.
Show a summary per file
File | Description |
---|---|
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.RgbVector512.cs | Introduces a new 512-bit vectorized implementation for RGB conversion. |
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.RgbVector256.cs | Updates 256-bit RGB conversion and renames class/methods to align with modern naming. |
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.RgbVector128.cs | Revises the 128-bit RGB conversion to use operator overloads for clarity. |
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.RgbVector.cs | Legacy vectorized RGB conversion removed. |
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.RgbScalar.cs | Updates scalar RGB conversion with improved naming consistency. |
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.GrayScaleVector512.cs | Introduces a new 512-bit vectorized implementation for grayscale conversion. |
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.GrayScaleVector256.cs | Updates 256-bit grayscale conversion with corrected comments and naming. |
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.GrayScaleVector128.cs | Revises 128-bit grayscale conversion in line with updated naming and operator usage. |
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.GrayScaleVector.cs | Legacy vectorized grayscale conversion removed. |
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.GrayScaleScalar.cs | Updates scalar grayscale conversion for consistency. |
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.CmykVector512.cs | Introduces a new 512-bit vectorized CMYK conversion implementation. |
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.CmykVector256.cs | Updates 256-bit CMYK conversion with modern API usage and renaming. |
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.CmykVector128.cs | Revises 128-bit CMYK conversion and improves vector operations. |
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.CmykVector.cs | Legacy vectorized CMYK conversion removed. |
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.CmykScalar.cs | Updates scalar CMYK conversion with improved method naming. |
src/ImageSharp/Common/Helpers/Vector[128/256/512]Utilities.cs, SimdUtils.cs, SimdUtils.HwIntrinsics.cs | Modernizes SIMD helper functions to support new language patterns and operator overloads. |
Comments suppressed due to low confidence (2)
src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs:652
- The TODO comment indicates an order discrepancy in the FMA intrinsic usage; clarify in the documentation how the argument order relates to the intrinsic behavior and update the comment once standardized.
return Sse.Add(Sse.Multiply(vm0, vm1), va);
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.CmykVector256.cs:81
- [nitpick] Verify that using 'Equals' to create the mask for handling division-by-zero scenarios is optimal across all target architectures; consider benchmarking against alternative intrinsic calls if available.
Vector256<float> kMask = Vector256.Equals(ktmp, scale);
@JimBobSquarePants, sure thing. Would also be nice to update my results to .NET 9 as well. By chance do any of them GitHub builds put out a nuget package or am I building it myself for the tests? |
Legend! They only produce a package once merged so you would have to build against this particular branch. Building should work out of the box. |
} | ||
|
||
return Sse.Add(Sse.Multiply(vm0, vm1), va); | ||
return va + (vm0 * vm1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually intend on removing this copy of the method in future PRs. Doing so in this one would have touched too many unrelated files.
/// </summary> | ||
/// <param name="v">The vector</param> | ||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
internal static Vector<float> FastRound(this Vector<float> v) | ||
{ | ||
if (Avx2.IsSupported) | ||
if (Avx2.IsSupported && Vector<float>.Count == Vector256<float>.Count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know whether this is ever false. But I wanted the check to be a little stricter given for the future potential of 512 supported Vector<T>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be false even when Avx2.IsSupported
as someone can set DOTNET_PreferredVectorBitWidth=128
, so having an explicit check is goodness.
Notably you have Sse41.RoundToNearestInteger
that can be used for 128-bit
, Avx512F.RoundScale
that can be used for 512-bit, and AdvSimd.RoundToNearest
that can be used for Arm64.
.NET 9+ defines Vector.Round
which can be used to handle the xplat consideration without having to manually code it all (and same method exists on Vector128/256/512
)
@@ -616,44 +616,6 @@ public static Vector256<float> MultiplyAdd( | |||
return Fma.MultiplyAdd(vm1, vm0, va); | |||
} | |||
|
|||
if (Avx.IsSupported) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: You can have a similar path for Arm64 (note the parameter order is different from Fma
on x64)
if (AdvSimd.IsSupported)
{
return AdvSimd.FusedMultiplyAdd(va, vm1, vm0);
}
When you're able to move to .NET 9+, you can switch all this to just Vector128.MultiplyAddEstimate(x, y, addend)
which handles this optimization for you.
@@ -193,13 +193,60 @@ public static Vector128<int> ConvertToInt32RoundToEven(Vector128<float> vector) | |||
return AdvSimd.ConvertToInt32RoundToEven(vector); | |||
} | |||
|
|||
Vector128<float> sign = vector & Vector128.Create(-0.0f); | |||
Vector128<float> val_2p23_f32 = sign | Vector128.Create(8388608.0f); | |||
Vector128<float> sign = vector & Vector128.Create(-0F); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No general helper exists for .NET 9+ here...
It's something we could probably pattern recognize as ConvertToInt32(Round(vector))
though. If it's a pattern you'd like to see us optimize, logging a suggestion on dotnet/runtime would be beneficial. Alternatively some ConvertToInt32(vector, MidpointRounding)
API could be provided as well and an API suggestion could be opened.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have a look at creating an issue. Thanks.
if (Avx.IsSupported) | ||
{ | ||
return Avx.Add(Avx.Multiply(vm0, vm1), va); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unnecessary, it does the same thing as the va + (vm0 * vm1)
logic just below
if (Fma.IsSupported) | ||
{ | ||
return Fma.MultiplyAdd(vm1, vm0, va); | ||
} | ||
|
||
return va + (vm0 * vm1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment as on the SimdUtils.HwIntrinsics.cs copy of the method, where you could add an AdvSimd
path for Arm64 and eventually switch to MultiplyAddEstimate
when you target .NET 9+
Vector128<float> ctmp = scale - Unsafe.Add(ref srcR, i); | ||
Vector128<float> mtmp = scale - Unsafe.Add(ref srcG, i); | ||
Vector128<float> ytmp = scale - Unsafe.Add(ref srcB, i); | ||
Vector128<float> ktmp = Vector128.Min(ctmp, Vector128.Min(mtmp, ytmp)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as a note, on .NET 9+ the behavior of Min
has changed for x86/x64.
On .NET 8 and prior, Vector128.Min
differed in behavior from Math.Min
and it basically just emitted whatever single instruction was fastest. This lead to non-determinism across x64, Arm64, and other platforms when dealing with -0
and NaN
.
On .NET 9 and later, Vector128.Min
now behaves identically to Math.Min
and is deterministic across all platforms. This can show as a performance regression on x64 for some older hardware as additional instructions must be executed to make this guarantee (newer hardware has dedicated instructions that behave correctly).
If (on .NET 9+) you don't have any concerns around how -0
is sorted compared to +0
and you don't have any concerns around NaN
, then you can use Vector128.MinNative
to continue getting the "fast" but potentially non-deterministic behavior.
Vector128<float> kMask = Vector128.Equals(ktmp, scale); | ||
ctmp = Vector128.AndNot((ctmp - ktmp) / (scale - ktmp), kMask); | ||
mtmp = Vector128.AndNot((mtmp - ktmp) / (scale - ktmp), kMask); | ||
ytmp = Vector128.AndNot((ytmp - ktmp) / (scale - ktmp), kMask); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This can be simplified a bit to the following, which should give the JIT more flexibility around register selection:
Vector128<float> kMask = ~Vector128.Equals(ktmp, scale);
Vector128<float> divisor = csale - ktmp;
ctmp = ((ctmp - ktmp) / divisor) & kMask;
mtmp = ((mtmp - ktmp) / divisor) & kMask;
ytmp = ((ytmp - ktmp) / divisor) & kMask;
Namely this is just explicitly hoisting the scale - ktmp
for reuse and inverting the Equals
comparison up front, which will cause a not equals
comparison to be emitted instead and then allows the masking to do a bitwise-and
, which is commutative.
: base(colorSpace, precision) | ||
{ | ||
} | ||
|
||
public static bool IsSupported => AdvSimd.IsSupported; | ||
public static bool IsSupported => Vector128.IsHardwareAccelerated && Vector128<float>.IsSupported; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: You don't need to check Vector128<float>.IsSupported
, this is statically true
.
The Vector128<T>.IsSupported
is namely for generic contexts where you don't know concretely what the T
is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general you can presume that byte
, sbyte
, short
, ushort
, int
, uint
, long
, ulong
, float
, and double
are always supported.
You can presume that nint
and nuint
are supported on .NET 7+
We will likely add support for Half
at some point in the future, at which point a Vector128.AsHalf()
method will exist to "document" this guarantee of support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Just left a couple comments around things you could do for Arm64 support or things to keep in mind for .NET 9+
So some odd things here. It seems between 3.1.8 and this PR there is a regression in Android performance. However, going from 3.1.4 (or that other PR) to even the slowest results in this PR performance has improved. This could be the update to .NET 9, this could be changes in ImageSharp. Below are the numbers I currently have. This is also with these build configs which is I believe what we found to be optimal from that other issue. <PropertyGroup Condition="$([MSBuild]::GetTargetPlatformIdentifier('$(TargetFramework)')) == 'android' AND '$(Configuration)' == 'Release'">
<EnableLLVM>true</EnableLLVM>
<RunAOTCompilation>true</RunAOTCompilation>
<AndroidEnableProfiledAot>false</AndroidEnableProfiledAot>
</PropertyGroup>
<PropertyGroup Condition="$([MSBuild]::GetTargetPlatformIdentifier('$(TargetFramework)')) == 'android' AND '$(Configuration)' == 'Debug'">
<!-- performance improvements for debug mode, will break hot reload. -->
<UseInterpreter>false</UseInterpreter>
</PropertyGroup> Debug (this PR)
Debug (3.1.8)
Release (this pr)
Release (3.1.8)
Test devices
|
I did 3.1.4 in .NET 9, release mode on Android device and unfortunately it outperforms everything else here. It seems improvements we are seeing is from .NET 9 (or less likely Android 14 which got upgraded to Android 15) Release - Android physical device
As a mobile dev if I saw these numbers I'd be a little turned off ImageSharp (as a desktop dev I love it in DLSS Swapper). I'd need to check against SkiaSharp and rolling my own native Android implementation (using Android classes in .NET) to see if this is a library performance issue or an Android chugs along issue. If I were not using jpg I would not be concerned at all. |
@tannergooding Thanks for the review, I've made all the changes based on your feedback. @beeradmoore Those numbers are.... disappointing, I'm not sure where the regression is there. I've pushed some changes based on the feedback to I have, however found potential for some speedup before we hit the color conversion in If possible, could you run the benchmark again on this branch one last time before I merge? Regression or not the focus with Mono is heavily orientated on |
@JimBobSquarePants , updated results Debug (3.1.8)
Debug (this updated PR)
Release (3.1.8)
Release (this updated PR)
|
Thanks @beeradmoore I'm wondering now if there isn't some other regression elsewhere in the V4 branch. There is ample opportunity for more SIMD in |
Im going to take the work I did and throw it in a new app with benchmark.net so I can also test SkiaSharp as well as native implementations So numbers here wont be comparable to those Any other useful tests to add in? For each image format (jpg, png, webp) |
@beeradmoore That would be incredibly useful. Thank you! |
Prerequisites
Description
With .NET 8 we can take advantage of the
VectorXXX<T>
types to provide cross-platform hardware intrinsics with great support on all platforms. This allows us to simplify our code by normalizing the approach and additionally take advantage of performance improvements due to excellentVector128
support on Mono.This PR replaces the existing CPU-specific implementation with implementations for
Vector128
,Vector256
, andVector512
.Benchmarks look good overall (exceeding or meeting the main branch).
Vector512
is slightly disappointing and only shows marginal speedup on my machine. I believe this is due to two primary factors:Throttling - Intel CPUs aggressively downclock when AVX-512 instructions are executed to manage thermals and power draw. This frequency reduction applies across all cores, which can negate the expected throughput gains from using 512-bit wide vectors.
Memory utilization issues - AVX-512 loads and stores perform best when data is aligned to 64-byte boundaries. Without guaranteed alignment, memory operations may fall back to unaligned access paths that split operations internally, causing increased micro-op pressure and reduced bandwidth efficiency. Additionally, when working with larger vectors, cache line pressure and prefetcher behavior become more critical; misaligned or poorly laid-out memory can further exacerbate performance bottlenecks. This highlights the importance of using properly aligned memory when utilizing
Vector512<T>
in hot paths.Handling memory alignment is something we've discussed before and the APIs to do so exist now that would make it a lot easier. I do wonder though whether we should simply default to pinning with alignment given almost all our code goes through some sort of SIMD operation?
@beeradmoore I'm wondering if you'd be able to benchmark this branch using your approach from #2762 to see if we've finally reached decent performance on Android? Only if you have time of course, I wouldn't want to put you out.
@tannergooding I'm tagging you here if that's OK as we've discussed these changes previously and you input is always super valuable.