Skip to content

Modernize JPEG Color Converters #2917

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
May 6, 2025
Merged

Conversation

JimBobSquarePants
Copy link
Member

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

With .NET 8 we can take advantage of the VectorXXX<T> types to provide cross-platform hardware intrinsics with great support on all platforms. This allows us to simplify our code by normalizing the approach and additionally take advantage of performance improvements due to excellent Vector128 support on Mono.

This PR replaces the existing CPU-specific implementation with implementations for Vector128, Vector256, and Vector512.

Benchmarks look good overall (exceeding or meeting the main branch). Vector512 is slightly disappointing and only shows marginal speedup on my machine. I believe this is due to two primary factors:

  1. Throttling - Intel CPUs aggressively downclock when AVX-512 instructions are executed to manage thermals and power draw. This frequency reduction applies across all cores, which can negate the expected throughput gains from using 512-bit wide vectors.

  2. Memory utilization issues - AVX-512 loads and stores perform best when data is aligned to 64-byte boundaries. Without guaranteed alignment, memory operations may fall back to unaligned access paths that split operations internally, causing increased micro-op pressure and reduced bandwidth efficiency. Additionally, when working with larger vectors, cache line pressure and prefetcher behavior become more critical; misaligned or poorly laid-out memory can further exacerbate performance bottlenecks. This highlights the importance of using properly aligned memory when utilizing Vector512<T> in hot paths.

Handling memory alignment is something we've discussed before and the APIs to do so exist now that would make it a lot easier. I do wonder though whether we should simply default to pinning with alignment given almost all our code goes through some sort of SIMD operation?

@beeradmoore I'm wondering if you'd be able to benchmark this branch using your approach from #2762 to see if we've finally reached decent performance on Android? Only if you have time of course, I wouldn't want to put you out.

@tannergooding I'm tagging you here if that's OK as we've discussed these changes previously and you input is always super valuable.

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR modernizes the JPEG color conversion implementations by replacing the old CPU-specific code with new implementations that leverage .NET 8’s VectorXXX types for improved cross‐platform performance and simplified code. The key changes include new vectorized implementations for 512-, 256-, and 128‑bit conversions for RGB, Grayscale, and CMYK channels; renaming of classes and methods for consistency; and removal of legacy vector and scalar implementations.

Reviewed Changes

Copilot reviewed 44 out of 44 changed files in this pull request and generated no comments.

Show a summary per file
File Description
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.RgbVector512.cs Introduces a new 512-bit vectorized implementation for RGB conversion.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.RgbVector256.cs Updates 256-bit RGB conversion and renames class/methods to align with modern naming.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.RgbVector128.cs Revises the 128-bit RGB conversion to use operator overloads for clarity.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.RgbVector.cs Legacy vectorized RGB conversion removed.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.RgbScalar.cs Updates scalar RGB conversion with improved naming consistency.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.GrayScaleVector512.cs Introduces a new 512-bit vectorized implementation for grayscale conversion.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.GrayScaleVector256.cs Updates 256-bit grayscale conversion with corrected comments and naming.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.GrayScaleVector128.cs Revises 128-bit grayscale conversion in line with updated naming and operator usage.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.GrayScaleVector.cs Legacy vectorized grayscale conversion removed.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.GrayScaleScalar.cs Updates scalar grayscale conversion for consistency.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.CmykVector512.cs Introduces a new 512-bit vectorized CMYK conversion implementation.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.CmykVector256.cs Updates 256-bit CMYK conversion with modern API usage and renaming.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.CmykVector128.cs Revises 128-bit CMYK conversion and improves vector operations.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.CmykVector.cs Legacy vectorized CMYK conversion removed.
src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.CmykScalar.cs Updates scalar CMYK conversion with improved method naming.
src/ImageSharp/Common/Helpers/Vector[128/256/512]Utilities.cs, SimdUtils.cs, SimdUtils.HwIntrinsics.cs Modernizes SIMD helper functions to support new language patterns and operator overloads.
Comments suppressed due to low confidence (2)

src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs:652

  • The TODO comment indicates an order discrepancy in the FMA intrinsic usage; clarify in the documentation how the argument order relates to the intrinsic behavior and update the comment once standardized.
return Sse.Add(Sse.Multiply(vm0, vm1), va);

src/ImageSharp/Formats/Jpeg/Components/ColorConverters/JpegColorConverter.CmykVector256.cs:81

  • [nitpick] Verify that using 'Equals' to create the mask for handling division-by-zero scenarios is optimal across all target architectures; consider benchmarking against alternative intrinsic calls if available.
Vector256<float> kMask = Vector256.Equals(ktmp, scale);

@beeradmoore
Copy link

@JimBobSquarePants, sure thing. Would also be nice to update my results to .NET 9 as well.

By chance do any of them GitHub builds put out a nuget package or am I building it myself for the tests?

@JimBobSquarePants
Copy link
Member Author

@JimBobSquarePants, sure thing. Would also be nice to update my results to .NET 9 as well.

By chance do any of them GitHub builds put out a nuget package or am I building it myself for the tests?

Legend!

They only produce a package once merged so you would have to build against this particular branch. Building should work out of the box.

}

return Sse.Add(Sse.Multiply(vm0, vm1), va);
return va + (vm0 * vm1);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually intend on removing this copy of the method in future PRs. Doing so in this one would have touched too many unrelated files.

/// </summary>
/// <param name="v">The vector</param>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static Vector<float> FastRound(this Vector<float> v)
{
if (Avx2.IsSupported)
if (Avx2.IsSupported && Vector<float>.Count == Vector256<float>.Count)
Copy link
Member Author

@JimBobSquarePants JimBobSquarePants May 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know whether this is ever false. But I wanted the check to be a little stricter given for the future potential of 512 supported Vector<T>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can be false even when Avx2.IsSupported as someone can set DOTNET_PreferredVectorBitWidth=128, so having an explicit check is goodness.

Notably you have Sse41.RoundToNearestInteger that can be used for 128-bit, Avx512F.RoundScale that can be used for 512-bit, and AdvSimd.RoundToNearest that can be used for Arm64.

.NET 9+ defines Vector.Round which can be used to handle the xplat consideration without having to manually code it all (and same method exists on Vector128/256/512)

@@ -616,44 +616,6 @@ public static Vector256<float> MultiplyAdd(
return Fma.MultiplyAdd(vm1, vm0, va);
}

if (Avx.IsSupported)
Copy link
Contributor

@tannergooding tannergooding May 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: You can have a similar path for Arm64 (note the parameter order is different from Fma on x64)

if (AdvSimd.IsSupported)
{
    return AdvSimd.FusedMultiplyAdd(va, vm1, vm0);
}

When you're able to move to .NET 9+, you can switch all this to just Vector128.MultiplyAddEstimate(x, y, addend) which handles this optimization for you.

@@ -193,13 +193,60 @@ public static Vector128<int> ConvertToInt32RoundToEven(Vector128<float> vector)
return AdvSimd.ConvertToInt32RoundToEven(vector);
}

Vector128<float> sign = vector & Vector128.Create(-0.0f);
Vector128<float> val_2p23_f32 = sign | Vector128.Create(8388608.0f);
Vector128<float> sign = vector & Vector128.Create(-0F);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No general helper exists for .NET 9+ here...

It's something we could probably pattern recognize as ConvertToInt32(Round(vector)) though. If it's a pattern you'd like to see us optimize, logging a suggestion on dotnet/runtime would be beneficial. Alternatively some ConvertToInt32(vector, MidpointRounding) API could be provided as well and an API suggestion could be opened.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have a look at creating an issue. Thanks.

Comment on lines 152 to 155
if (Avx.IsSupported)
{
return Avx.Add(Avx.Multiply(vm0, vm1), va);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unnecessary, it does the same thing as the va + (vm0 * vm1) logic just below

Comment on lines 242 to 247
if (Fma.IsSupported)
{
return Fma.MultiplyAdd(vm1, vm0, va);
}

return va + (vm0 * vm1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment as on the SimdUtils.HwIntrinsics.cs copy of the method, where you could add an AdvSimd path for Arm64 and eventually switch to MultiplyAddEstimate when you target .NET 9+

Vector128<float> ctmp = scale - Unsafe.Add(ref srcR, i);
Vector128<float> mtmp = scale - Unsafe.Add(ref srcG, i);
Vector128<float> ytmp = scale - Unsafe.Add(ref srcB, i);
Vector128<float> ktmp = Vector128.Min(ctmp, Vector128.Min(mtmp, ytmp));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as a note, on .NET 9+ the behavior of Min has changed for x86/x64.

On .NET 8 and prior, Vector128.Min differed in behavior from Math.Min and it basically just emitted whatever single instruction was fastest. This lead to non-determinism across x64, Arm64, and other platforms when dealing with -0 and NaN.

On .NET 9 and later, Vector128.Min now behaves identically to Math.Min and is deterministic across all platforms. This can show as a performance regression on x64 for some older hardware as additional instructions must be executed to make this guarantee (newer hardware has dedicated instructions that behave correctly).

If (on .NET 9+) you don't have any concerns around how -0 is sorted compared to +0 and you don't have any concerns around NaN, then you can use Vector128.MinNative to continue getting the "fast" but potentially non-deterministic behavior.

Comment on lines 81 to 84
Vector128<float> kMask = Vector128.Equals(ktmp, scale);
ctmp = Vector128.AndNot((ctmp - ktmp) / (scale - ktmp), kMask);
mtmp = Vector128.AndNot((mtmp - ktmp) / (scale - ktmp), kMask);
ytmp = Vector128.AndNot((ytmp - ktmp) / (scale - ktmp), kMask);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This can be simplified a bit to the following, which should give the JIT more flexibility around register selection:

Vector128<float> kMask = ~Vector128.Equals(ktmp, scale);
Vector128<float> divisor = csale - ktmp;

ctmp = ((ctmp - ktmp) / divisor) & kMask;
mtmp = ((mtmp - ktmp) / divisor) & kMask;
ytmp = ((ytmp - ktmp) / divisor) & kMask;

Namely this is just explicitly hoisting the scale - ktmp for reuse and inverting the Equals comparison up front, which will cause a not equals comparison to be emitted instead and then allows the masking to do a bitwise-and, which is commutative.

: base(colorSpace, precision)
{
}

public static bool IsSupported => AdvSimd.IsSupported;
public static bool IsSupported => Vector128.IsHardwareAccelerated && Vector128<float>.IsSupported;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: You don't need to check Vector128<float>.IsSupported, this is statically true.

The Vector128<T>.IsSupported is namely for generic contexts where you don't know concretely what the T is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general you can presume that byte, sbyte, short, ushort, int, uint, long, ulong, float, and double are always supported.

You can presume that nint and nuint are supported on .NET 7+

We will likely add support for Half at some point in the future, at which point a Vector128.AsHalf() method will exist to "document" this guarantee of support.

Copy link
Contributor

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just left a couple comments around things you could do for Arm64 support or things to keep in mind for .NET 9+

@beeradmoore
Copy link

So some odd things here. It seems between 3.1.8 and this PR there is a regression in Android performance. However, going from 3.1.4 (or that other PR) to even the slowest results in this PR performance has improved. This could be the update to .NET 9, this could be changes in ImageSharp. Below are the numbers I currently have.

This is also with these build configs which is I believe what we found to be optimal from that other issue.

    <PropertyGroup Condition="$([MSBuild]::GetTargetPlatformIdentifier('$(TargetFramework)')) == 'android' AND '$(Configuration)' == 'Release'">
        <EnableLLVM>true</EnableLLVM>
        <RunAOTCompilation>true</RunAOTCompilation>
        <AndroidEnableProfiledAot>false</AndroidEnableProfiledAot>
    </PropertyGroup>

    <PropertyGroup Condition="$([MSBuild]::GetTargetPlatformIdentifier('$(TargetFramework)')) == 'android' AND '$(Configuration)' == 'Debug'">
        <!-- performance improvements for debug mode, will break hot reload. -->
        <UseInterpreter>false</UseInterpreter>
    </PropertyGroup>

Debug (this PR)

Device JpgLoad JpgResize PngLoad PngResize
iPhone
iOS Simulator
Android 1375.0 1677.0 38.1 47.7
Android Emulator 226.8 284.9 13.6 14.9
macOS
Windows

Debug (3.1.8)

Device JpgLoad JpgResize PngLoad PngResize
iPhone
iOS Simulator
Android 1080.9 1302.9 37.6 45.4
Android Emulator 189.4 238.6 13.5 14.3
macOS
Windows

Release (this pr)

Device JpgLoad JpgResize PngLoad PngResize
iPhone
iOS Simulator
Android 339.7 467.7 20.4 27.1
Android Emulator 91.6 116.7 9.7 10.2
macOS
Windows

Release (3.1.8)

Device JpgLoad JpgResize PngLoad PngResize
iPhone
iOS Simulator
Android 290.2 393.0 19.7 24.0
Android Emulator 80.1 104.7 15.0 16.2
macOS
Windows

Test devices

Device Hardware OS
iPhone iPhone 15 Pro Max 18.4.1
iOS Simulator iPhone 15 18.4
Android Pixel 2 XL Android 15
Android Emulator Pixel 3a Android 14
macOS MacBook Pro M3 Max Sequoia 15.4.1
Windows AMD 5950X + RTX 3060 Windows 11 24H2

@beeradmoore
Copy link

beeradmoore commented May 5, 2025

I did 3.1.4 in .NET 9, release mode on Android device and unfortunately it outperforms everything else here. It seems improvements we are seeing is from .NET 9 (or less likely Android 14 which got upgraded to Android 15)

Release - Android physical device

Device JpgLoad JpgResize PngLoad PngResize
3.1.4 (.NET 8) 515.3 712.2 26.9 35.4
3.1.4 (.NET 9) 285.2 389.6 20.0 24.5
3.1.8 (.NET 9) 290.2 393.0 19.7 24.0
This PR (.NET 9) 339.7 467.7 20.4 27.1

As a mobile dev if I saw these numbers I'd be a little turned off ImageSharp (as a desktop dev I love it in DLSS Swapper). I'd need to check against SkiaSharp and rolling my own native Android implementation (using Android classes in .NET) to see if this is a library performance issue or an Android chugs along issue.

If I were not using jpg I would not be concerned at all.

@JimBobSquarePants
Copy link
Member Author

@tannergooding Thanks for the review, I've made all the changes based on your feedback.

@beeradmoore Those numbers are.... disappointing, I'm not sure where the regression is there. I've pushed some changes based on the feedback to MultiplyAdd which should provide a speedup on the YCbCr conversion (and others) but otherwise the conversion code should be solid.

I have, however found potential for some speedup before we hit the color conversion in Block8x8F where we are falling back to scalar rounding on non Avx platforms. I should be able to add SIMD there easily enough. I'll do that in a follow up PR though.

If possible, could you run the benchmark again on this branch one last time before I merge? Regression or not the focus with Mono is heavily orientated on Vector128 so performance can only improve there.

@beeradmoore
Copy link

@JimBobSquarePants , updated results

Debug (3.1.8)

Device JpgLoad JpgResize PngLoad PngResize
Android 1084.1 1312.4 37.1 44.4
Android Emulator 189.5 245.1 13.8 14.2

Debug (this updated PR)

Device JpgLoad JpgResize PngLoad PngResize
Android 1344.77 1586.2 37.5 48.1
Android Emulator 233.3 285.8 13.8 15.6

Release (3.1.8)

Device JpgLoad JpgResize PngLoad PngResize
Android 285.5 392.9 19.3 26.1
Android Emulator 83.7 96.2 9.4 9.9

Release (this updated PR)

Device JpgLoad JpgResize PngLoad PngResize
Android 341.0 469.4 20.1 25.8
Android Emulator 99.3 121.1 9.2 10.4

@JimBobSquarePants
Copy link
Member Author

Thanks @beeradmoore

I'm wondering now if there isn't some other regression elsewhere in the V4 branch. There is ample opportunity for more SIMD in Block8x8F though so will do that next.

@JimBobSquarePants JimBobSquarePants merged commit 3fbbde9 into main May 6, 2025
8 checks passed
@JimBobSquarePants JimBobSquarePants deleted the js/simplify-jpeg-converters branch May 6, 2025 06:16
@beeradmoore
Copy link

Im going to take the work I did and throw it in a new app with benchmark.net so I can also test SkiaSharp as well as native implementations

So numbers here wont be comparable to those

Any other useful tests to add in?

For each image format (jpg, png, webp)
Load
Load + save
Load + resize + save

@JimBobSquarePants
Copy link
Member Author

@beeradmoore That would be incredibly useful. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants