Conversation
@IvoDD IvoDD commented Oct 13, 2025

Reference Issues/PRs

Monday ref: 18168355911

What does this implement or fix?

Forces the compiler to inline the lambdas everywhere for_each_enumerated is called.

This fixes the performance regression seen in #2700 and should improve performance elsewhere in pipeline processing where for_each_enumerated is used (by up to 10-20%).

Inlining comes with a trade-off, of course: each lambda is emitted in the assembly twice (once for each branch of the if). This extra size should be negligible compared to the rest of our binary.

Godbolt links showing how different compilers inline a toy example. Note that -O3 always inlines the tiny example lambda, so all the links show the assembly without optimizations. I suspect that in our real use case the code is larger and the compiler decides not to inline in order to save space.

  • GCC, no inlining.
  • GCC, with inlining: produces one new lambda with an inlined special_for_each, where the two branches each contain a copy of the lambda.
  • Clang, with inlining: inlines the lambda into two different new lambdas containing the forward and backward passes, which are called from the two branches of special_for_each.
  • MSVC, with inlining: inlines the lambda by producing two different special_for_each functions, each with the lambda inlined.
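The toy example can be sketched roughly as follows (special_for_each, the branch structure, and FORCE_INLINE_LAMBDA are illustrative stand-ins, not the actual ArcticDB code):

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-in for the force-inline hint: a GNU-style attribute,
// accepted by GCC and Clang after a lambda's parameter list; no-op elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#define FORCE_INLINE_LAMBDA __attribute__((always_inline))
#else
#define FORCE_INLINE_LAMBDA
#endif

// Toy special_for_each: a runtime flag picks a forward or backward pass,
// so a fully inlined lambda body appears once per branch in the assembly.
template <typename F>
void special_for_each(const std::vector<int>& v, bool forward, F&& f) {
    if (forward) {
        for (std::size_t i = 0; i < v.size(); ++i)
            f(i, v[i]); // first inlined copy of the lambda
    } else {
        for (std::size_t i = v.size(); i-- > 0;)
            f(i, v[i]); // second inlined copy of the lambda
    }
}

int sum_enumerated(const std::vector<int>& v, bool forward) {
    int total = 0;
    special_for_each(v, forward, [&](std::size_t, int x) FORCE_INLINE_LAMBDA { total += x; });
    return total;
}
```

With the hint the lambda body lands inside both loops; without it, a large enough lambda may be outlined once and called from each branch.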

Any other comments?

Benchmark results (all using ./benchmarks --benchmark_filter="BM_arrow_string.*1/0" --benchmark_min_time=2000x --benchmark_repetitions=5 --benchmark_time_unit=us)
Performance before #2700:

-----------------------------------------------------------------------------------------
Benchmark                                            Time            CPU       Iterations
-----------------------------------------------------------------------------------------
BM_arrow_string_handler/10000/1/0_mean               59.7 us         65.1 us            5
BM_arrow_string_handler/10000/1/0_stddev             1.19 us         1.29 us            5

BM_arrow_string_handler/100000/1/0_mean               503 us          544 us            5
BM_arrow_string_handler/100000/1/0_stddev            8.52 us         3.18 us            5

Performance after #2700:

------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_arrow_string_handler/10000/1/0_mean          70.3 us         72.4 us            5
BM_arrow_string_handler/10000/1/0_stddev        1.16 us         1.19 us            5

BM_arrow_string_handler/100000/1/0_mean          641 us          669 us            5
BM_arrow_string_handler/100000/1/0_stddev       13.1 us         8.05 us            5

Performance after this PR is in line with before #2700:

------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_arrow_string_handler/10000/1/0_mean          55.9 us         58.6 us            5
BM_arrow_string_handler/10000/1/0_stddev       0.596 us        0.625 us            5

BM_arrow_string_handler/100000/1/0_mean          513 us          527 us            5
BM_arrow_string_handler/100000/1/0_stddev       14.6 us         10.0 us            5

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@IvoDD IvoDD force-pushed the fix-inlining-in-for-each-enumerated branch from 8188afc to 2cccfae on October 14, 2025 06:27
This should fix the performance regression seen in #2700
In a follow up commit we will improve performance in other
pipeline processing which uses `for_each_enumerated`.
@IvoDD IvoDD force-pushed the fix-inlining-in-for-each-enumerated branch from 2cccfae to 3028dfe on October 14, 2025 07:51
@IvoDD IvoDD force-pushed the fix-inlining-in-for-each-enumerated branch from 3028dfe to 61798e5 on October 14, 2025 07:53
@IvoDD IvoDD added the patch (Small change, should increase patch version) and performance labels Oct 14, 2025
I think we could make this less verbose by defining everything as it is here, and then two additional macros:

// Variadic so that parameter lists containing commas work
#define ARCTICDB_LAMBDA_INLINE(...) ARCTICDB_LAMBDA_INLINE_PRE(__VA_ARGS__) ARCTICDB_LAMBDA_INLINE_MID ARCTICDB_LAMBDA_INLINE_POST
#define ARCTICDB_LAMBDA_INLINE_NOEXCEPT(...) ARCTICDB_LAMBDA_INLINE_PRE(__VA_ARGS__) ARCTICDB_LAMBDA_INLINE_MID noexcept ARCTICDB_LAMBDA_INLINE_POST

++unique_offset_count;
for_each_enumerated<ArcticStringColumnTag>(
source_column,
[&] ARCTICDB_LAMBDA_INLINE_PRE(const auto& en) ARCTICDB_LAMBDA_INLINE_MID ARCTICDB_LAMBDA_INLINE_POST {
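For context, one possible set of underlying definitions that would make such a combined macro compile. The exact PRE/MID/POST bodies here are an assumption for illustration (abridged to a GCC/Clang branch plus a no-op fallback), not the actual ArcticDB definitions:

```cpp
// Assumed split definitions: PRE carries the parameter list, MID the
// inline hint, POST is empty on these compilers.
#if defined(__GNUC__) || defined(__clang__)
#define ARCTICDB_LAMBDA_INLINE_PRE(...) (__VA_ARGS__)
#define ARCTICDB_LAMBDA_INLINE_MID __attribute__((always_inline))
#define ARCTICDB_LAMBDA_INLINE_POST
#else
#define ARCTICDB_LAMBDA_INLINE_PRE(...) (__VA_ARGS__)
#define ARCTICDB_LAMBDA_INLINE_MID
#define ARCTICDB_LAMBDA_INLINE_POST
#endif

// The suggested combined macro then composes the three parts:
#define ARCTICDB_LAMBDA_INLINE(...) \
    ARCTICDB_LAMBDA_INLINE_PRE(__VA_ARGS__) ARCTICDB_LAMBDA_INLINE_MID ARCTICDB_LAMBDA_INLINE_POST

int demo() {
    // One macro call replaces the PRE/MID/POST triple at the call site.
    auto add = [] ARCTICDB_LAMBDA_INLINE(int a, int b) { return a + b; };
    return add(2, 3);
}
```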
Writing all those pre/mid/post seems a bit ugly. We can shorten the macro like so:

#if defined(__clang__)
#define LAMBDA_INLINE_ARGS(...) (__VA_ARGS__) __attribute__((always_inline))
#define LAMBDA_INLINE_ARGS_1(m1, ...) (__VA_ARGS__) __attribute__((always_inline)) m1
#define LAMBDA_INLINE_ARGS_2(m1, m2, ...) (__VA_ARGS__) __attribute__((always_inline)) m1 m2
#elif defined(__GNUC__)
#define LAMBDA_INLINE_ARGS(...) [[gnu::always_inline]](__VA_ARGS__)
#define LAMBDA_INLINE_ARGS_1(m1, ...) [[gnu::always_inline]](__VA_ARGS__) m1
#define LAMBDA_INLINE_ARGS_2(m1, m2, ...) [[gnu::always_inline]](__VA_ARGS__) m1 m2
#elif defined(_MSC_VER) && _MSC_VER >= 1927 && __cplusplus >= 202002L
#define LAMBDA_INLINE_ARGS(...) (__VA_ARGS__) [[msvc::forceinline]]
#define LAMBDA_INLINE_ARGS_1(m1, ...) (__VA_ARGS__) m1 [[msvc::forceinline]]
#define LAMBDA_INLINE_ARGS_2(m1, m2, ...) (__VA_ARGS__) m1 m2 [[msvc::forceinline]]
#else
#define LAMBDA_INLINE_ARGS(...) (__VA_ARGS__)
#define LAMBDA_INLINE_ARGS_1(m1, ...) (__VA_ARGS__) m1
#define LAMBDA_INLINE_ARGS_2(m1, m2, ...) (__VA_ARGS__) m1 m2
#endif

The _1 and _2 "overloads" are needed to handle the static/mutable/noexcept modifiers. The standard prohibits using static and mutable at the same time, so two overloads are enough.
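A minimal, compilable sketch of how such macros read at a call site. This is an abridged variant with a single GNU-style branch that places the attribute after the modifier (the standard attribute position), not the exact macros above:

```cpp
// Abridged sketch: one GNU-style branch (GCC and Clang both accept a GNU
// attribute after the lambda declarator) plus a no-op fallback.
#if defined(__GNUC__) || defined(__clang__)
#define LAMBDA_INLINE_ARGS(...) (__VA_ARGS__) __attribute__((always_inline))
#define LAMBDA_INLINE_ARGS_1(m1, ...) (__VA_ARGS__) m1 __attribute__((always_inline))
#else
#define LAMBDA_INLINE_ARGS(...) (__VA_ARGS__)
#define LAMBDA_INLINE_ARGS_1(m1, ...) (__VA_ARGS__) m1
#endif

int demo() {
    // No trailing modifier: the parameters go straight through the macro.
    auto add = [] LAMBDA_INLINE_ARGS(int a, int b) { return a + b; };
    // One trailing modifier (noexcept here) is passed as the first macro argument.
    auto inc = [] LAMBDA_INLINE_ARGS_1(noexcept, int x) { return x + 1; };
    return add(2, 3) + inc(4);
}
```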

@vasil-pashov

I'm on the fence if we should apply the inlining everywhere or only in places where we know it's beneficial. I'm leaning towards doing only in certain places but it's not a hard opinion.


IvoDD commented Oct 15, 2025

I'm on the fence if we should apply the inlining everywhere or only in places where we know it's beneficial. I'm leaning towards doing only in certain places but it's not a hard opinion.

I've only applied the macro to lambdas passed to for_each_enumerated. Since in that case the lambda is called in a very tight loop, I think it is always beneficial to inline (and, as we have seen, the compiler may not inline it without being instructed to).

I also suspect there is a benefit at some of the other call sites, but our existing benchmarks don't stress those codepaths enough.

@IvoDD IvoDD marked this pull request as ready for review October 15, 2025 14:35
@IvoDD IvoDD requested a review from poodlewars as a code owner October 15, 2025 14:35
@IvoDD IvoDD merged commit 4663a6a into master Oct 15, 2025
272 of 274 checks passed
@IvoDD IvoDD deleted the fix-inlining-in-for-each-enumerated branch October 15, 2025 14:35