Conversation
@IvoDD IvoDD commented Oct 13, 2025

Reference Issues/PRs

Monday ref: 18168355911

What does this implement or fix?

Forces the compiler to inline the lambdas everywhere for_each_enumerated is called.

This fixes the performance regression seen in #2700 and should improve performance elsewhere in pipeline processing where for_each_enumerated is used (by up to 10-20%).

Inlining comes with a trade-off, of course: each lambda is emitted in the assembly twice (once for each branch of the if). This extra size should be negligible compared to the rest of our binary.

Godbolt links showing how different compilers inline a toy example. Note that -O3 always inlines the tiny example lambda, so all the links show the assembly without optimizations. I suspect that in our real use case the code is larger and the compiler decides not to inline in order to save space.

  • GCC, no inlining.
  • GCC, with inlining: produces one new lambda with an inlined special_for_each, where the two branches each contain a copy of the lambda.
  • Clang, with inlining: inlines the lambda into two different new lambdas containing the forward and backward passes, which are called from the two branches of special_for_each.
  • MSVC, with inlining: inlines the lambda by producing two different special_for_each functions, each with the lambda inlined.
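The toy example can be sketched roughly as follows (special_for_each, the branch structure, and FORCE_INLINE_LAMBDA are illustrative stand-ins, not the actual ArcticDB code):

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-in for the force-inline hint: a GNU-style attribute,
// accepted by GCC and Clang after a lambda's parameter list; no-op elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#define FORCE_INLINE_LAMBDA __attribute__((always_inline))
#else
#define FORCE_INLINE_LAMBDA
#endif

// Toy special_for_each: a runtime flag picks a forward or backward pass,
// so a fully inlined lambda body appears once per branch in the assembly.
template <typename F>
void special_for_each(const std::vector<int>& v, bool forward, F&& f) {
    if (forward) {
        for (std::size_t i = 0; i < v.size(); ++i)
            f(i, v[i]); // first inlined copy of the lambda
    } else {
        for (std::size_t i = v.size(); i-- > 0;)
            f(i, v[i]); // second inlined copy of the lambda
    }
}

int sum_enumerated(const std::vector<int>& v, bool forward) {
    int total = 0;
    special_for_each(v, forward, [&](std::size_t, int x) FORCE_INLINE_LAMBDA { total += x; });
    return total;
}
```

With the hint the lambda body lands inside both loops; without it, a large enough lambda may be outlined once and called from each branch.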

Any other comments?

Benchmark results (all using ./benchmarks --benchmark_filter="BM_arrow_string.*1/0" --benchmark_min_time=2000x --benchmark_repetitions=5 --benchmark_time_unit=us)
Performance before #2700:

-----------------------------------------------------------------------------------------
Benchmark                                            Time            CPU       Iterations
-----------------------------------------------------------------------------------------
BM_arrow_string_handler/10000/1/0_mean               59.7 us         65.1 us            5
BM_arrow_string_handler/10000/1/0_stddev             1.19 us         1.29 us            5

BM_arrow_string_handler/100000/1/0_mean               503 us          544 us            5
BM_arrow_string_handler/100000/1/0_stddev            8.52 us         3.18 us            5

Performance after #2700:

------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_arrow_string_handler/10000/1/0_mean          70.3 us         72.4 us            5
BM_arrow_string_handler/10000/1/0_stddev        1.16 us         1.19 us            5

BM_arrow_string_handler/100000/1/0_mean          641 us          669 us            5
BM_arrow_string_handler/100000/1/0_stddev       13.1 us         8.05 us            5

Performance after this PR is in line with before #2700:

------------------------------------------------------------------------------------
Benchmark                                          Time             CPU   Iterations
------------------------------------------------------------------------------------
BM_arrow_string_handler/10000/1/0_mean          55.9 us         58.6 us            5
BM_arrow_string_handler/10000/1/0_stddev       0.596 us        0.625 us            5

BM_arrow_string_handler/100000/1/0_mean          513 us          527 us            5
BM_arrow_string_handler/100000/1/0_stddev       14.6 us         10.0 us            5

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@IvoDD IvoDD force-pushed the fix-inlining-in-for-each-enumerated branch from 8188afc to 2cccfae on October 14, 2025 06:27
This should fix the performance regression seen in #2700
In a follow up commit we will improve performance in other
pipeline processing which uses `for_each_enumerated`.
@IvoDD IvoDD force-pushed the fix-inlining-in-for-each-enumerated branch from 2cccfae to 3028dfe on October 14, 2025 07:51
@IvoDD IvoDD force-pushed the fix-inlining-in-for-each-enumerated branch from 3028dfe to 61798e5 on October 14, 2025 07:53
@IvoDD IvoDD added the patch (Small change, should increase patch version) and performance labels Oct 14, 2025
I think we could make this less verbose by defining everything as it is here, and then two additional macros:

// Variadic so that parameter lists containing commas work
#define ARCTICDB_LAMBDA_INLINE(...) ARCTICDB_LAMBDA_INLINE_PRE(__VA_ARGS__) ARCTICDB_LAMBDA_INLINE_MID ARCTICDB_LAMBDA_INLINE_POST
#define ARCTICDB_LAMBDA_INLINE_NOEXCEPT(...) ARCTICDB_LAMBDA_INLINE_PRE(__VA_ARGS__) ARCTICDB_LAMBDA_INLINE_MID noexcept ARCTICDB_LAMBDA_INLINE_POST

++unique_offset_count;
for_each_enumerated<ArcticStringColumnTag>(
source_column,
[&] ARCTICDB_LAMBDA_INLINE_PRE(const auto& en) ARCTICDB_LAMBDA_INLINE_MID ARCTICDB_LAMBDA_INLINE_POST {
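For context, one possible set of underlying definitions that would make such a combined macro compile. The exact PRE/MID/POST bodies here are an assumption for illustration (abridged to a GCC/Clang branch plus a no-op fallback), not the actual ArcticDB definitions:

```cpp
// Assumed split definitions: PRE carries the parameter list, MID the
// inline hint, POST is empty on these compilers.
#if defined(__GNUC__) || defined(__clang__)
#define ARCTICDB_LAMBDA_INLINE_PRE(...) (__VA_ARGS__)
#define ARCTICDB_LAMBDA_INLINE_MID __attribute__((always_inline))
#define ARCTICDB_LAMBDA_INLINE_POST
#else
#define ARCTICDB_LAMBDA_INLINE_PRE(...) (__VA_ARGS__)
#define ARCTICDB_LAMBDA_INLINE_MID
#define ARCTICDB_LAMBDA_INLINE_POST
#endif

// The suggested combined macro then composes the three parts:
#define ARCTICDB_LAMBDA_INLINE(...) \
    ARCTICDB_LAMBDA_INLINE_PRE(__VA_ARGS__) ARCTICDB_LAMBDA_INLINE_MID ARCTICDB_LAMBDA_INLINE_POST

int demo() {
    // One macro call replaces the PRE/MID/POST triple at the call site.
    auto add = [] ARCTICDB_LAMBDA_INLINE(int a, int b) { return a + b; };
    return add(2, 3);
}
```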
Writing all those pre/mid/post seems a bit ugly. We can shorten the macro like so:

#if defined(__clang__)
#define LAMBDA_INLINE_ARGS(...) (__VA_ARGS__) __attribute__((always_inline))
#define LAMBDA_INLINE_ARGS_1(m1, ...) (__VA_ARGS__) __attribute__((always_inline)) m1
#define LAMBDA_INLINE_ARGS_2(m1, m2, ...) (__VA_ARGS__) __attribute__((always_inline)) m1 m2
#elif defined(__GNUC__)
#define LAMBDA_INLINE_ARGS(...) [[gnu::always_inline]](__VA_ARGS__)
#define LAMBDA_INLINE_ARGS_1(m1, ...) [[gnu::always_inline]](__VA_ARGS__) m1
#define LAMBDA_INLINE_ARGS_2(m1, m2, ...) [[gnu::always_inline]](__VA_ARGS__) m1 m2
#elif defined(_MSC_VER) && _MSC_VER >= 1927 && __cplusplus >= 202002L
#define LAMBDA_INLINE_ARGS(...) (__VA_ARGS__) [[msvc::forceinline]]
#define LAMBDA_INLINE_ARGS_1(m1, ...) (__VA_ARGS__) m1 [[msvc::forceinline]]
#define LAMBDA_INLINE_ARGS_2(m1, m2, ...) (__VA_ARGS__) m1 m2 [[msvc::forceinline]]
#else
#define LAMBDA_INLINE_ARGS(...) (__VA_ARGS__)
#define LAMBDA_INLINE_ARGS_1(m1, ...) (__VA_ARGS__) m1
#define LAMBDA_INLINE_ARGS_2(m1, m2, ...) (__VA_ARGS__) m1 m2
#endif

The _1 and _2 "overloads" are needed to handle the static/mutable/noexcept modifiers. The standard prohibits using static and mutable at the same time, so two overloads are enough.
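A minimal, compilable sketch of how such macros read at a call site. This is an abridged variant with a single GNU-style branch that places the attribute after the modifier (the standard attribute position), not the exact macros above:

```cpp
// Abridged sketch: one GNU-style branch (GCC and Clang both accept a GNU
// attribute after the lambda declarator) plus a no-op fallback.
#if defined(__GNUC__) || defined(__clang__)
#define LAMBDA_INLINE_ARGS(...) (__VA_ARGS__) __attribute__((always_inline))
#define LAMBDA_INLINE_ARGS_1(m1, ...) (__VA_ARGS__) m1 __attribute__((always_inline))
#else
#define LAMBDA_INLINE_ARGS(...) (__VA_ARGS__)
#define LAMBDA_INLINE_ARGS_1(m1, ...) (__VA_ARGS__) m1
#endif

int demo() {
    // No trailing modifier: the parameters go straight through the macro.
    auto add = [] LAMBDA_INLINE_ARGS(int a, int b) { return a + b; };
    // One trailing modifier (noexcept here) is passed as the first macro argument.
    auto inc = [] LAMBDA_INLINE_ARGS_1(noexcept, int x) { return x + 1; };
    return add(2, 3) + inc(4);
}
```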

@vasil-pashov

I'm on the fence if we should apply the inlining everywhere or only in places where we know it's beneficial. I'm leaning towards doing only in certain places but it's not a hard opinion.


IvoDD commented Oct 15, 2025

I'm on the fence if we should apply the inlining everywhere or only in places where we know it's beneficial. I'm leaning towards doing only in certain places but it's not a hard opinion.

I've only applied the macro to lambdas passed to for_each_enumerated. Since in that case the lambda is called in a very tight loop, I think it is always beneficial to inline (and, as we have seen, the compiler may not inline it without being instructed to).

I also suspect there is a benefit at some of the other call sites, but our existing benchmarks don't stress those codepaths enough.

@IvoDD IvoDD marked this pull request as ready for review October 15, 2025 14:35
@IvoDD IvoDD requested a review from poodlewars as a code owner October 15, 2025 14:35
@IvoDD IvoDD merged commit 4663a6a into master Oct 15, 2025
272 of 274 checks passed
@IvoDD IvoDD deleted the fix-inlining-in-for-each-enumerated branch October 15, 2025 14:35