-
Notifications
You must be signed in to change notification settings - Fork 151
[18168355911] Improve lambda inlining when using for_each_enumerated
#2704
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
8188afc to
2cccfae
Compare
This should fix the performance regression seen in #2700 In a follow up commit we will improve performance in other pipeline processing which uses `for_each_enumerated`.
2cccfae to
3028dfe
Compare
3028dfe to
61798e5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could make this less verbose by defining everything as it is here, and then two additional macros:
#define ARCTICDB_LAMBDA_INLINE(args) ARCTICDB_LAMBDA_INLINE_PRE(args) ARCTICDB_LAMBDA_INLINE_MID ARCTICDB_LAMBDA_INLINE_POST
#define ARCTICDB_LAMBDA_INLINE_NOEXCEPT(args) ARCTICDB_LAMBDA_INLINE_PRE(args) ARCTICDB_LAMBDA_INLINE_MID noexcept ARCTICDB_LAMBDA_INLINE_POST
| ++unique_offset_count; | ||
| for_each_enumerated<ArcticStringColumnTag>( | ||
| source_column, | ||
| [&] ARCTICDB_LAMBDA_INLINE_PRE(const auto& en) ARCTICDB_LAMBDA_INLINE_MID ARCTICDB_LAMBDA_INLINE_POST { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Writing all those pre/mid/post seems a bit ugly. We can shorten the macro like so:
#if defined(__clang__)
#define LAMBDA_INLINE_ARGS(...) (__VA_ARGS__) __attribute__((always_inline))
#define LAMBDA_INLINE_ARGS_1(m1, ...) (__VA_ARGS__) __attribute__((always_inline)) m1
#define LAMBDA_INLINE_ARGS_2(m1, m2, ...) (__VA_ARGS__) __attribute__((always_inline)) m1 m2
#elif defined(__GNUC__)
#define LAMBDA_INLINE_ARGS(...) [[gnu::always_inline]](__VA_ARGS__)
#define LAMBDA_INLINE_ARGS_1(m1, ...) [[gnu::always_inline]](__VA_ARGS__) m1
#define LAMBDA_INLINE_ARGS_2(m1, m2, ...) [[gnu::always_inline]](__VA_ARGS__) m1 m2
#elif defined(_MSC_VER) && _MSC_VER >= 1927 && __cplusplus >= 202002L
#define LAMBDA_INLINE_ARGS(...) (__VA_ARGS__) [[msvc::forceinline]]
#define LAMBDA_INLINE_ARGS_1(m1, m2, ...) (__VA_ARGS__) m1 m2 [[msvc::forceinline]]
#define LAMBDA_INLINE_ARGS_2(m1, m2, ...) (__VA_ARGS__) m1 m2 [[msvc::forceinline]]
#else
#define LAMBDA_INLINE_ARGS(...) (__VA_ARGS__)
#define LAMBDA_INLINE_ARGS_1(m1, ...) (__VA_ARGS__) m1
#define LAMBDA_INLINE_ARGS_2(m1, m2, ...) (__VA_ARGS__) m1 m2
#endifthe _1 and _2 "overrides" are needed to handle static/mutable/noexcept modifiers. The standard prohibits using static and mutable at the same time so 2 overrides are enough.
|
I'm on the fence if we should apply the inlining everywhere or only in places where we know it's beneficial. I'm leaning towards doing only in certain places but it's not a hard opinion. |
I've only applied the macro to lambdas passed to Also I do suspect there is benefit some of the other callsites but our existing benchmarks don't stress that codepath enough. |
Reference Issues/PRs
Monday ref: 18168355911
What does this implement or fix?
Forces compiler to inline the lambdas everywhere where
for_each_enumeratedis called.This fixes the performance regression seen in #2700 and should improve performance in other places in pipeline processing where
for_each_enumeratedis used (by up to 10-20%).There is a compromise with inlining of course. Inlining the lambdas causes them to be written in assembly twice (once for each branch of the if). This should be negligible however compared to the rest of our binary.
Godbolt links showing how different compilers inline a toy example. Note that with the tiny lambda example
-O3always inlines it, so all the links show the assembly without optimizations. I suspect for our real usecase code is larger and compiler decides not to inline to save space.special_for_eachwhere the two branches contain a copy of the lambda.special_for_eachspecial_for_eachfunctions with the lambda inlined.Any other comments?
Benchmark results (all using
./benchmarks --benchmark_filter="BM_arrow_string.*1/0" --benchmark_min_time=2000x --benchmark_repetitions=5 --benchmark_time_unit=us)Performance before #2700:
Performance after #2700:
Performance after this PR is in line with before #2700:
Checklist
Checklist for code changes...