-
Notifications
You must be signed in to change notification settings - Fork 1.5k
feat: Support PiecewiseMergeJoin
to speed up single range predicate joins
#16660
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
cc @alamb @ozankabak Seems like you guys were part of the discussion for range joins, this is a nice start to it? @Dandandan @comphead @my-vegetable-has-exploded might be interested |
I will try and review this Pr later this week |
Thanks @jonathanc-n let me first get familiar with this kind of join |
Co-authored-by: Oleks V <[email protected]>
let left_indices = (0..left_size as u64).collect::<UInt64Array>(); | ||
let right_indices = (0..left_size) | ||
.map(|idx| left_bit_map.get_bit(idx).then_some(0)) | ||
.collect::<UInt32Array>(); | ||
return (left_indices, right_indices); | ||
} | ||
let left_indices = if join_type == JoinType::LeftSemi { | ||
let left_indices = if join_type == JoinType::LeftSemi | ||
|| (join_type == JoinType::RightSemi && piecewise) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so piecewise works only together with RightSemi?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it falls through with left semi as well. In left and right semi/anti/mark join we use the bitmap to mark all matched sides on the buffered side (this is done in process_unmatched_buffered_batch
), we use the flag to only allow right semi/anti/mark to follow through when calling from the piecewise join. Usually the bitmap is only used to mark the unmatched rows on left side, which is why it originally only holds support for Left semi/anti/mark. I'll add a comment at the beginning of process unmatched buffered batch to explain this.
Rationale for this change
PiecewiseMergeJoin
is a nice pre cursor to the implementation of ASOF, inequality, etc. joins (multiple range predicates).PiecewiseMergeJoin
is specialized for when there is only one range filter and can perform much faster in this case especially for semi, anti, mark joins.What changes are included in this PR?
PiecewiseMergeJoin
implementation only, there is nophysical planner -> PiecewiseMergeJoinExec
.ExecutionPlan
has been implemented forPiecewiseMergeJoinExec
compute_properties
andswap_inputs
is not implementedPiecewiseMergeJoinStream
has been implemented for the actual batch emission logicExamples have been provided for the
PiecewiseMergeJoinExec
andPiecewiseMergeJoinStream
implementations.Benchmark Results
The benchmarks were tested on a random batch of values (streamed side) against a sorted batch (buffered side).
When compared to NestedLoopJoin the queries for classic joins (left, right, inner, full) were about 10x faster 🚀
For existence joins (semi, anti), the join performed about 1000 x faster 🚀
Benchmark Results for normal joins
Benchmark Results for existence joins
If you want to replicate:
Code
Here’s the hidden content that you can put Markdown in,
including lists, code blocks, images, etc.
Next Steps
Pull request was getting large, here are the following steps for this:
Are these changes tested?
Yes unit tests