Acero streaming join support #46370
Replies: 1 comment
-
Hi @severinson , I believe Acero does internally do (partially [1]) streaming join. I'm not very familiar with Java binding but you can try to switch the the sides of the tables (and subsequently, if you are doing outer join, change the join type from left to right or vice versa). [1] The right side of the join is chosen to be the "build" side, that is, this table is used to build a hash table to be later probed by the left side. Building the hash table requires full presence of the right side data thus is memory intensive. The left table, on the other hand, is processed in a streaming fashion, because every batch (a subset of the table data) can produce a corresponding result (a subset of the full result). Therefore it is much more efficient to use the small table on the right. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Hey. Thanks again for the help in my last question @amoeba @westonpace. I worked around the issue by converting my problem to a join on a single column.
However, I have another question: when doing an inner join on two datasets, Acero tries to load the entire left and right table into memory. In what circumstances (if any) could I expect Acero to do a streaming join where only one of the two datasets is loaded into memory and the other is iterated over in chunks?
I’m assuming it tried to load the entire dataset into memory since my program crashes due to trying to allocate more than 32G. The failing allocation comes from within the executeSerializedPlan native method.
I’m calling Acero from Java using its jni bindings and the Acero Substrait consumer. I’m using Arrow 18. I’m joining a left dataset of about 200MB with a right dataset of about 200GB on a UInt8 column. I provide both datasets as Java ArrowReader objects to the Acero Substrait consumer.
I appreciate any help. Thanks :)
Beta Was this translation helpful? Give feedback.
All reactions