-
Notifications
You must be signed in to change notification settings - Fork 1k
[Variant] Define new shred_variant function #8366
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
CC @alamb -- very interested in your thoughts here. |
/// Builder for converting variant values to primitive Arrow arrays. It is used by both | ||
/// `VariantToArrowRowBuilder` (below) and `VariantToShreddedPrimitiveVariantRowBuilder` (in | ||
/// `shred_variant.rs`). | ||
pub(crate) enum PrimitiveVariantToArrowRowBuilder<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love splitting this out, but it seemed better than copying the whole thing for both parent builders that use the same logic? Very open to ideas for a better approach.
(the new definition of VariantToArrowRowBuilder
is at L95 below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think traits is the way to reduce some of this boiler plate, but that comes with its own downsides as you have previously.
Another thought I have is why not unify them all into a single enum (why wrap the PrimitiveVariantToArrowRowBuilder
🤔 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not unify them all into a single enum
I started out that way, but both builders handle primitive types separately from complex types, and the handling of complex types is different in the two type trees:
PrimitiveVariantToArrowRowBuilder
VariantToShreddedPrimitiveVariantRowBuilder::typed_value
VariantToShreddedVariantRowBuilder::Primitive
- NOTE: The corresponding
Object
enum variant needs to (partially) shred rows intovalue
andtyped_value
columns according to the shredding schema.
VariantToArrowRowBuilder::Primitive
- NOTE: The corresponding
Object
enum variant would eventually need to convert rows into structs, blowing up (or producing NULL) if any field has the wrong type or if there are any unexpected fields.
path: VariantPath<'a>, | ||
data_type: Option<&'a DataType>, | ||
/// Creates a primitive row builder, returning Err if the requested data type is not primitive. | ||
pub(crate) fn make_primitive_variant_to_arrow_row_builder<'a>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A new helper for creating the new primitive row builder enum
(the original helper moved to L175 below)
} | ||
|
||
fn append_value(&mut self, value: &Variant<'_, '_>) -> Result<bool> { | ||
fn append_value(&mut self, value: Variant<'_, '_>) -> Result<bool> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that the primitive row builder is separate a separate API from the normal row builder, we can make the latter take &Variant
while this can go back to taking Variant
.
(again below)
CC @codephage2020 and @klion26 since this touches on work you've both done (or are doing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great to me -- thank you @scovich
Once we merge in the stacked PRs I think this will be ready to go -- I had a few suggestions but I think they are all pretty easy
It seems like there is some non trivial follow on work that we should perhaps track in tickets. If you agree I can file them
- Shredding out Lists
- Shredding out Structs
- (maybe) benchmarks for shredding.
For benchmarks, I am persionally very interested in the case of documents like this and shredding out the time
and hostname
columns. I am curious if you know of other potential usecases
{
time: 2025-01-01,
hostname: "host1"
message: "blah"
random_field1: 134
random_field2: 231
}
...
{
time: 2025-01-01,
hostname: "host1"
message: "blah"
random_field1: 134
random_field2: 231
}
|
||
if array.value_field().is_none() { | ||
// all-null case | ||
return Ok(VariantArray::from_parts( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the same as VariantArray.clone() 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it is, good catch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... except that VariantArray doesn't impl Clone
| DataType::ListView(_) | ||
| DataType::LargeListView(_) | ||
| DataType::FixedSizeList(..) => { | ||
// TODO: Special handling for shredded variant arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't really about shredded variant arrays, right? It is more like
// TODO: Special handling for shredded variant arrays | |
// TODO: handling for structured arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I follow? Or maybe we're just respectively looking at it from input vs. output perspective?
After this PR, we will be able to shred primitive variant values (into primitive arrays) and object variant values (into struct arrays), but not (yet) shred array variant values (into one of the various list array types).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I think the wording might be confusing, I'll adjust
/// Builder for converting variant values to primitive Arrow arrays. It is used by both | ||
/// `VariantToArrowRowBuilder` (below) and `VariantToShreddedPrimitiveVariantRowBuilder` (in | ||
/// `shred_variant.rs`). | ||
pub(crate) enum PrimitiveVariantToArrowRowBuilder<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think traits is the way to reduce some of this boiler plate, but that comes with its own downsides as you have previously.
Another thought I have is why not unify them all into a single enum (why wrap the PrimitiveVariantToArrowRowBuilder
🤔 )
I think we have (in no particular order):
|
I think shredding would typically pull in all fields that are common in most/all rows? (I mean you can choose to shred however you want, but any hope of skipping row groups in a parquet file needs stats which needs the column to be shredded). So it might make sense to have each row of data involve a subset of fields with varying names and/or types. Also, some of the rows should occasionally have the "wrong" type wrt the shredding schema. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for this great work.
field.data_type(), | ||
cast_options, | ||
capacity, | ||
top_level, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does top_level
mean the typed_value
not located in the nested-level of the current variant? Do we need to change the value here? Seems top_level
in shred_variant.rs
did not change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think you're on to something. Digging and wil get back to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: It looks like top_level
is a vestige of some early approach, trying to solve a problem that no longer exists. I gave the object shredding unit test a significant upgrade, where it now checks every single null/value of every single column in the shredded schema. That test already exercised the top-level NULL vs. missing nested object field NULL cases, and it continues to pass after I remove the top_level
concept, so I think we're good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing that code was a poor decision in retrospect... it looks like this caused
And reinstating the code fixes the problem:
|
||
// Row 7: Object with only a "wrong" field | ||
assert!(!value_field.is_null(7)); | ||
assert!(score_typed_values.is_null(7)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to assert typed_value_struct.is_null(7)
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not null. All its fields are null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test has been updated to exhaustively check every column.
} | ||
|
||
#[test] | ||
fn test_object_shredding_comprehensive() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test is very nice!
@alamb -- this should be ready for final review+merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @scovich -- I think this is a great building block! I started writing a benchmark and hit the fact this kernel doesn't yet support shredding out columns as Utf8/Utf8View
typed_value: Option<ArrayRef>, | ||
) -> Result<Self, ArrowError> { | ||
/// Create a new `ShreddingState` from the given fields | ||
pub fn new(value: Option<BinaryViewArray>, typed_value: Option<ArrayRef>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amusingly, I did this exact change in #8392 too.
))); | ||
} | ||
_ => { | ||
return Err(ArrowError::InvalidArgumentError(format!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to write a benchmark for shredding and I hit the fact that I can't shred Utf8 columns (which is fine, we'll do it as a follow on PR) but I will file a ticket to track
I am just going to merge this one to avoid conflicts with concurrent PRs @scovich do you have a list of follow on tasks needed after this PR to complete |
Follow on PR with a benchmark is here: |
Several that leap to mind for
(there are probably more, but those should be enough to kick off at least a few more PR) For variant_get
|
# Which issue does this PR close? - Fast-follow for #8366 - Related to #8392 # Rationale for this change Somehow, #8392 exposes a latent bug in #8366, which has improper NULL handling for shredded object fields. The shredding PR originally attempted to handle this case, but somehow the test did not trigger the bug and so the (admittedly incomplete) code was removed. See #8366 (comment). To be honest, I have no idea how the original ever worked correctly, nor why the new PR is able to expose the problem. # What changes are included in this PR? When used as a top-level builder, `VariantToShreddedVariantRowBuilder::append_null` must append NULL to its own `NullBufferBuilder`; but when used as a shredded object field builder, it must append non-NULL. Plumb a new `top_level` parameter through the various functions and into the two sub-builders it relies on, so they can implement the correct semantics. # Are these changes tested? In theory, yes (I don't know how the object shredding test ever passed). And it fixes the breakage in #8392. # Are there any user-facing changes? No
Which issue does this PR close?
shred_variant
function #8361Rationale for this change
See ticket.
What changes are included in this PR?
Define a new
shred_variant
function and implement support for structs and a subset of primitive types.Are these changes tested?
Yes, extensive new unit tests
Are there any user-facing changes?
The new function is public.