Skip to content

Conversation

scovich
Copy link
Contributor

@scovich scovich commented Sep 17, 2025

Which issue does this PR close?

Rationale for this change

See ticket.

What changes are included in this PR?

Define a new shred_variant function and implement support for structs and a subset of primitive types.

Are these changes tested?

Yes, extensive new unit tests

Are there any user-facing changes?

The new function is public.

@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Sep 17, 2025
@scovich
Copy link
Contributor Author

scovich commented Sep 17, 2025

CC @alamb -- very interested in your thoughts here.

/// Builder for converting variant values to primitive Arrow arrays. It is used by both
/// `VariantToArrowRowBuilder` (below) and `VariantToShreddedPrimitiveVariantRowBuilder` (in
/// `shred_variant.rs`).
pub(crate) enum PrimitiveVariantToArrowRowBuilder<'a> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love splitting this out, but it seemed better than copying the whole thing for both parent builders that use the same logic? Very open to ideas for a better approach.

(the new definition of VariantToArrowRowBuilder is at L95 below)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think traits is the way to reduce some of this boiler plate, but that comes with its own downsides as you have previously.

Another thought I have is why not unify them all into a single enum (why wrap the PrimitiveVariantToArrowRowBuilder 🤔 )

Copy link
Contributor Author

@scovich scovich Sep 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not unify them all into a single enum

I started out that way, but both builders handle primitive types separately from complex types, and the handling of complex types is different in the two type trees:

  • PrimitiveVariantToArrowRowBuilder
    • VariantToShreddedPrimitiveVariantRowBuilder::typed_value
      • VariantToShreddedVariantRowBuilder::Primitive
      • NOTE: The corresponding Object enum variant needs to (partially) shred rows into value and typed_value columns according to the shredding schema.
    • VariantToArrowRowBuilder::Primitive
    • NOTE: The corresponding Object enum variant would eventually need to convert rows into structs, blowing up (or producing NULL) if any field has the wrong type or if there are any unexpected fields.

path: VariantPath<'a>,
data_type: Option<&'a DataType>,
/// Creates a primitive row builder, returning Err if the requested data type is not primitive.
pub(crate) fn make_primitive_variant_to_arrow_row_builder<'a>(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new helper for creating the new primitive row builder enum

(the original helper moved to L175 below)

}

fn append_value(&mut self, value: &Variant<'_, '_>) -> Result<bool> {
fn append_value(&mut self, value: Variant<'_, '_>) -> Result<bool> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that the primitive row builder is separate a separate API from the normal row builder, we can make the latter take &Variant while this can go back to taking Variant.

(again below)

@scovich
Copy link
Contributor Author

scovich commented Sep 17, 2025

CC @codephage2020 and @klion26 since this touches on work you've both done (or are doing)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me -- thank you @scovich

Once we merge in the stacked PRs I think this will be ready to go -- I had a few suggestions but I think they are all pretty easy

It seems like there is some non trivial follow on work that we should perhaps track in tickets. If you agree I can file them

  1. Shredding out Lists
  2. Shredding out Structs
  3. (maybe) benchmarks for shredding.

For benchmarks, I am persionally very interested in the case of documents like this and shredding out the time and hostname columns. I am curious if you know of other potential usecases

{ 
  time: 2025-01-01,
  hostname: "host1"
  message: "blah"
  random_field1: 134
  random_field2: 231
}
...
{ 
  time: 2025-01-01,
  hostname: "host1"
  message: "blah"
  random_field1: 134
  random_field2: 231
}


if array.value_field().is_none() {
// all-null case
return Ok(VariantArray::from_parts(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the same as VariantArray.clone() 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it is, good catch

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... except that VariantArray doesn't impl Clone

| DataType::ListView(_)
| DataType::LargeListView(_)
| DataType::FixedSizeList(..) => {
// TODO: Special handling for shredded variant arrays
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this isn't really about shredded variant arrays, right? It is more like

Suggested change
// TODO: Special handling for shredded variant arrays
// TODO: handling for structured arrays

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow? Or maybe we're just respectively looking at it from input vs. output perspective?

After this PR, we will be able to shred primitive variant values (into primitive arrays) and object variant values (into struct arrays), but not (yet) shred array variant values (into one of the various list array types).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I think the wording might be confusing, I'll adjust

/// Builder for converting variant values to primitive Arrow arrays. It is used by both
/// `VariantToArrowRowBuilder` (below) and `VariantToShreddedPrimitiveVariantRowBuilder` (in
/// `shred_variant.rs`).
pub(crate) enum PrimitiveVariantToArrowRowBuilder<'a> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think traits is the way to reduce some of this boiler plate, but that comes with its own downsides as you have previously.

Another thought I have is why not unify them all into a single enum (why wrap the PrimitiveVariantToArrowRowBuilder 🤔 )

@scovich
Copy link
Contributor Author

scovich commented Sep 18, 2025

It seems like there is some non trivial follow on work that we should perhaps track in tickets. If you agree I can file them

1. Shredding out Lists

2. Shredding out Structs

3. (maybe) benchmarks for shredding.

I think we have (in no particular order):

  1. Pathing into shredded lists in variant_get (columnar, fancy array slicing)
  2. Pathing into unshredded lists in variant_get (row-oriented, new builder)
  3. List support in `shred_variant' (row-oriented, new builder)
  4. Unshredded struct support in variant_get (row-oriented, new builder)
  5. Benchmarks for variant_get (pathing into shredded values)
  6. Benchmarks for variant_get (pathing into unshredded values)
  7. Benchmarks for shred_variant

@scovich
Copy link
Contributor Author

scovich commented Sep 18, 2025

For benchmarks, I am persionally very interested in the case of documents like this and shredding out the time and hostname columns. I am curious if you know of other potential usecases

{ 
  "time": "2025-01-01",
  "hostname": "host1"
  "message": "blah"
  "random_field1": 134
  "random_field2": 231
}
  ...
{ 
  "time": "2025-01-01",
  "hostname": "host1"
  "message": "blah"
  "random_field1": 134
  "random_field2": 231
}

I think shredding would typically pull in all fields that are common in most/all rows? (I mean you can choose to shred however you want, but any hope of skipping row groups in a parquet file needs stats which needs the column to be shredded). So it might make sense to have each row of data involve a subset of fields with varying names and/or types.

Also, some of the rows should occasionally have the "wrong" type wrt the shredding schema.

Copy link
Member

@klion26 klion26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for this great work.

field.data_type(),
cast_options,
capacity,
top_level,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does top_level mean the typed_value not located in the nested-level of the current variant? Do we need to change the value here? Seems top_level in shred_variant.rs did not change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think you're on to something. Digging and wil get back to you.

Copy link
Contributor Author

@scovich scovich Sep 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: It looks like top_level is a vestige of some early approach, trying to solve a problem that no longer exists. I gave the object shredding unit test a significant upgrade, where it now checks every single null/value of every single column in the shredded schema. That test already exercised the top-level NULL vs. missing nested object field NULL cases, and it continues to pass after I remove the top_level concept, so I think we're good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing that code was a poor decision in retrospect... it looks like this caused

And reinstating the code fixes the problem:


// Row 7: Object with only a "wrong" field
assert!(!value_field.is_null(7));
assert!(score_typed_values.is_null(7));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to assert typed_value_struct.is_null(7) here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not null. All its fields are null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test has been updated to exhaustively check every column.

}

#[test]
fn test_object_shredding_comprehensive() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is very nice!

@scovich scovich requested a review from alamb September 18, 2025 16:01
@scovich
Copy link
Contributor Author

scovich commented Sep 18, 2025

@alamb -- this should be ready for final review+merge

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @scovich -- I think this is a great building block! I started writing a benchmark and hit the fact this kernel doesn't yet support shredding out columns as Utf8/Utf8View

typed_value: Option<ArrayRef>,
) -> Result<Self, ArrowError> {
/// Create a new `ShreddingState` from the given fields
pub fn new(value: Option<BinaryViewArray>, typed_value: Option<ArrayRef>) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amusingly, I did this exact change in #8392 too.

)));
}
_ => {
return Err(ArrowError::InvalidArgumentError(format!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to write a benchmark for shredding and I hit the fact that I can't shred Utf8 columns (which is fine, we'll do it as a follow on PR) but I will file a ticket to track

@alamb
Copy link
Contributor

alamb commented Sep 19, 2025

I am just going to merge this one to avoid conflicts with concurrent PRs

@scovich do you have a list of follow on tasks needed after this PR to complete shred_variant? I can file tickets to track if so.

@alamb alamb merged commit ca8e31e into apache:main Sep 19, 2025
18 checks passed
@alamb
Copy link
Contributor

alamb commented Sep 19, 2025

Follow on PR with a benchmark is here:

@scovich
Copy link
Contributor Author

scovich commented Sep 19, 2025

@scovich do you have a list of follow on tasks needed after this PR to complete shred_variant? I can file tickets to track if so.

Several that leap to mind for shred_variant:

  • variant string to utf8 (all umpteen flavors)
  • variant array to list (all five flavors)
  • variant binary to binary (all several flavors)
  • the various variant primitive types to their arrow counterparts

(there are probably more, but those should be enough to kick off at least a few more PR)

For variant_get

  • Extracting a struct from binary variant object (we can already extract a shredded struct, but will blow up if the path ends inside a binary variant column)

alamb pushed a commit that referenced this pull request Sep 20, 2025
# Which issue does this PR close?

- Fast-follow for #8366
- Related to #8392

# Rationale for this change

Somehow, #8392 exposes a latent
bug in #8366, which has improper
NULL handling for shredded object fields. The shredding PR originally
attempted to handle this case, but somehow the test did not trigger the
bug and so the (admittedly incomplete) code was removed. See
#8366 (comment). To
be honest, I have no idea how the original ever worked correctly, nor
why the new PR is able to expose the problem.

# What changes are included in this PR?

When used as a top-level builder,
`VariantToShreddedVariantRowBuilder::append_null` must append NULL to
its own `NullBufferBuilder`; but when used as a shredded object field
builder, it must append non-NULL. Plumb a new `top_level` parameter
through the various functions and into the two sub-builders it relies
on, so they can implement the correct semantics.

# Are these changes tested?

In theory, yes (I don't know how the object shredding test ever passed).
And it fixes the breakage in
#8392.

# Are there any user-facing changes?

No
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Implement a shred_variant function

3 participants