[Variant] Define new shred_variant function #8366

scovich · 2025-09-17T05:37:27Z

Which issue does this PR close?

Closes [Variant] Implement a shred_variant function #8361

Rationale for this change

See ticket.

What changes are included in this PR?

Define a new shred_variant function and implement support for structs and a subset of primitive types.

Are these changes tested?

Yes, extensive new unit tests

Are there any user-facing changes?

The new function is public.

scovich · 2025-09-17T05:38:02Z

CC @alamb -- very interested in your thoughts here.

scovich · 2025-09-17T11:29:51Z

parquet-variant-compute/src/variant_to_arrow.rs

+/// Builder for converting variant values to primitive Arrow arrays. It is used by both
+/// `VariantToArrowRowBuilder` (below) and `VariantToShreddedPrimitiveVariantRowBuilder` (in
+/// `shred_variant.rs`).
+pub(crate) enum PrimitiveVariantToArrowRowBuilder<'a> {


I don't love splitting this out, but it seemed better than copying the whole thing for both parent builders that use the same logic? Very open to ideas for a better approach.

(the new definition of VariantToArrowRowBuilder is at L95 below)

I think traits are the way to reduce some of this boilerplate, but that comes with its own downsides, as you have noted previously.

Another thought I have is why not unify them all into a single enum (why wrap the PrimitiveVariantToArrowRowBuilder 🤔 )

why not unify them all into a single enum

I started out that way, but both builders handle primitive types separately from complex types, and the handling of complex types is different in the two type trees:

- PrimitiveVariantToArrowRowBuilder
  - VariantToShreddedPrimitiveVariantRowBuilder::typed_value
    - VariantToShreddedVariantRowBuilder::Primitive
      - NOTE: The corresponding Object enum variant needs to (partially) shred rows into value and typed_value columns according to the shredding schema.
  - VariantToArrowRowBuilder::Primitive
    - NOTE: The corresponding Object enum variant would eventually need to convert rows into structs, blowing up (or producing NULL) if any field has the wrong type or if there are any unexpected fields.
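A minimal sketch of the layering described above, assuming heavily simplified stand-in types: `Value`, `PrimitiveBuilder`, `ToArrowBuilder`, and `ShreddedPrimitiveBuilder` are hypothetical names, not the crate's real definitions. It only illustrates why one shared primitive builder can sit under two parents that treat non-primitive (or mismatched) values differently.

```rust
// Hypothetical, simplified stand-ins; these are NOT the crate's real types.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Object(Vec<(String, Value)>),
}

/// Shared primitive logic: collects i64 values and appends NULL on a type mismatch.
#[derive(Default)]
struct PrimitiveBuilder {
    values: Vec<Option<i64>>,
}

impl PrimitiveBuilder {
    /// Returns true if the value was converted, false if NULL was appended instead.
    fn append_value(&mut self, value: &Value) -> bool {
        match value {
            Value::Int(v) => {
                self.values.push(Some(*v));
                true
            }
            _ => {
                self.values.push(None);
                false
            }
        }
    }
}

/// Parent 1: plain conversion to a typed column.
/// A future object case would convert rows into structs here.
enum ToArrowBuilder {
    Primitive(PrimitiveBuilder),
}

impl ToArrowBuilder {
    fn append_value(&mut self, value: &Value) -> bool {
        match self {
            Self::Primitive(b) => b.append_value(value),
        }
    }
}

/// Parent 2: shredding keeps a residual `value` column next to `typed_value`
/// (assumption: values that do not match the target type fall back to `value`).
struct ShreddedPrimitiveBuilder {
    typed_value: PrimitiveBuilder,
    value: Vec<Option<Value>>,
}

impl ShreddedPrimitiveBuilder {
    fn append_value(&mut self, v: Value) {
        if self.typed_value.append_value(&v) {
            self.value.push(None); // fully shredded, nothing left over
        } else {
            self.value.push(Some(v)); // keep the original value unshredded
        }
    }
}
```

Both parents delegate the primitive case to the same builder; only the handling around it differs, which is the reason given above for not collapsing everything into a single enum.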

scovich · 2025-09-17T11:34:19Z

parquet-variant-compute/src/variant_to_arrow.rs

-    path: VariantPath<'a>,
-    data_type: Option<&'a DataType>,
+/// Creates a primitive row builder, returning Err if the requested data type is not primitive.
+pub(crate) fn make_primitive_variant_to_arrow_row_builder<'a>(


A new helper for creating the new primitive row builder enum

(the original helper moved to L175 below)

scovich · 2025-09-17T11:36:16Z

parquet-variant-compute/src/variant_to_arrow.rs

    }

-    fn append_value(&mut self, value: &Variant<'_, '_>) -> Result<bool> {
+    fn append_value(&mut self, value: Variant<'_, '_>) -> Result<bool> {


Now that the primitive row builder is a separate API from the normal row builder, we can make the latter take &Variant while this can go back to taking Variant.

(again below)

parquet-variant-compute/src/shred_variant.rs

scovich · 2025-09-17T13:33:49Z

CC @codephage2020 and @klion26 since this touches on work you've both done (or are doing)

alamb

Looks great to me -- thank you @scovich

Once we merge in the stacked PRs I think this will be ready to go -- I had a few suggestions but I think they are all pretty easy

It seems like there is some non trivial follow on work that we should perhaps track in tickets. If you agree I can file them

Shredding out Lists
Shredding out Structs
(maybe) benchmarks for shredding.

For benchmarks, I am personally very interested in the case of documents like this and shredding out the time and hostname columns. I am curious if you know of other potential use cases

{
  "time": "2025-01-01",
  "hostname": "host1",
  "message": "blah",
  "random_field1": 134,
  "random_field2": 231
}
...
{
  "time": "2025-01-01",
  "hostname": "host1",
  "message": "blah",
  "random_field1": 134,
  "random_field2": 231
}

parquet-variant-compute/src/shred_variant.rs

alamb · 2025-09-17T19:09:00Z

parquet-variant-compute/src/shred_variant.rs

+
+    if array.value_field().is_none() {
+        // all-null case
+        return Ok(VariantArray::from_parts(


Is this the same as VariantArray.clone() 🤔

Yes it is, good catch

... except that VariantArray doesn't impl Clone

alamb · 2025-09-17T19:10:21Z

parquet-variant-compute/src/shred_variant.rs

+        | DataType::ListView(_)
+        | DataType::LargeListView(_)
+        | DataType::FixedSizeList(..) => {
+            // TODO: Special handling for shredded variant arrays


this isn't really about shredded variant arrays, right? It is more like

Suggested change

// TODO: Special handling for shredded variant arrays

// TODO: handling for structured arrays

Not sure I follow? Or maybe we're just respectively looking at it from input vs. output perspective?

After this PR, we will be able to shred primitive variant values (into primitive arrays) and object variant values (into struct arrays), but not (yet) shred array variant values (into one of the various list array types).

Oh, I think the wording might be confusing, I'll adjust
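To make the supported and unsupported cases in this thread concrete, here is a rough sketch, assuming a hypothetical `TargetType` enum in place of the real Arrow `DataType` and a hypothetical `plan_shredding` helper. Primitive targets and struct targets are accepted, and list-like targets return an error, mirroring the TODO in the diff.

```rust
/// Hypothetical, simplified stand-in for the Arrow data types discussed above.
#[derive(Debug, Clone, PartialEq)]
enum TargetType {
    Int64,
    Struct(Vec<(String, TargetType)>),
    List(Box<TargetType>),
}

/// Hypothetical description of how a target type would be shredded.
#[derive(Debug, PartialEq)]
enum ShredPlan {
    /// Primitive variant values go into a primitive array.
    Primitive,
    /// Object variant values go into a struct array, one plan per field.
    Object(Vec<(String, ShredPlan)>),
}

fn plan_shredding(target: &TargetType) -> Result<ShredPlan, String> {
    match target {
        TargetType::Int64 => Ok(ShredPlan::Primitive),
        TargetType::Struct(fields) => {
            let mut planned = Vec::with_capacity(fields.len());
            for (name, ty) in fields {
                // Recurse so that nested structs are planned as well.
                planned.push((name.clone(), plan_shredding(ty)?));
            }
            Ok(ShredPlan::Object(planned))
        }
        // Array variant values into list arrays: not supported yet.
        TargetType::List(_) => {
            Err("shredding variant arrays into lists is not yet supported".to_string())
        }
    }
}
```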

parquet-variant-compute/src/shred_variant.rs

alamb · 2025-09-17T19:25:36Z

parquet-variant-compute/src/variant_to_arrow.rs

+/// Builder for converting variant values to primitive Arrow arrays. It is used by both
+/// `VariantToArrowRowBuilder` (below) and `VariantToShreddedPrimitiveVariantRowBuilder` (in
+/// `shred_variant.rs`).
+pub(crate) enum PrimitiveVariantToArrowRowBuilder<'a> {


I think traits are the way to reduce some of this boilerplate, but that comes with its own downsides, as you have noted previously.

Another thought I have is why not unify them all into a single enum (why wrap the PrimitiveVariantToArrowRowBuilder 🤔 )

parquet-variant-compute/src/shred_variant.rs

parquet-variant-compute/src/variant_array.rs

scovich · 2025-09-18T04:23:33Z

It seems like there is some non trivial follow on work that we should perhaps track in tickets. If you agree I can file them
1. Shredding out Lists

2. Shredding out Structs

3. (maybe) benchmarks for shredding.

I think we have (in no particular order):

- Pathing into shredded lists in variant_get (columnar, fancy array slicing)
  - [Variant] Support Shredded Lists/Array in variant_get #8082
  - [WIP] Support Shredded Lists/Array in variant_get #8354
- Pathing into unshredded lists in variant_get (row-oriented, new builder)
- List support in `shred_variant` (row-oriented, new builder)
- Unshredded struct support in variant_get (row-oriented, new builder)
- Benchmarks for variant_get (pathing into shredded values)
- Benchmarks for variant_get (pathing into unshredded values)
- Benchmarks for shred_variant

scovich · 2025-09-18T04:29:58Z

For benchmarks, I am personally very interested in the case of documents like this and shredding out the time and hostname columns. I am curious if you know of other potential use cases
{
  "time": "2025-01-01",
  "hostname": "host1",
  "message": "blah",
  "random_field1": 134,
  "random_field2": 231
}
  ...
{
  "time": "2025-01-01",
  "hostname": "host1",
  "message": "blah",
  "random_field1": 134,
  "random_field2": 231
}

I think shredding would typically pull in all fields that are common in most/all rows? (I mean you can choose to shred however you want, but any hope of skipping row groups in a parquet file needs stats which needs the column to be shredded). So it might make sense to have each row of data involve a subset of fields with varying names and/or types.

Also, some of the rows should occasionally have the "wrong" type wrt the shredding schema.
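A rough sketch of what such benchmark input could look like, assuming a hypothetical `generate_rows` helper and a tiny hand-rolled generator (`Lcg`) so that no external crates are needed. The field names come from the example above; the 1-in-10 rate for a wrong-typed `time` and the range of extra fields are arbitrary choices made for illustration.

```rust
/// Tiny deterministic pseudo-random generator (an LCG), used only to avoid
/// pulling in an external crate for this sketch.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }
}

/// Hypothetical helper: builds `n` JSON documents that share the common fields
/// `time`, `hostname`, and `message`, carry a varying subset of extra fields,
/// and occasionally use the "wrong" type for `time`.
fn generate_rows(n: usize, seed: u64) -> Vec<String> {
    let mut rng = Lcg(seed);
    (0..n)
        .map(|i| {
            let mut fields = Vec::new();
            if rng.next_u64() % 10 == 0 {
                // Wrong type with respect to a string shredding schema.
                fields.push("\"time\": 20250101".to_string());
            } else {
                fields.push("\"time\": \"2025-01-01\"".to_string());
            }
            fields.push(format!("\"hostname\": \"host{}\"", rng.next_u64() % 4));
            fields.push("\"message\": \"blah\"".to_string());
            // Zero to two extra fields; names are distinct within a row.
            for k in 0..(rng.next_u64() % 3) {
                let name = (i as u64 * 3 + k) % 50;
                fields.push(format!("\"random_field{}\": {}", name, rng.next_u64() % 1000));
            }
            format!("{{{}}}", fields.join(", "))
        })
        .collect()
}
```

A fixed seed keeps the generated rows identical across runs, which helps when comparing benchmark results.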

klion26

LGTM, thanks for this great work.

klion26 · 2025-09-18T05:24:58Z

parquet-variant-compute/src/shred_variant.rs

+                field.data_type(),
+                cast_options,
+                capacity,
+                top_level,


Does top_level mean that the typed_value is not located at a nested level of the current variant? Do we need to change the value here? It seems top_level in shred_variant.rs did not change.

Hmm, I think you're on to something. Digging and will get back to you.

Update: It looks like top_level is a vestige of some early approach, trying to solve a problem that no longer exists. I gave the object shredding unit test a significant upgrade, where it now checks every single null/value of every single column in the shredded schema. That test already exercised the top-level NULL vs. missing nested object field NULL cases, and it continues to pass after I remove the top_level concept, so I think we're good.

Removing that code was a poor decision in retrospect... it looks like this caused

Add Arrow Variant Extension Type, remove Array impl for VariantArray and ShreddedVariantFieldArray #8392 (comment)

And reinstating the code fixes the problem:

[Variant] Fix NULL handling for shredded object fields #8395

klion26 · 2025-09-18T06:23:56Z

parquet-variant-compute/src/shred_variant.rs

+
+        // Row 7: Object with only a "wrong" field
+        assert!(!value_field.is_null(7));
+        assert!(score_typed_values.is_null(7));


Do we need to assert typed_value_struct.is_null(7) here?

It's not null. All its fields are null.

The test has been updated to exhaustively check every column.
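The distinction in this exchange (a struct that is itself non-NULL while all of its fields are NULL) can be modelled with a small sketch. `StructColumn` is a hypothetical, simplified stand-in with a single `score` child, not the Arrow `StructArray` API.

```rust
/// Hypothetical, simplified model of a struct column with one child column.
struct StructColumn {
    /// true = the struct itself is non-NULL at this row.
    validity: Vec<bool>,
    /// Child column, which tracks its own NULLs independently.
    score: Vec<Option<f64>>,
}

impl StructColumn {
    /// Whether the struct value itself is NULL at `row`.
    fn is_null(&self, row: usize) -> bool {
        !self.validity[row]
    }

    /// Whether every child field is NULL at `row` (only `score` in this sketch).
    fn all_fields_null(&self, row: usize) -> bool {
        self.score[row].is_none()
    }
}
```

For a row like row 7 in the test above, `is_null` would be false while `all_fields_null` would be true.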

klion26 · 2025-09-18T06:30:33Z

parquet-variant-compute/src/shred_variant.rs

+    }
+
+    #[test]
+    fn test_object_shredding_comprehensive() {


The test is very nice!

scovich · 2025-09-18T16:02:07Z

@alamb -- this should be ready for final review+merge

alamb

Thank you @scovich -- I think this is a great building block! I started writing a benchmark and hit the fact this kernel doesn't yet support shredding out columns as Utf8/Utf8View

alamb · 2025-09-19T19:23:15Z

parquet-variant-compute/src/variant_array.rs

-        typed_value: Option<ArrayRef>,
-    ) -> Result<Self, ArrowError> {
+    /// Create a new `ShreddingState` from the given fields
+    pub fn new(value: Option<BinaryViewArray>, typed_value: Option<ArrayRef>) -> Self {


Amusingly, I did this exact change in #8392 too.

alamb · 2025-09-19T19:50:57Z

parquet-variant-compute/src/variant_to_arrow.rs

            )));
        }
+        _ => {
+            return Err(ArrowError::InvalidArgumentError(format!(


I tried to write a benchmark for shredding and I hit the fact that I can't shred Utf8 columns (which is fine, we'll do it as a follow on PR) but I will file a ticket to track

alamb · 2025-09-19T19:55:23Z

I am just going to merge this one to avoid conflicts with concurrent PRs

@scovich do you have a list of follow on tasks needed after this PR to complete shred_variant? I can file tickets to track if so.

alamb · 2025-09-19T20:21:10Z

Follow on PR with a benchmark is here:

[Variant] Benchmark for shred_variant kernels #8394

scovich · 2025-09-19T21:50:45Z

@scovich do you have a list of follow on tasks needed after this PR to complete shred_variant? I can file tickets to track if so.

Several that leap to mind for shred_variant:

variant string to utf8 (all umpteen flavors)
variant array to list (all five flavors)
variant binary to binary (all several flavors)
the various variant primitive types to their arrow counterparts

(there are probably more, but those should be enough to kick off at least a few more PRs)

For variant_get

Extracting a struct from binary variant object (we can already extract a shredded struct, but will blow up if the path ends inside a binary variant column)

# Which issue does this PR close?

- Fast-follow for #8366
- Related to #8392

# Rationale for this change

Somehow, #8392 exposes a latent bug in #8366, which has improper NULL handling for shredded object fields. The shredding PR originally attempted to handle this case, but somehow the test did not trigger the bug and so the (admittedly incomplete) code was removed. See #8366 (comment).

To be honest, I have no idea how the original ever worked correctly, nor why the new PR is able to expose the problem.

# What changes are included in this PR?

When used as a top-level builder, `VariantToShreddedVariantRowBuilder::append_null` must append NULL to its own `NullBufferBuilder`; but when used as a shredded object field builder, it must append non-NULL. Plumb a new `top_level` parameter through the various functions and into the two sub-builders it relies on, so they can implement the correct semantics.

# Are these changes tested?

In theory, yes (I don't know how the object shredding test ever passed). And it fixes the breakage in #8392.

# Are there any user-facing changes?

No
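The `top_level` semantics described in this fix can be sketched roughly as follows. `ShreddedBuilderSketch` is a hypothetical, simplified stand-in for the real builder; the struct-level validity rule comes from the description above, while what gets appended to the `value` and `typed_value` children on a NULL row is an assumption made for this sketch.

```rust
/// Hypothetical, simplified stand-in for the shredded row builder's null tracking.
struct ShreddedBuilderSketch {
    top_level: bool,
    /// true = this row of the (value, typed_value) struct is non-NULL.
    validity: Vec<bool>,
    value: Vec<Option<Vec<u8>>>,
    typed_value: Vec<Option<i64>>,
}

impl ShreddedBuilderSketch {
    fn new(top_level: bool) -> Self {
        Self {
            top_level,
            validity: Vec::new(),
            value: Vec::new(),
            typed_value: Vec::new(),
        }
    }

    fn append_null(&mut self) {
        // Top-level builder: the row itself is NULL, so the struct is NULL.
        // Shredded object field builder: the struct stays non-NULL.
        self.validity.push(!self.top_level);
        // Assumption for this sketch: both children are NULL in either case.
        self.value.push(None);
        self.typed_value.push(None);
    }
}
```

Passing the flag at construction time lets the same `append_null` serve both roles without the caller needing to know where the builder sits in the schema.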

scovich added 10 commits September 16, 2025 04:03

[Variant] Add constants for empty variant metadata

ab77733

make const instead of static

428aae1

[Variant] Implement new VariantValueArrayBuilder

97f99da

fix doctest

fb455c0

remove unneeded unused markers

2238b49

doc fixes

474fa31

Merge remote-tracking branch 'oss/main' into variant-value-builder

5959577

review feedback

f286c52

binary variant row builder uses variant value builder

a476c72

[Variant] Define new shred_variant function

7640d50

github-actions bot added the parquet-variant parquet-variant* crates label Sep 17, 2025

scovich mentioned this pull request Sep 17, 2025

[Variant] Implement new VariantValueArrayBuilder #8360

Merged

self review fixes

d5349d6

scovich commented Sep 17, 2025


alamb reviewed Sep 17, 2025


scovich added 2 commits September 17, 2025 22:07

review feedback

7778942

Merge remote-tracking branch 'oss/main' into shred-variant

4641d66

klion26 approved these changes Sep 18, 2025


scovich added 4 commits September 18, 2025 00:37

review feedback

d007b88

Merge remote-tracking branch 'oss/main' into shred-variant

eaf630f

remove top_level

f6001e8

fmt

370dac1

scovich requested a review from alamb September 18, 2025 16:01

fix unit test

8af3810

fmt

cf862e4

alamb approved these changes Sep 19, 2025


alamb merged commit ca8e31e into apache:main Sep 19, 2025
18 checks passed

alamb mentioned this pull request Sep 19, 2025

[Variant] Benchmark for shred_variant kernels #8394

Draft

This was referenced Sep 19, 2025

[Variant] Add low level support for shredding and unshredding #7715

Closed

Add Arrow Variant Extension Type, remove Array impl for VariantArray and ShreddedVariantFieldArray #8392

Merged

scovich mentioned this pull request Sep 20, 2025

[Variant] Fix NULL handling for shredded object fields #8395

Merged

alamb mentioned this pull request Sep 21, 2025

[Variant] API to construct Shredded Variant Arrays #7895

Closed
