Skip to content

Conversation

vegarsti
Copy link

@vegarsti vegarsti commented Oct 11, 2025

Which issue does this PR close?

Rationale for this change

This PR implements casting support for RunEndEncoded arrays in Apache Arrow.

  • Any attempt to cast run-end indices to a narrower integer type will fail immediately if it would result in overflow
  • Narrowing conversions (e.g., from Int64 to Int16) will always fail if any values exceed the target type’s bounds
  • Upcasts (e.g., Int16 → Int32 -> Int64) are allowed, as they are lossless.
  • Widening conversions (e.g., from Int16 to Int64) are allowed, as they are inherently lossless

What changes are included in this PR?

Users can now cast RunEndEncoded arrays using the standard arrow_cast::cast() function

  1. run_end_encoded_cast(): Casts values within existing RunEndEncoded arrays to different types
  2. cast_to_run_end_encoded(): Converts regular arrays to RunEndEncoded format with run-end encoding
  3. Updated can_cast_types() to support RunEndEncoded compatibility rules. Downcasting is not allowed.

Are these changes tested?

Yes!

Are there any user-facing changes?

No breaking changes, just new functionality

@github-actions github-actions bot added the arrow Changes to the arrow crate label Oct 11, 2025
@vegarsti vegarsti changed the title Casting to/from RunEndEncoded arrays Casting support for RunEndEncoded arrays Oct 11, 2025
@vegarsti vegarsti force-pushed the cast-run-end-encoded-arrays branch 2 times, most recently from 87e543d to f9ae6f9 Compare October 11, 2025 06:25
@vegarsti
Copy link
Author

Raised this PR to get Richard Baah's excellent work over the line! cc @albertlockett @brancz @alamb @Rich-T-kid

Copy link
Contributor

@brancz brancz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're getting close to the finish line with these changes!

(false, false) => match value_type {
// Primitive types
DataType::Boolean => {
cast_array.as_boolean().value(i) == cast_array.as_boolean().value(i - 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we invert all these comparisons instead? As in switch on value types, and then cast once and perform all iterations. I find this both hard to read, and we probably perform extra unnecessary work because of that (it's possible the compiler might optimize it away, but I don't think so).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent idea!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a quick and dirty one now in 88c0d8a, there might be a common trait I could use for arrays to avoid repeating a lot of the code - I haven't found that yet.

Casting then iterating instead of casting within the loop is a very good idea, sounds obvious after hearing it!

Copy link
Author

@vegarsti vegarsti Oct 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a second iteration this morning in 2358010 (I've replaced the commit above). Tried to follow a somewhat similar approach as is done with dictionaries.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a macro in 692f6ea

rich-t-kid-datadog and others added 10 commits October 12, 2025 07:49
Implement casting between REE arrays and other Arrow types. REE-to-REE casting
validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent
invalid sequences.
Implement casting between REE arrays and other Arrow types. REE-to-REE casting
validates run-end upcasts only (Int16→Int32, Int16→Int64, Int32→Int64) to prevent
invalid sequences.

rebased changes
@vegarsti vegarsti force-pushed the cast-run-end-encoded-arrays branch from 88c0d8a to 2358010 Compare October 12, 2025 05:49
@tustvold
Copy link
Contributor

tustvold commented Oct 12, 2025

Is there some way we can avoid the quadratic codegen with code paths parameterized on both run end type and value type? Perhaps it'd be possible to identify where the transitions are, perhaps using the comparison kernels and comparing the array with a slice offset by one, and then use this to construct the indexes and a filter to construct the values array?

Have we done any empirical quantification into the impact this has on code bloat / compile times?

Edit: https://docs.rs/arrow-ord/latest/arrow_ord/partition/fn.partition.html is the function I'm thinking of.

@vegarsti
Copy link
Author

vegarsti commented Oct 13, 2025

Have we done any empirical quantification into the impact this has on code bloat / compile times?

I have not! Happy to do that though. Any pointers to how you'd like me to do that, from previous PRs for example? Or does a basic comparison of compile time and binary size on main and this branch suffice?

@tustvold
Copy link
Contributor

Or does a basic comparison of compile time and binary size on main and this branch suffice?

Just this, quadratic codegen is typically severe enough to be easily measurable.

@vegarsti
Copy link
Author

The compile time increased by 2 seconds.

         cargo build --release
main     569.35s user 23.69s system 863% cpu 1:08.66 total
branch   567.33s user 23.96s system 891% cpu 1:06.33 total

The size of libarrow_cast.rlib increased by 279kb (3.82%)

         libarrow_cast.rlib size
main     7,316,832
branch   7,596,568

@tustvold
Copy link
Contributor

tustvold commented Oct 13, 2025

Yeah... That's quite bad for a single kernel, especially given the relatively niche usage of RunEndEncodedArrays, I hope you can understand that we need to be careful to keep this under control.

What did you think of my suggestion about using the partition kernel to compute the run ends? It might actually be faster and would largely eliminate the additional codegen.

It would mean making arrow-cast depend on arrow-ord, which is a bit meh, but perhaps unavoidable. It could possibly be a feature flag. 🤔

@vegarsti
Copy link
Author

vegarsti commented Oct 13, 2025

I understand! I haven't had time to look into your suggestion, but I will.

Out of curiosity, though, the approach in this PR seems quite similar to the code for dictionaries. Does the dictionary code similarly bloat the binary, and if so, why is that acceptable but not for REE?

@tustvold
Copy link
Contributor

tustvold commented Oct 13, 2025

Out of curiosity, though, the approach in this PR seems quite similar to the code for dictionaries. Does the dictionary code similarly bloat the binary, and if so, why is that acceptable but not for REE?

Dictionaries run into similar challenges, and a lot of effort has been expended trying to mitigate the bloat they cause. For example #3616 #4705 #4701 to name a few. Ultimately it's a compromise, there isn't a way to avoid this bloat and support dictionaries so we pay the tax, with run-end encoded arrays the tax isn't necessary and so it is better we don't pay it.

@vegarsti
Copy link
Author

Out of curiosity, though, the approach in this PR seems quite similar to the code for dictionaries. Does the dictionary code similarly bloat the binary, and if so, why is that acceptable but not for REE?

Dictionaries run into similar challenges, and a lot of effort has been expended trying to mitigate the bloat they cause. For example #3616 #4705 #4701 to name a few. Ultimately it's a compromise, there isn't a way to avoid this bloat and support dictionaries so we pay the tax, with run-end encoded arrays the tax isn't necessary and so it is better we don't pay it.

Thanks for the context!

@brancz
Copy link
Contributor

brancz commented Oct 13, 2025

We're talking about the pack_runs macro right? I realize it's nice as a macro, but it also seems fine to just write out by hand.

@tustvold
Copy link
Contributor

The fact it is a macro is not the issue here, the problem is code generation based with complexity <Number of Index Types> * <Number of Value Types> which results in slow compilation and binary bloat. Whether this is achieved with macros, generics or copy paste doesn't change this 😄

@brancz
Copy link
Contributor

brancz commented Oct 14, 2025

Got it. The way I see it, there are two paths. Either:

  1. We manage to get it to work with arrow_ord's partition and avoid the amount of code required altogether (whether by macro or not)
  2. We do it via macro but put ree casting behind a feature flag, so users who wants this can opt into it (although it feels a little strange to hide features that are very reasonable and available for other types behind a feature flag, but I don't feel strongly one way or another as we'd just enable it and move on).

@vegarsti can you give the arrow-ord partitioning a try so we understand whether this would be a workable path?

@vegarsti
Copy link
Author

vegarsti commented Oct 14, 2025

Yeah, I will give the arrow-ord partitioning a try. Some time this week! It seems like a good approach, thanks @tustvold!

As for the feature flag, to me that seems a bit complicated - either the REE type should be supported or not, imo? Also, unless I'm missing something, whether to put this in a feature flag would apply to the whole REE epic #3520, so that should (eventually) be raised there.

It would be great with some guidelines for the arrow-rs project with regard to the tradeoff between features and size/compile times. I'm guessing opinions might vary a bit between maintainers as well. Guidelines might make it easier to come to alignment in such discussions.

In any case, maybe we get around this issue with the arrow-ord approach 🙏🏻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants