-
Notifications
You must be signed in to change notification settings - Fork 1k
fix(parquet): converting parquet schema with backward compatible repeated struct/primitive with provided arrow schema #8496
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ated struct/primitive with provided arrow schema closes: - apache#8495
…primitive-with-inferred-schema # Conflicts: # parquet/src/arrow/schema/complex.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @rluvaton -- I took a quick review of this PR and the code looks reasonable to me, but I don't understand the legacy inferring logic / problem so I can't really review this PR fully yet
Can someone help me out with a link / document that describes what the legacy inferring is? Is it https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/LogicalTypes.md?plain=1#L718-L723
Poking around that file, it looks like @etseidl may know something about this as he authored several commits, for example
parquet/src/arrow/schema/complex.rs
Outdated
/// Converts `self` into an arrow list, with its current type as the field type | ||
/// accept an optional `list_data_type` to specify the type of list to create | ||
/// | ||
/// This is used to convert deprecated repeated columns (not in a list), into their arrow representation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any reference we can add a link to (I am not familiar with the "deprecated repeated columns" you are referencing here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exactly, updated
parquet/src/arrow/schema/complex.rs
Outdated
| Some(DataType::LargeList(field_hint)) | ||
| Some(DataType::FixedSizeList(field_hint, _)) => Some(field_hint.as_ref()), | ||
Some(_) => unreachable!( | ||
"should be validated earlier that list_data_type is only a type of list" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even though this should be impossible, I worry about panic'ing here because if there is a bug that error is more severe than "just an error"
ALso, while this may be true at the moment, I can imagine that some future refactor messes it up, in which case this may become reachable and the compiler won't complain
I would prefer returning a general_err!
with some sort of "Internal error: should be validated..." type message
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
I will try and find time tomorrow to review this |
Which issue does this PR close?
Rationale for this change
Fix reading old parquet files
What changes are included in this PR?
tests and the fix, but mostly tests.
Are these changes tested?
yes
Are there any user-facing changes?
No