-
Notifications
You must be signed in to change notification settings - Fork 1k
Description
Describe the bug
The parquet reader always widens INT32 (Decimal)
and INT64 (Decimal)
columns to Decimal128
. It might be possible for a reader to specifically request the narrower types in their read schema, but if they just read what was written to disk they get back a different type than what was written.
To Reproduce
Several unit tests expect the current behavior, and started failing when I changed the parquet reader to preserve types:
test arrow::array_reader::primitive_array::tests::test_primitive_array_reader_decimal_types ... FAILED
test arrow::arrow_reader::tests::test_decimal ... FAILED
test arrow::arrow_reader::tests::test_arbitrary_decimal ... FAILED
test arrow::arrow_reader::tests::test_read_decimal_file ... FAILED
test arrow::arrow_writer::tests::arrow_writer_decimal128_dictionary ... FAILED
test arrow::arrow_writer::tests::arrow_writer_decimal ... FAILED
test arrow::arrow_writer::tests::arrow_writer_decimal256_dictionary ... FAILED
test arrow::arrow_writer::tests::arrow_writer_decimal64_dictionary ... FAILED
test arrow::schema::tests::test_arrow_schema_roundtrip ... FAILED
test arrow::schema::tests::test_decimal_fields ... FAILED
test arrow::schema::tests::test_column_desc_to_field ... FAILED
test statistics::test_decimal128 ... FAILED
test statistics::test_decimal_256 ... FAILED
test statistics::test_decimal64 ... FAILED
test statistics::test_data_page_stats_with_all_null_page ... FAILED
Expected behavior
If it was written as INT32 (Decimal)
it should come back as Decimal32
unless the user requested something else via the read schema. Ditto for INT64 (Decimal)
as Decimal64
.
Additional context
See #8540 (comment)
The variant shredding decimal integration tests fail by default, because they expect 32- and 64-bit decimals to faithfully round trip through parquet; I had to add a manual casting operation VariantArray
constructor to compensate.
For the fix I tried (but eventually reverted in favor of the above-mentioned casting), see
https://github.com/apache/arrow-rs/pull/8540/files/cd978f5374503fff3a265bc06ad6ce143a4110f3..5b55f121d7461bd31dfc2d204f8fc303523a3988
The idea was to split the decimal_type
helper (which only understood 128- and 256-bit decimals) into decimal_[32|64|128|256]_type
helpers, where each helper's name indicates a lower bound on the resulting decimal's width. Too-high precision can still force a wider decimal, tho it's not clear to me that this is actually correct -- is e.g. INT32 (Decimal(15, 2))
legal?