
Parquet reader forces Decimal32/64 columns to Decimal128 #8549

@scovich

Description


Describe the bug

The parquet reader always widens INT32 (Decimal) and INT64 (Decimal) columns to Decimal128. A reader might be able to request the narrower types explicitly via a read schema, but anyone who simply reads back what was written to disk gets a different type than what was written.

To Reproduce

Several unit tests expect the current behavior and started failing when I changed the parquet reader to preserve the narrower types (a standalone reproduction sketch follows the list):

test arrow::array_reader::primitive_array::tests::test_primitive_array_reader_decimal_types ... FAILED
test arrow::arrow_reader::tests::test_decimal ... FAILED
test arrow::arrow_reader::tests::test_arbitrary_decimal ... FAILED
test arrow::arrow_reader::tests::test_read_decimal_file ... FAILED
test arrow::arrow_writer::tests::arrow_writer_decimal128_dictionary ... FAILED
test arrow::arrow_writer::tests::arrow_writer_decimal ... FAILED
test arrow::arrow_writer::tests::arrow_writer_decimal256_dictionary ... FAILED
test arrow::arrow_writer::tests::arrow_writer_decimal64_dictionary ... FAILED
test arrow::schema::tests::test_arrow_schema_roundtrip ... FAILED
test arrow::schema::tests::test_decimal_fields ... FAILED
test arrow::schema::tests::test_column_desc_to_field ... FAILED
test statistics::test_decimal128 ... FAILED
test statistics::test_decimal_256 ... FAILED
test statistics::test_decimal64 ... FAILED
test statistics::test_data_page_stats_with_all_null_page ... FAILED
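
For a reproduction outside the test suite, a minimal sketch might look like the following. It assumes a recent arrow-rs with `Decimal32` support plus the `arrow-array`, `arrow-schema`, `bytes`, and `parquet` crates; the column name, precision, and scale are arbitrary choices for illustration.

```rust
use std::sync::Arc;

use arrow_array::{Decimal32Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use bytes::Bytes;
use parquet::arrow::{arrow_reader::ParquetRecordBatchReaderBuilder, ArrowWriter};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Declare and build a Decimal32(7, 2) column.
    let schema = Arc::new(Schema::new(vec![Field::new(
        "amount",
        DataType::Decimal32(7, 2),
        false,
    )]));
    let array = Decimal32Array::from(vec![12345, 67890]).with_precision_and_scale(7, 2)?;
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(array)])?;

    // Write to an in-memory parquet file; the column is stored as INT32 (Decimal).
    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;

    // Read it back with the default (file-derived) schema.
    let reader = ParquetRecordBatchReaderBuilder::try_new(Bytes::from(buf))?.build()?;
    for batch in reader {
        // Currently prints Decimal128(7, 2), even though Decimal32(7, 2) was written.
        println!("{:?}", batch?.schema().field(0).data_type());
    }
    Ok(())
}
```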

Expected behavior

If a column was written as INT32 (Decimal) it should come back as Decimal32 unless the user requested something else via the read schema. Likewise, INT64 (Decimal) should come back as Decimal64.
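
As noted above, it might be possible to work around the current behavior by supplying an explicit read schema. Below is a hedged sketch of that approach using `ArrowReaderOptions::with_schema` on the bytes produced by the reproduction above; I have not verified that the reader actually accepts a Decimal32 override for an INT32 (Decimal) column today.

```rust
use std::sync::Arc;

use arrow_schema::{DataType, Field, Schema};
use bytes::Bytes;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn read_as_decimal32(buf: Vec<u8>) -> Result<(), Box<dyn std::error::Error>> {
    // Ask for Decimal32 explicitly instead of accepting the file-derived schema.
    let requested = Arc::new(Schema::new(vec![Field::new(
        "amount",
        DataType::Decimal32(7, 2),
        false,
    )]));
    let options = ArrowReaderOptions::new().with_schema(requested);
    let reader =
        ParquetRecordBatchReaderBuilder::try_new_with_options(Bytes::from(buf), options)?
            .build()?;
    for batch in reader {
        // If the override is accepted, the field comes back as Decimal32(7, 2).
        println!("{:?}", batch?.schema().field(0).data_type());
    }
    Ok(())
}
```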

Additional context

See #8540 (comment)

The variant shredding decimal integration tests fail by default, because they expect 32- and 64-bit decimals to round trip faithfully through parquet; I had to add a manual casting operation to the VariantArray constructor to compensate.

For the fix I tried (but eventually reverted in favor of the above-mentioned casting), see
https://github.com/apache/arrow-rs/pull/8540/files/cd978f5374503fff3a265bc06ad6ce143a4110f3..5b55f121d7461bd31dfc2d204f8fc303523a3988

The idea was to split the decimal_type helper (which only understood 128- and 256-bit decimals) into decimal_[32|64|128|256]_type helpers, where each helper's name indicates a lower bound on the resulting decimal's width. Too-high precision can still force a wider decimal, though it's not clear to me that this is actually correct -- is e.g. INT32 (Decimal(15, 2)) legal? A sketch of the split follows.
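
Here is a minimal sketch of what that split could look like; the helper names, error variant, and widening rule are my reading of the idea, not the code from the linked PR.

```rust
use arrow_schema::{ArrowError, DataType};

// Maximum decimal precision representable by each width, per the Arrow spec.
const DECIMAL32_MAX_PRECISION: u8 = 9;
const DECIMAL64_MAX_PRECISION: u8 = 18;
const DECIMAL128_MAX_PRECISION: u8 = 38;
const DECIMAL256_MAX_PRECISION: u8 = 76;

/// For a column stored as INT32 (Decimal): start at Decimal32, widen only if the
/// declared precision does not fit.
fn decimal_32_type(precision: u8, scale: i8) -> Result<DataType, ArrowError> {
    match precision {
        p if p <= DECIMAL32_MAX_PRECISION => Ok(DataType::Decimal32(precision, scale)),
        p if p <= DECIMAL64_MAX_PRECISION => Ok(DataType::Decimal64(precision, scale)),
        p if p <= DECIMAL128_MAX_PRECISION => Ok(DataType::Decimal128(precision, scale)),
        p if p <= DECIMAL256_MAX_PRECISION => Ok(DataType::Decimal256(precision, scale)),
        p => Err(ArrowError::ParquetError(format!(
            "decimal precision {p} exceeds the maximum supported precision"
        ))),
    }
}

/// For a column stored as INT64 (Decimal): start at Decimal64, widen only if needed.
fn decimal_64_type(precision: u8, scale: i8) -> Result<DataType, ArrowError> {
    match precision {
        p if p <= DECIMAL64_MAX_PRECISION => Ok(DataType::Decimal64(precision, scale)),
        p if p <= DECIMAL128_MAX_PRECISION => Ok(DataType::Decimal128(precision, scale)),
        p if p <= DECIMAL256_MAX_PRECISION => Ok(DataType::Decimal256(precision, scale)),
        p => Err(ArrowError::ParquetError(format!(
            "decimal precision {p} exceeds the maximum supported precision"
        ))),
    }
}
```

Whether decimal_32_type should silently widen for something like INT32 (Decimal(15, 2)), or reject it outright, is exactly the open question above.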
