Skip to content

Conversation

Fokko
Copy link
Contributor

@Fokko Fokko commented May 8, 2025

Rationale for this change

Resolves #1979

Are these changes tested?

Are there any user-facing changes?

@Fokko
Copy link
Contributor Author

Fokko commented May 8, 2025

CI failing due to the release of snowballstemmer: https://pypi.org/project/snowballstemmer/#history

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the fix. Added a few notes to the review for my future self and other reviewers :)

with fo.create(overwrite=True) as fos:
with pq.ParquetWriter(fos, schema=arrow_table.schema, **parquet_writer_kwargs) as writer:
with pq.ParquetWriter(
fos, schema=arrow_table.schema, store_decimal_as_integer=True, **parquet_writer_kwargs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"""
By default, this is DISABLED and all decimal types annotate fixed_len_byte_array. When enabled, the writer will use the following physical types to store decimals:

  • int32: for 1 <= precision <= 9.
  • int64: for 10 <= precision <= 18.
  • fixed_len_byte_array: for precision > 18.
    """

from https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html
Screenshot 2025-05-09 at 8 35 41 AM

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this matches the parquet data type mapping for decimal
https://iceberg.apache.org/spec/#parquet

Screenshot 2025-05-09 at 8 36 40 AM

@kevinjqliu
Copy link
Contributor

FAILED tests/io/test_pyarrow_stats.py::test_metrics_primitive_types - pyarrow.lib.ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: decimal32(5, 2)

Looks like arrow doesnt have decimal32 support yet
apache/arrow#25483
apache/arrow#43956

@Fokko
Copy link
Contributor Author

Fokko commented May 10, 2025

@kevinjqliu Good call, looking at the code, it seems like it will automatically map it to INT32/INT64: https://github.com/apache/arrow/blob/598938711a8376cbfdceaf5c77ab0fd5057e6c02/cpp/src/parquet/arrow/schema.cc#L380-L392

@Fokko Fokko requested a review from kevinjqliu May 10, 2025 20:58
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for adding the comments.

I double checked that decimal128 also errors for precision >38

>>> pyarrow.decimal128(38+1, 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/types.pxi", line 4718, in pyarrow.lib.decimal128
  File "pyarrow/types.pxi", line 4768, in pyarrow.lib.decimal128
ValueError: precision should be between 1 and 38

@kevinjqliu kevinjqliu merged commit 260ef54 into apache:main May 11, 2025
10 checks passed
@Fokko Fokko deleted the fd-fix-decimal branch May 12, 2025 07:17
gabeiglio pushed a commit to Netflix/iceberg-python that referenced this pull request Aug 13, 2025
# Rationale for this change

Resolves apache#1979

# Are these changes tested?

# Are there any user-facing changes?

<!-- In the case of user-facing changes, please add the changelog label.
-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

While inserting decimal: ValueError: Unexpected physical type FIXED_LEN_BYTE_ARRAY

2 participants