Bug Fix: Cast UINT32 to INT32 to ensure compatibility with other engines #3529
JeroenSchmidt wants to merge 2 commits into
Hi @JeroenSchmidt - thanks for contributing this PR! This seems consistent with #2799’s precedent of normalizing Arrow integer input to Iceberg’s canonical signed physical type for write compatibility. On the edge case you highlighted: could we add a regression test showing that uint32 values > INT32_MAX fail rather than wrap/truncate?

Thank you @sungwy for having a look.
Rationale for this change
Extends `ArrowProjectionVisitor._cast_if_needed` to handle unsigned-to-signed conversions at the same bit width, specifically uint32 -> int32.

Context:
This is a follow-up to #2799 (which fixed #2791), where uint8/uint16 casting was addressed. That fix only covered widening conversions (`source_width < target_width`), which missed the uint32 case since both uint32 and int32 are 32-bit. Without this cast, Parquet files are written with the `UINT_32` physical type while Iceberg metadata declares `INT_32`, causing Spark to fail on read.

Changes / Are these changes tested?
- `pyiceberg/io/pyarrow.py`: Extended the cast condition to also trigger when the source is an unsigned integer with the same (or smaller) bit width as the signed target
- `tests/io/test_pyarrow.py`: Added a (`pa.uint32()`, `IntegerType()`, `pa.int32()`) test case

Notes
The cast uses `safe=True`, so values exceeding `INT32_MAX` (2^31 - 1) raise an error rather than being silently corrupted.