feat(schema): represent, serialize and validate v3 column default values (1/4)#746
feat(schema): represent, serialize and validate v3 column default values (1/4)#746huan233usc wants to merge 9 commits into
Conversation
445f7ae to
4aade00
Compare
| std::shared_ptr<const Literal> initial_default_; | ||
| std::shared_ptr<const Literal> write_default_; |
There was a problem hiding this comment.
ReassignField constructs a new SchemaField via the 5-argument constructor which initializes initial_default_ and write_default_ to nullptr. When schema IDs are reassigned (e.g., copying a schema with fresh IDs via the Schema(get_id) path), all default values on fields are silently lost. We should copy all field properties including initialDefault and writeDefault.
There was a problem hiding this comment.
Good catch, confirmed. Defaults are now constructor args, and ReassignField passes the source field's initial_default_ptr()/write_default_ptr() through, so they're shared with the reassigned field, not lost. Added ReassignIdsPreservesDefaultValues.
| if (initial_default_json.has_value()) { | ||
| ICEBERG_ASSIGN_OR_RAISE(Literal literal, | ||
| LiteralFromJson(*initial_default_json, field.type().get())); | ||
| field = field.WithInitialDefault(std::move(literal)); | ||
| } | ||
| if (write_default_json.has_value()) { | ||
| ICEBERG_ASSIGN_OR_RAISE(Literal literal, | ||
| LiteralFromJson(*write_default_json, field.type().get())); | ||
| field = field.WithWriteDefault(std::move(literal)); | ||
| } |
There was a problem hiding this comment.
The deserialization first constructs a bare SchemaField, then conditionally calls WithInitialDefault/WithWriteDefault, each of which copies the entire field (including the shared_ptr<Type>). This is an unnecessary intermediate copy.
There was a problem hiding this comment.
Agreed — FieldFromJson now parses the defaults first and builds the field in one construction. Intermediate copy gone.
| SchemaField SchemaField::WithInitialDefault(Literal initial_default) const { | ||
| SchemaField copy = *this; | ||
| copy.initial_default_ = std::make_shared<const Literal>(std::move(initial_default)); | ||
| return copy; | ||
| } |
There was a problem hiding this comment.
I don't think it's need to copy the whole SchemaField, can we just set the initial_default_ field and return *this.
Also the following With methods.
There was a problem hiding this comment.
Agreed — moved defaults into the constructor and removed both With... methods, so construction no longer copies the field.
1ee5b32 to
34470af
Compare
There was a problem hiding this comment.
Pull request overview
Adds Iceberg v3 column default value support at the schema layer by carrying, JSON-(de)serializing, validating, and projecting initial-default / write-default literals (foundation for later read/write-path work).
Changes:
- Extend
SchemaFieldto storeinitial-default/write-defaultas shared immutable literals and include them in equality/ID-reassignment rebuilds. - Add JSON serde for
initial-default/write-defaultusing existing single-value literal serialization. - Update schema projection to use
FieldProjection::Kind::kDefaultwhen an expected field is missing but hasinitial-default, and add/extend unit tests + v3 metadata fixture.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/iceberg/test/schema_util_test.cc | Adds projection tests for missing fields with initial-default and for ignoring defaults when the field is present. |
| src/iceberg/test/schema_test.cc | Adds schema validation/version-gating and ID-reassignment preservation tests for default values. |
| src/iceberg/test/schema_json_test.cc | Adds round-trip and failure tests for JSON serialization/parsing of default values (incl. nested structs). |
| src/iceberg/test/resources/TableMetadataV3Valid.json | Adds a v3-valid table metadata JSON fixture. |
| src/iceberg/schema.cc | Preserves defaults during ID reassignment and adds default-related validation in Schema::Validate. |
| src/iceberg/schema_util.cc | Projects missing fields with initial-default as kDefault rather than error/null. |
| src/iceberg/schema_field.h | Extends SchemaField API/storage to carry default literals and expose them via optional reference accessors. |
| src/iceberg/schema_field.cc | Implements default accessors, validation of defaults, and includes defaults in equality. |
| src/iceberg/json_serde.cc | Serializes/parses initial-default / write-default on schema fields via literal single-value serde. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
First of a multi-part split of column default value support (apache#730) — the schema foundation the read and evolution paths build on. Purely additive; no read/write behavior change on its own. - SchemaField carries `initial-default` / `write-default` (immutable std::shared_ptr<const Literal>) with copy-preserving WithInitialDefault / WithWriteDefault modifiers; getters return optional<reference_wrapper>. - JSON serde reads/writes `initial-default` / `write-default` via the existing single-value serialization. - Schema::Validate rejects default values below format v3 and validates they are non-null primitive literals matching the field type. - Generic schema projection maps a column missing from a data file with an initial-default to FieldProjection::Kind::kDefault. Read-path application (Parquet/Avro) and schema evolution follow in separate PRs. See apache#731 for the full end-to-end proof-of-concept.
34470af to
f663e0e
Compare
This file was added by this PR but is referenced by no test (table-metadata resources are loaded by explicit name via ReadTableMetadataFromResource) and carries no column defaults, so it is dead. Remove it.
Add SchemaField default-value coverage directly to json_serde_test (the binary that exercises json_serde.cc): one test pins the initial-default / write-default wire format (incl. their absence when unset), and one round-trips a field with both defaults through ToJson -> FieldFromJson for every primitive type, exercising the non-trivial date/timestamp/decimal/uuid/binary encodings the default path reuses.
…-schema # Conflicts: # src/iceberg/test/json_serde_test.cc
wgtmac
left a comment
There was a problem hiding this comment.
Static review notes; I did not build or run tests.
| std::shared_ptr<const Literal> initial_default; | ||
| if (initial_default_json.has_value()) { | ||
| ICEBERG_ASSIGN_OR_RAISE(Literal literal, | ||
| LiteralFromJson(*initial_default_json, type.get())); |
There was a problem hiding this comment.
Default-value parsing should enforce the JSON single-value rule for timestamptz and timestamptz_ns. The shared timestamp parser accepts offsets such as +05:00, but the spec and Java SingleValueParser only accept UTC (+00:00) for these default values. As written, C++ can accept schema metadata that Java rejects and then silently normalize it on write.
There was a problem hiding this comment.
Fixed in 180a9f9 — FieldFromJson now rejects non-UTC offsets for timestamptz/timestamptz_ns default values (new TemporalUtils::IsUtcOffset, which reuses the existing timezone-suffix parser). This follows Java iiuc
| } | ||
| if (field.type() == nullptr || !field.type()->is_primitive()) { | ||
| return InvalidSchema( | ||
| "Invalid {} value for {}: default values are only supported for primitive types", |
There was a problem hiding this comment.
This rejects all non-primitive defaults. The spec allows JSON single-value defaults for structs/lists/maps, and a struct field may have a non-null empty struct default ({}) while sub-field defaults live in field metadata. If C++ is intentionally matching the current Java model for this first patch, please call out that this remains a spec gap; otherwise this will reject legal v3 schema metadata.
There was a problem hiding this comment.
Intentional for this patch — it matches Java's current primitive-only model. Added a comment in ValidateDefault marking struct/list/map single-value defaults as a known spec gap / follow-up (180a9f9).
| "Invalid {} value for {}: default values are only supported for primitive types", | ||
| kind, field.name()); | ||
| } | ||
| if (*value.type() != *field.type()) { |
There was a problem hiding this comment.
Java casts the provided default literal to the field type in Types.NestedField (Literal.of(34) can become a long/date default, string literals can become date/time/timestamp/uuid, etc.). This check requires exact type equality, so schemas built through the C++ API can be rejected even when Java accepts and normalizes the same default.
There was a problem hiding this comment.
Fixed in 180a9f9 — ValidateDefault now casts the default to the field type via Literal::CastTo (e.g. int -> long) instead of requiring exact type equality, matching Java's Types.NestedField. Only uncastable or out-of-range defaults are rejected.
| cases.emplace_back(decimal(9, 2), Literal::Decimal(12345, 9, 2)); | ||
| cases.emplace_back(fixed(3), Literal::Fixed({0x01, 0x02, 0x03})); | ||
| cases.emplace_back(binary(), Literal::Binary({0xDE, 0xAD, 0xBE, 0xEF})); | ||
| cases.emplace_back(uuid(), Literal::UUID(uuid_value)); |
There was a problem hiding this comment.
This round-trip list should include time, timestamptz, timestamp_ns, and timestamptz_ns, plus a negative case for non-UTC timestamptz defaults. Those are the encodings most likely to drift from Java/spec behavior; the current cases do not cover the offset rule above.
There was a problem hiding this comment.
Done in 180a9f9 — added time/timestamptz/timestamp_ns/timestamptz_ns to the round-trip cases, plus a negative test (SchemaFieldRejectsNonUtcTimestamptzDefault) for non-UTC timestamptz defaults.
Addresses wgtmac's review feedback: - Enforce UTC offset for timestamptz / timestamptz_ns default values in FieldFromJson. The shared timestamp parser accepts any offset and silently normalizes to UTC, so C++ could accept default metadata Java rejects and rewrite the offset on write. Added TemporalUtils::IsUtcOffset, which reuses the existing timezone-suffix parser. - ValidateDefault now casts the default literal to the field type (Literal::CastTo) instead of requiring an exact type match, matching Java's Types.NestedField (e.g. an int default on a long field). Uncastable or out-of-range defaults are still rejected. - Documented that non-primitive (struct/list/map) defaults remain unsupported, matching Java's current model. - Extended the json_serde round-trip tests with time/timestamptz/timestamp_ns/ timestamptz_ns and a negative case for non-UTC timestamptz defaults; added a schema test for the cast-to-field-type behavior.
Directly cover the UTC/non-UTC/unparseable offset cases the timestamptz default-value enforcement relies on.
The AMD64 Windows job failed in its cleanup step (rm -rf example/build: Device or resource busy) after the example compiled and linked successfully; re-run CI.
Part 1 of a multi-part split of #730 (column default values, item 2 of #637). The full
end-to-end implementation is in #731, kept open as the proof-of-concept; this series
lands it in reviewable pieces.
This PR is the schema foundation — representing, serializing and validating v3
column default values. It is purely additive and changes no read or write behavior on
its own.
What's in this PR
SchemaFieldcarriesinitial-default/write-default, stored asstd::shared_ptr<const Literal>(immutable payload shared across copies, like theadjacent
type_; the C++ analog of Java'sfinal Literal<?>). They are set via theconstructor. Getters return
std::optional<std::reference_wrapper<const Literal>>forreading (the
Schema::FindFieldByNameidiom);initial_default_ptr()/write_default_ptr()expose the shared pointer so a rebuilt field (e.g. IDreassignment) shares the value instead of copying it.
initial-default/write-defaultusing the existingsingle-value serialization (all primitive types).
Schema::Validate: version-gates theinitial-defaultto format v3(
kMinFormatVersionDefaultValues) — it reinterprets how existing data files are read,so it requires the v3 reader contract. The
write-defaultonly affects values writtengoing forward and is not version-gated (matching Java's
Schema.checkCompatibility,which gates only the initial default). Both defaults are otherwise validated to be
non-null primitive literals matching the field type.
initial-defaultmaps to
FieldProjection::Kind::kDefaultcarrying the literal (the per-format readersconsume this in the follow-up PRs).
Follow-ups (stacked on this PR)
literal_util+ parquet projection/materialization)UpdateSchemaadd/update column defaults)Testing
Added tests