Skip to content

feat(schema): represent, serialize and validate v3 column default values (1/4)#746

Open
huan233usc wants to merge 9 commits into
apache:mainfrom
huan233usc:feat/default-values-schema
Open

feat(schema): represent, serialize and validate v3 column default values (1/4)#746
huan233usc wants to merge 9 commits into
apache:mainfrom
huan233usc:feat/default-values-schema

Conversation

@huan233usc

@huan233usc huan233usc commented Jun 15, 2026

Copy link
Copy Markdown

Part 1 of a multi-part split of #730 (column default values, item 2 of #637). The full
end-to-end implementation is in #731, kept open as the proof-of-concept; this series
lands it in reviewable pieces.

This PR is the schema foundation — representing, serializing and validating v3
column default values. It is purely additive and changes no read or write behavior on
its own.

What's in this PR

  • SchemaField carries initial-default / write-default, stored as
    std::shared_ptr<const Literal> (immutable payload shared across copies, like the
    adjacent type_; the C++ analog of Java's final Literal<?>). They are set via the
    constructor. Getters return std::optional<std::reference_wrapper<const Literal>> for
    reading (the Schema::FindFieldByName idiom); initial_default_ptr() /
    write_default_ptr() expose the shared pointer so a rebuilt field (e.g. ID
    reassignment) shares the value instead of copying it.
  • JSON serde: parse/write initial-default / write-default using the existing
    single-value serialization (all primitive types).
  • Schema::Validate: version-gates the initial-default to format v3
    (kMinFormatVersionDefaultValues) — it reinterprets how existing data files are read,
    so it requires the v3 reader contract. The write-default only affects values written
    going forward and is not version-gated (matching Java's Schema.checkCompatibility,
    which gates only the initial default). Both defaults are otherwise validated to be
    non-null primitive literals matching the field type.
  • Generic projection: a column missing from a data file with an initial-default
    maps to FieldProjection::Kind::kDefault carrying the literal (the per-format readers
    consume this in the follow-up PRs).

Follow-ups (stacked on this PR)

  • read path — Parquet (literal_util + parquet projection/materialization)
  • read path — Avro
  • schema evolution (UpdateSchema add/update column defaults)

Testing

Added tests

Comment on lines +148 to +149
std::shared_ptr<const Literal> initial_default_;
std::shared_ptr<const Literal> write_default_;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReassignField constructs a new SchemaField via the 5-argument constructor which initializes initial_default_ and write_default_ to nullptr. When schema IDs are reassigned (e.g., copying a schema with fresh IDs via the Schema(get_id) path), all default values on fields are silently lost. We should copy all field properties including initialDefault and writeDefault.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, confirmed. Defaults are now constructor args, and ReassignField passes the source field's initial_default_ptr()/write_default_ptr() through, so they're shared with the reassigned field, not lost. Added ReassignIdsPreservesDefaultValues.

Comment thread src/iceberg/json_serde.cc
Comment on lines +571 to +580
if (initial_default_json.has_value()) {
ICEBERG_ASSIGN_OR_RAISE(Literal literal,
LiteralFromJson(*initial_default_json, field.type().get()));
field = field.WithInitialDefault(std::move(literal));
}
if (write_default_json.has_value()) {
ICEBERG_ASSIGN_OR_RAISE(Literal literal,
LiteralFromJson(*write_default_json, field.type().get()));
field = field.WithWriteDefault(std::move(literal));
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deserialization first constructs a bare SchemaField, then conditionally calls WithInitialDefault/WithWriteDefault, each of which copies the entire field (including the shared_ptr<Type>). This is an unnecessary intermediate copy.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — FieldFromJson now parses the defaults first and builds the field in one construction. Intermediate copy gone.

Comment thread src/iceberg/schema_field.cc Outdated
Comment on lines +76 to +80
SchemaField SchemaField::WithInitialDefault(Literal initial_default) const {
SchemaField copy = *this;
copy.initial_default_ = std::make_shared<const Literal>(std::move(initial_default));
return copy;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's need to copy the whole SchemaField, can we just set the initial_default_ field and return *this.
Also the following With methods.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — moved defaults into the constructor and removed both With... methods, so construction no longer copies the field.

@huan233usc huan233usc force-pushed the feat/default-values-schema branch 3 times, most recently from 1ee5b32 to 34470af Compare June 16, 2026 05:30
@huan233usc huan233usc requested a review from WZhuo June 16, 2026 05:38

@WZhuo WZhuo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds Iceberg v3 column default value support at the schema layer by carrying, JSON-(de)serializing, validating, and projecting initial-default / write-default literals (foundation for later read/write-path work).

Changes:

  • Extend SchemaField to store initial-default / write-default as shared immutable literals and include them in equality/ID-reassignment rebuilds.
  • Add JSON serde for initial-default / write-default using existing single-value literal serialization.
  • Update schema projection to use FieldProjection::Kind::kDefault when an expected field is missing but has initial-default, and add/extend unit tests + v3 metadata fixture.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/iceberg/test/schema_util_test.cc Adds projection tests for missing fields with initial-default and for ignoring defaults when the field is present.
src/iceberg/test/schema_test.cc Adds schema validation/version-gating and ID-reassignment preservation tests for default values.
src/iceberg/test/schema_json_test.cc Adds round-trip and failure tests for JSON serialization/parsing of default values (incl. nested structs).
src/iceberg/test/resources/TableMetadataV3Valid.json Adds a v3-valid table metadata JSON fixture.
src/iceberg/schema.cc Preserves defaults during ID reassignment and adds default-related validation in Schema::Validate.
src/iceberg/schema_util.cc Projects missing fields with initial-default as kDefault rather than error/null.
src/iceberg/schema_field.h Extends SchemaField API/storage to carry default literals and expose them via optional reference accessors.
src/iceberg/schema_field.cc Implements default accessors, validation of defaults, and includes defaults in equality.
src/iceberg/json_serde.cc Serializes/parses initial-default / write-default on schema fields via literal single-value serde.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/iceberg/schema_field.cc
Comment thread src/iceberg/schema.cc
First of a multi-part split of column default value support (apache#730) — the
schema foundation the read and evolution paths build on. Purely additive;
no read/write behavior change on its own.

- SchemaField carries `initial-default` / `write-default` (immutable
  std::shared_ptr<const Literal>) with copy-preserving WithInitialDefault /
  WithWriteDefault modifiers; getters return optional<reference_wrapper>.
- JSON serde reads/writes `initial-default` / `write-default` via the
  existing single-value serialization.
- Schema::Validate rejects default values below format v3 and validates
  they are non-null primitive literals matching the field type.
- Generic schema projection maps a column missing from a data file with an
  initial-default to FieldProjection::Kind::kDefault.

Read-path application (Parquet/Avro) and schema evolution follow in separate
PRs. See apache#731 for the full end-to-end proof-of-concept.
@huan233usc huan233usc force-pushed the feat/default-values-schema branch from 34470af to f663e0e Compare June 18, 2026 17:21
Comment thread src/iceberg/test/resources/TableMetadataV3Valid.json Outdated
Comment thread src/iceberg/json_serde.cc
This file was added by this PR but is referenced by no test (table-metadata
resources are loaded by explicit name via ReadTableMetadataFromResource) and
carries no column defaults, so it is dead. Remove it.
Add SchemaField default-value coverage directly to json_serde_test (the
binary that exercises json_serde.cc): one test pins the initial-default /
write-default wire format (incl. their absence when unset), and one round-trips
a field with both defaults through ToJson -> FieldFromJson for every primitive
type, exercising the non-trivial date/timestamp/decimal/uuid/binary encodings
the default path reuses.
@huan233usc huan233usc requested a review from zhjwpku June 20, 2026 18:10
@huan233usc

Copy link
Copy Markdown
Author

Thank you @zhjwpku and @WZhuo for review!
Hi @wgtmac, do you mind take another look as you requested copilot to review on your behalf. Thanks

@wgtmac wgtmac left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Static review notes; I did not build or run tests.

Comment thread src/iceberg/json_serde.cc
std::shared_ptr<const Literal> initial_default;
if (initial_default_json.has_value()) {
ICEBERG_ASSIGN_OR_RAISE(Literal literal,
LiteralFromJson(*initial_default_json, type.get()));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default-value parsing should enforce the JSON single-value rule for timestamptz and timestamptz_ns. The shared timestamp parser accepts offsets such as +05:00, but the spec and Java SingleValueParser only accept UTC (+00:00) for these default values. As written, C++ can accept schema metadata that Java rejects and then silently normalize it on write.

@huan233usc huan233usc Jun 23, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 180a9f9FieldFromJson now rejects non-UTC offsets for timestamptz/timestamptz_ns default values (new TemporalUtils::IsUtcOffset, which reuses the existing timezone-suffix parser). This follows Java iiuc

}
if (field.type() == nullptr || !field.type()->is_primitive()) {
return InvalidSchema(
"Invalid {} value for {}: default values are only supported for primitive types",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rejects all non-primitive defaults. The spec allows JSON single-value defaults for structs/lists/maps, and a struct field may have a non-null empty struct default ({}) while sub-field defaults live in field metadata. If C++ is intentionally matching the current Java model for this first patch, please call out that this remains a spec gap; otherwise this will reject legal v3 schema metadata.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional for this patch — it matches Java's current primitive-only model. Added a comment in ValidateDefault marking struct/list/map single-value defaults as a known spec gap / follow-up (180a9f9).

Comment thread src/iceberg/schema_field.cc Outdated
"Invalid {} value for {}: default values are only supported for primitive types",
kind, field.name());
}
if (*value.type() != *field.type()) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java casts the provided default literal to the field type in Types.NestedField (Literal.of(34) can become a long/date default, string literals can become date/time/timestamp/uuid, etc.). This check requires exact type equality, so schemas built through the C++ API can be rejected even when Java accepts and normalizes the same default.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 180a9f9ValidateDefault now casts the default to the field type via Literal::CastTo (e.g. int -> long) instead of requiring exact type equality, matching Java's Types.NestedField. Only uncastable or out-of-range defaults are rejected.

cases.emplace_back(decimal(9, 2), Literal::Decimal(12345, 9, 2));
cases.emplace_back(fixed(3), Literal::Fixed({0x01, 0x02, 0x03}));
cases.emplace_back(binary(), Literal::Binary({0xDE, 0xAD, 0xBE, 0xEF}));
cases.emplace_back(uuid(), Literal::UUID(uuid_value));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This round-trip list should include time, timestamptz, timestamp_ns, and timestamptz_ns, plus a negative case for non-UTC timestamptz defaults. Those are the encodings most likely to drift from Java/spec behavior; the current cases do not cover the offset rule above.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 180a9f9 — added time/timestamptz/timestamp_ns/timestamptz_ns to the round-trip cases, plus a negative test (SchemaFieldRejectsNonUtcTimestamptzDefault) for non-UTC timestamptz defaults.

Addresses wgtmac's review feedback:

- Enforce UTC offset for timestamptz / timestamptz_ns default values in
  FieldFromJson. The shared timestamp parser accepts any offset and silently
  normalizes to UTC, so C++ could accept default metadata Java rejects and
  rewrite the offset on write. Added TemporalUtils::IsUtcOffset, which reuses
  the existing timezone-suffix parser.
- ValidateDefault now casts the default literal to the field type (Literal::CastTo)
  instead of requiring an exact type match, matching Java's Types.NestedField
  (e.g. an int default on a long field). Uncastable or out-of-range defaults are
  still rejected.
- Documented that non-primitive (struct/list/map) defaults remain unsupported,
  matching Java's current model.
- Extended the json_serde round-trip tests with time/timestamptz/timestamp_ns/
  timestamptz_ns and a negative case for non-UTC timestamptz defaults; added a
  schema test for the cast-to-field-type behavior.
Directly cover the UTC/non-UTC/unparseable offset cases the timestamptz
default-value enforcement relies on.
The AMD64 Windows job failed in its cleanup step (rm -rf example/build:
Device or resource busy) after the example compiled and linked successfully;
re-run CI.
@huan233usc huan233usc requested a review from wgtmac June 23, 2026 21:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants