feat(string_util): make ToLower Unicode-aware via utf8proc (2/2) by goel-skd · Pull Request #760 · apache/iceberg-cpp

goel-skd · 2026-06-19T01:30:41Z

Replaces the ASCII-only StringUtils::ToLower with a Unicode-aware
implementation backed by utf8proc,
so case-insensitive name handling matches Iceberg Java's
toLowerCase(Locale.ROOT).

ToLower now lower-cases UTF-8 input using utf8proc simple (1:1) case
mapping (e.g. CAFÉ → café, GROẞE → große). Invalid UTF-8 is
returned unchanged rather than erroring.
EqualsIgnoreCase now compares the lowercased forms of both inputs, so it
is case-insensitive for non-ASCII letters too.
ToUpper is intentionally left ASCII-only — it is not used for name
matching.
utf8proc is wired into both the CMake (vendored via FetchContent / system
package) and Meson (subprojects/utf8proc.wrap) builds.

Testing
Added/updated string_util_test.cc: ToLowerUnicode, ToUpperAsciiOnly,
and Unicode cases in EqualsIgnoreCase (including invalid-UTF-8
pass-through).

Closes #613.

Follow-up to #748

Replace the ASCII-only ToLower with utf8proc simple case mapping so case-insensitive name handling matches Iceberg Java's toLowerCase(Locale.ROOT). ToUpper stays ASCII-only since it is not used for name matching. EqualsIgnoreCase now compares lowercased forms. Wire utf8proc into both the CMake (vendored/system) and Meson builds. See apache#613.

wgtmac

I haven't checked the implementation and test yet. Just post some thoughts around APIs.

wgtmac · 2026-06-21T15:10:01Z

+  /// \brief Upper-case ASCII letters; non-ASCII (multibyte UTF-8) bytes pass through
+  /// unchanged.
+  ///
+  /// Unlike ToLower this is ASCII-only, since upper-casing is not used for name matching.


Unlike ToLower this is ASCII-only, since upper-casing is not used for name matching.

This is misleading to users. We should probably provide two pairs of functions like ToLowerAscii/ToUpperAscii and ToLowerUtf8/ToUpperUtf8.

ASCII is a subset of UTF-8, so the UTF-8 ToLower is already correct on ASCII input. A ToLowerAscii wouldn't be more correct, only faster on known-ASCII data, and it makes the caller decide "is my input ASCII?", which is sort of a foot-gun. One ToLower also matches Iceberg Java, which only has toLowerCase(Locale.ROOT) with no ascii/utf8 split.

I suggest that we skip ToUpperUtf8 either way. Simple-mapping uppercase is just wrong for things like ß (stays ß, should be SS), and nothing here uppercases non-ASCII anyway. The two ToUpper callers just normalize codec/mode strings to match GZIP/ALL.

You're right it's misleading though. I'll rename ToUpper -> ToUpperAscii so it's in the name, and keep ToLower as the Unicode one. WDYT?

My only concern on this is the naming. I don't think it is good if one is ToXXXAscii but the other is ToXXXUtf8. It would be good to just use ToUpper and ToLower when we only have a single pair and describe their behavior clearly in the comment.

@wgtmac I agree. What you are suggesting will be least confusing. I will make comment more descriptive along with changing the names of the functions

wgtmac · 2026-06-21T15:12:48Z

-
+  /// \brief Lower-case a UTF-8 string using Unicode simple case mapping.
+  ///
+  /// Mirrors Iceberg Java's case-insensitive handling, which lower-cases names with


I would not say this is Mirrors Iceberg Java's case-insensitive handling since we are not a 100% exact matching. If we cannot guarantee that, we need to clarify where we might diverge in the comment.

Agree, will post the differences

wgtmac · 2026-06-21T15:14:24Z

  static bool EqualsIgnoreCase(std::string_view lhs, std::string_view rhs) {
-    return std::ranges::equal(
-        lhs, rhs, [](char lc, char rc) { return ToLowerAscii(lc) == ToLowerAscii(rc); });
+    return ToLower(lhs) == ToLower(rhs);


This incur performance regression (if they are ASCII) because we cannot return early if their prefixes do not equal. Perhaps we can still implement this in a streaming approach?

Checked the callers, they're all short ASCII on cold paths (metrics mode, "true", a couple UUIDs, a header name), nothing per row. Short strings stay in SSO so ToLower doesn't allocate, and the UUID compare runs once per commit, so I believe there isn't a measurable regression here.

Iceberg Java does the same thing: case-insensitive lookup just lowercases the query and hits a map
(Schema.caseInsensitiveFindField), and that map is built by lowercasing every field name once (TypeUtil.indexByLowerCaseName). No streaming equalsIgnoreCase anywhere; the speed comes from caching the lowercased keys, which we already do in schema.cc/type.cc. So the hot path is covered and EqualsIgnoreCase is just the cold fallback.

Streaming is also trickier than it looks since you can't early-out on length: simple lowercasing can change byte length (İ → i goes 2 bytes to 1).

I suggest we keep this as is and revisit if a hot-path caller comes up.

goel-skd · 2026-06-24T04:11:00Z

I haven't checked the implementation and test yet. Just post some thoughts around APIs.

Thanks much @wgtmac. I responded to your comments.

goel-skd force-pushed the feat-613-unicode-lowercase branch from bec8884 to 69cc006 Compare June 19, 2026 01:40

goel-skd force-pushed the feat-613-unicode-lowercase branch from 69cc006 to f42e2da Compare June 19, 2026 02:13

wgtmac reviewed Jun 19, 2026

View reviewed changes

Comment thread cmake_modules/IcebergThirdpartyToolchain.cmake

Add license info to LICENSE

b8639d6

wgtmac requested changes Jun 21, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(string_util): make ToLower Unicode-aware via utf8proc (2/2)#760

feat(string_util): make ToLower Unicode-aware via utf8proc (2/2)#760
goel-skd wants to merge 2 commits into
apache:mainfrom
goel-skd:feat-613-unicode-lowercase

goel-skd commented Jun 19, 2026 •

edited

Loading

Uh oh!

Uh oh!

wgtmac left a comment

Uh oh!

wgtmac Jun 21, 2026

Uh oh!

goel-skd Jun 24, 2026

Uh oh!

wgtmac Jun 25, 2026

Uh oh!

goel-skd Jun 25, 2026

Uh oh!

wgtmac Jun 21, 2026

Uh oh!

goel-skd Jun 24, 2026

Uh oh!

wgtmac Jun 21, 2026

Uh oh!

goel-skd Jun 24, 2026

Uh oh!

goel-skd commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

goel-skd commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Testing

Uh oh!

Uh oh!

wgtmac left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

goel-skd commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

goel-skd commented Jun 19, 2026 •

edited

Loading