feat(string_util): make ToLower Unicode-aware via utf8proc (2/2)#760
feat(string_util): make ToLower Unicode-aware via utf8proc (2/2)#760goel-skd wants to merge 2 commits into
Conversation
bec8884 to
69cc006
Compare
Replace the ASCII-only ToLower with utf8proc simple case mapping so case-insensitive name handling matches Iceberg Java's toLowerCase(Locale.ROOT). ToUpper stays ASCII-only since it is not used for name matching. EqualsIgnoreCase now compares lowercased forms. Wire utf8proc into both the CMake (vendored/system) and Meson builds. See apache#613.
69cc006 to
f42e2da
Compare
wgtmac
left a comment
There was a problem hiding this comment.
I haven't checked the implementation and test yet. Just post some thoughts around APIs.
| /// \brief Upper-case ASCII letters; non-ASCII (multibyte UTF-8) bytes pass through | ||
| /// unchanged. | ||
| /// | ||
| /// Unlike ToLower this is ASCII-only, since upper-casing is not used for name matching. |
There was a problem hiding this comment.
Unlike ToLower this is ASCII-only, since upper-casing is not used for name matching.
This is misleading to users. We should probably provide two pairs of functions like ToLowerAscii/ToUpperAscii and ToLowerUtf8/ToUpperUtf8.
There was a problem hiding this comment.
ASCII is a subset of UTF-8, so the UTF-8 ToLower is already correct on ASCII input. A ToLowerAscii wouldn't be more correct, only faster on known-ASCII data, and it makes the caller decide "is my input ASCII?", which is sort of a foot-gun. One ToLower also matches Iceberg Java, which only has toLowerCase(Locale.ROOT) with no ascii/utf8 split.
I suggest that we skip ToUpperUtf8 either way. Simple-mapping uppercase is just wrong for things like ß (stays ß, should be SS), and nothing here uppercases non-ASCII anyway. The two ToUpper callers just normalize codec/mode strings to match GZIP/ALL.
You're right it's misleading though. I'll rename ToUpper -> ToUpperAscii so it's in the name, and keep ToLower as the Unicode one. WDYT?
There was a problem hiding this comment.
My only concern on this is the naming. I don't think it is good if one is ToXXXAscii but the other is ToXXXUtf8. It would be good to just use ToUpper and ToLower when we only have a single pair and describe their behavior clearly in the comment.
There was a problem hiding this comment.
@wgtmac I agree. What you are suggesting will be least confusing. I will make comment more descriptive along with changing the names of the functions
|
|
||
| /// \brief Lower-case a UTF-8 string using Unicode simple case mapping. | ||
| /// | ||
| /// Mirrors Iceberg Java's case-insensitive handling, which lower-cases names with |
There was a problem hiding this comment.
I would not say this is Mirrors Iceberg Java's case-insensitive handling since we are not a 100% exact matching. If we cannot guarantee that, we need to clarify where we might diverge in the comment.
There was a problem hiding this comment.
Agree, will post the differences
| static bool EqualsIgnoreCase(std::string_view lhs, std::string_view rhs) { | ||
| return std::ranges::equal( | ||
| lhs, rhs, [](char lc, char rc) { return ToLowerAscii(lc) == ToLowerAscii(rc); }); | ||
| return ToLower(lhs) == ToLower(rhs); |
There was a problem hiding this comment.
This incur performance regression (if they are ASCII) because we cannot return early if their prefixes do not equal. Perhaps we can still implement this in a streaming approach?
There was a problem hiding this comment.
Checked the callers, they're all short ASCII on cold paths (metrics mode, "true", a couple UUIDs, a header name), nothing per row. Short strings stay in SSO so ToLower doesn't allocate, and the UUID compare runs once per commit, so I believe there isn't a measurable regression here.
Iceberg Java does the same thing: case-insensitive lookup just lowercases the query and hits a map
(Schema.caseInsensitiveFindField), and that map is built by lowercasing every field name once (TypeUtil.indexByLowerCaseName). No streaming equalsIgnoreCase anywhere; the speed comes from caching the lowercased keys, which we already do in schema.cc/type.cc. So the hot path is covered and EqualsIgnoreCase is just the cold fallback.
Streaming is also trickier than it looks since you can't early-out on length: simple lowercasing can change byte length (İ → i goes 2 bytes to 1).
I suggest we keep this as is and revisit if a hot-path caller comes up.
Thanks much @wgtmac. I responded to your comments. |
Replaces the ASCII-only
StringUtils::ToLowerwith a Unicode-awareimplementation backed by utf8proc,
so case-insensitive name handling matches Iceberg Java's
toLowerCase(Locale.ROOT).ToLowernow lower-cases UTF-8 input using utf8proc simple (1:1) casemapping (e.g.
CAFÉ→café,GROẞE→große). Invalid UTF-8 isreturned unchanged rather than erroring.
EqualsIgnoreCasenow compares the lowercased forms of both inputs, so itis case-insensitive for non-ASCII letters too.
ToUpperis intentionally left ASCII-only — it is not used for namematching.
utf8proc is wired into both the CMake (vendored via FetchContent / system
package) and Meson (
subprojects/utf8proc.wrap) builds.Testing
Added/updated
string_util_test.cc:ToLowerUnicode,ToUpperAsciiOnly,and Unicode cases in
EqualsIgnoreCase(including invalid-UTF-8pass-through).
Closes #613.
Follow-up to #748