gh-152100: Fuse set-operation character classes into a single charset by serhiy-storchaka · Pull Request #152214 · python/cpython

serhiy-storchaka · 2026-06-25T15:31:12Z

A compile-time optimization for the set operations added in gh-152100. No behaviour change -- same matches, fewer engine steps.

The parser lowers set difference [A--B] to A(?<![B]): a character class followed by a negative lookbehind. But the engine's charset() walk treats every NEGATE as a polarity toggle (see Modules/_sre/sre_lib.h), so one character set can express the difference directly -- [NEGATE] B [NEGATE] A matches A minus B in a single charset test instead of a charset match plus a lookbehind rescan.
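The polarity-toggle argument can be checked with a small pure-Python model. This is only a sketch: `charset_match`, the tuple item encoding, and `agrees_on_ascii` are hypothetical names for illustration, not the code in Modules/_sre/sre_lib.h or anything in this PR. It compares the `[NEGATE] B [NEGATE] A` layout for `[a-z--aeiou]` against the lowered form `[a-z](?<![aeiou])`, which already parses on released Pythons.

```python
import re


def charset_match(items, ch):
    """Simplified model of a charset walk.

    NEGATE flips the current polarity; the first item that hits
    returns the polarity in force at that point.
    """
    ok = True
    for item in items:
        op = item[0]
        if op == "NEGATE":
            ok = not ok
        elif op == "LITERAL":
            if ch == item[1]:
                return ok
        elif op == "RANGE":
            if item[1] <= ch <= item[2]:
                return ok
        else:
            raise ValueError(f"unsupported item: {item!r}")
    # Nothing hit: the answer is the opposite of the final polarity.
    return not ok


# [a-z--aeiou] laid out as [NEGATE] B [NEGATE] A (illustrative encoding).
CONSONANTS = (
    [("NEGATE",)]
    + [("LITERAL", v) for v in "aeiou"]
    + [("NEGATE",), ("RANGE", "a", "z")]
)

# The lowered form described above: a class followed by a negative lookbehind.
LOWERED = re.compile(r"[a-z](?<![aeiou])")


def agrees_on_ascii():
    """Check the model against the lowered regex for every ASCII character."""
    return all(
        charset_match(CONSONANTS, chr(c)) == bool(LOWERED.fullmatch(chr(c)))
        for c in range(128)
    )
```

In the model, a vowel hits while the polarity is negative and is rejected, a consonant falls through to the `a-z` range under positive polarity, and anything else reaches the end and is rejected.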

A new pass in Lib/re/_optimizer.py performs two folds:

Set difference [A--B] (or an explicit A(?<![B])) into the single charset [NEGATE] B [NEGATE] A. _optimize_charset is made segment-aware so the interior NEGATE compiles correctly; the parser keeps emitting the plain A(?<![B]) form.
A union with a non-flat operand, such as [0-9||[a-z--b]], into a single IN. The parser leaves such a union as a BRANCH that it cannot merge; once every alternative is a one-character matcher, their item lists are concatenated into one IN.
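The two folds can be sketched on a toy item-list representation. Everything here is hypothetical (`fuse_difference`, `fuse_union`, `charset_match`, the tuple encoding) and is not the code in Lib/re/_optimizer.py; it assumes the operands passed to `fuse_difference` are flat lists with no NEGATE of their own, and it only checks the one example from this description by brute force.

```python
import string


def charset_match(items, ch):
    """Toy charset walk: NEGATE toggles polarity, a hit returns it."""
    ok = True
    for item in items:
        if item[0] == "NEGATE":
            ok = not ok
        elif item[0] == "LITERAL" and ch == item[1]:
            return ok
        elif item[0] == "RANGE" and item[1] <= ch <= item[2]:
            return ok
    return not ok


def fuse_difference(a_items, b_items):
    """Fold 1: [A--B] becomes [NEGATE] B [NEGATE] A."""
    return [("NEGATE",), *b_items, ("NEGATE",), *a_items]


def fuse_union(alternatives):
    """Fold 2: concatenate the item lists of one-character alternatives."""
    fused = []
    for items in alternatives:
        fused.extend(items)
    return fused


DIGITS = [("RANGE", "0", "9")]
LOWER = [("RANGE", "a", "z")]

# [0-9||[a-z--b]]: fold the inner difference, then the outer union.
LOWER_MINUS_B = fuse_difference(LOWER, [("LITERAL", "b")])
FUSED = fuse_union([DIGITS, LOWER_MINUS_B])


def expected(ch):
    """Set-theoretic definition of the same class, for comparison."""
    in_digits = ch in string.digits
    in_lower_minus_b = ch in string.ascii_lowercase and ch != "b"
    return in_digits or in_lower_minus_b


def check_ascii():
    return all(charset_match(FUSED, chr(c)) == expected(chr(c)) for c in range(128))
```

For this example the fused list is `0-9`, NEGATE, `b`, NEGATE, `a-z`: a single walk that accepts digits, rejects `b`, and accepts the remaining lowercase letters.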

Besides removing the lookbehind rescan, the fused charset also becomes the INFO prefix, so the engine can fast-skip non-matching positions. On scan-heavy input this is several times faster (findall over 50 KB of non-matching text, best of 7):

[a-z--aeiou]       5.6x
[\w--\d]           5.3x
[0-9||[a-z--b]]    9.4x

On match-dense input the per-match win is smaller, masked by match-object construction.
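A timing harness in the spirit of the scan-heavy measurement might look like the sketch below. It is not the script used for the numbers above. Since the set-operation syntax is not in released Pythons, it times the lowered form `[a-z](?<![aeiou])` against a hand-flattened consonant class; the text generator, the iteration count, and the helper names `make_text` and `best_of` are all assumptions.

```python
import re
import timeit

# Assumption: the lowered form from the description, which parses today.
LOWERED = re.compile(r"[a-z](?<![aeiou])")
# Hand-flattened equivalent (lowercase ASCII consonants), for comparison.
FLAT = re.compile(r"[b-df-hj-np-tv-z]")


def make_text(size=50_000):
    """Build `size` characters containing no lowercase consonants."""
    unit = "aeiou 0123 "
    return (unit * (size // len(unit) + 1))[:size]


def best_of(pattern, text, repeat=7, number=20):
    """Best of `repeat` runs, as seconds per single findall call."""
    timings = timeit.repeat(
        lambda: pattern.findall(text), repeat=repeat, number=number
    )
    return min(timings) / number


if __name__ == "__main__":
    text = make_text()
    for name, pattern in (("lowered", LOWERED), ("flat", FLAT)):
        per_call = best_of(pattern, text)
        print(f"{name:8} {per_call * 1e6:10.1f} us per findall")
```

Taking the minimum over repeats follows the "best of 7" convention quoted above; absolute numbers will vary by machine and build.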

🤖 Generated with Claude Code

…harset

Add a compile-time optimization pass (Lib/re/_optimizer.py) that rewrites set-operation character classes into a single character set where the engine's charset() representation allows it. charset() treats every NEGATE as a polarity toggle, so a mid-list NEGATE expresses set difference and a flat run expresses union.

Set difference -- [A--B], emitted by the parser as A(?<![B]) -- fuses into the charset [NEGATE] B [NEGATE] A, matching A minus B in one test instead of a charset match plus a lookbehind rescan. _optimize_charset is made segment-aware so the interior NEGATE compiles correctly.

A union with a non-flat operand, such as [0-9||[a-z--b]], is emitted by the parser as a BRANCH that it cannot merge. Once its alternatives are all one-character matchers, their item lists are concatenated into a single IN.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

bedevere-app Bot mentioned this pull request Jun 25, 2026

Support set operations in regular expression character classes #152100

Open

bedevere-app Bot added the awaiting core review label Jun 25, 2026

serhiy-storchaka added the skip news label Jun 25, 2026

serhiy-storchaka wants to merge 1 commit into python:main from serhiy-storchaka:re-set-ops
