gh-152100: Fuse set-operation character classes into a single charset#152214

Open
serhiy-storchaka wants to merge 1 commit into
python:mainfrom
serhiy-storchaka:re-set-ops

@serhiy-storchaka

Member

A compile-time optimization for the set operations added in gh-152100. No behaviour change -- same matches, fewer engine steps.

The parser lowers set difference [A--B] to A(?<![B]): a character class followed by a negative lookbehind. But the engine's charset() walk treats every NEGATE as a polarity toggle (see Modules/_sre/sre_lib.h), so one character set can express the difference directly -- [NEGATE] B [NEGATE] A matches A minus B in a single charset test instead of a charset match plus a lookbehind rescan.
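To make the polarity-toggle argument concrete, here is a minimal pure-Python sketch of the walk. It is a simplified model, not the code in Modules/_sre/sre_lib.h: the tuple encoding and the helper names (`literal`, `char_range`, `charset`) are assumptions made for illustration, and only literals and ranges are modelled.

```python
# Simplified model of the engine's charset() walk. The item encoding below is
# illustrative only; it is not the engine's real opcode layout.
NEGATE = ("negate",)


def literal(ch):
    return ("literal", ch)


def char_range(lo, hi):
    return ("range", lo, hi)


def charset(items, ch):
    """Return True if ch is accepted by the item list."""
    ok = True
    for item in items:
        kind = item[0]
        if kind == "negate":
            ok = not ok              # flip polarity for the items that follow
        elif kind == "literal":
            if ch == item[1]:
                return ok            # first hit decides, using current polarity
        elif kind == "range":
            if item[1] <= ch <= item[2]:
                return ok
    return not ok                    # end of list: no item hit


# [NEGATE] B [NEGATE] A with A = a-z and B = aeiou
consonants = [NEGATE, *map(literal, "aeiou"), NEGATE, char_range("a", "z")]

accepted = "".join(c for c in "abcdefghij0-" if charset(consonants, c))
print(accepted)  # bcdfghj
```

A vowel hits a B item while the polarity is inverted, so it is rejected before the walk reaches A. Anything else falls through to A under normal polarity, and a character that hits nothing is rejected at the end of the list.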

A new pass in Lib/re/_optimizer.py performs two folds:

  • Set difference [A--B] (or an explicit A(?<![B])) into the single charset [NEGATE] B [NEGATE] A. _optimize_charset is made segment-aware so the interior NEGATE compiles correctly; the parser keeps emitting the plain A(?<![B]) form.
  • Union with a non-flat operand, such as [0-9||[a-z--b]], into a single IN. The parser leaves such a union as a BRANCH it cannot merge; once every alternative is a one-character matcher, the pass concatenates their item lists into one IN.
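The "same matches" claim for both folds can be checked on any current Python without the new syntax. The sketch below spells the lowered forms explicitly and compares them with hand-flattened classes. The flattened classes are a stand-in: a charset with an interior NEGATE has no pattern-level spelling, so this checks only the match sets, not the optimizer itself.

```python
import re
import string

sample = string.printable * 3

# Set difference: the lowered form A(?<![B]) versus a hand-flattened class.
lowered_diff = re.compile(r"[a-z](?<![aeiou])")
flat_diff = re.compile(r"[b-df-hj-np-tv-z]")

# Union with a non-flat operand: an explicit alternation of one-character
# matchers versus a hand-flattened class.
branch_union = re.compile(r"(?:[0-9]|[a-z](?<![b]))")
flat_union = re.compile(r"[0-9ac-z]")

same_diff = lowered_diff.findall(sample) == flat_diff.findall(sample)
same_union = branch_union.findall(sample) == flat_union.findall(sample)
print(same_diff, same_union)
```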

Besides removing the lookbehind rescan, the fused charset also becomes the INFO prefix, so the engine can fast-skip non-matching positions. On scan-heavy input this is several times faster (findall over 50 KB of non-matching text, best of 7):

Pattern            Speedup
[a-z--aeiou]       5.6x
[\w--\d]           5.3x
[0-9||[a-z--b]]    9.4x

On match-dense input the per-match win is smaller, masked by match-object construction.
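A rough way to reproduce the shape of the scan-heavy measurement on an unpatched interpreter is sketched below. Several things here are assumptions rather than the benchmark actually used: the input is 50,000 vowels, so every position passes [a-z] and is then rejected. The hand-flattened class stands in for the fused charset, and `best_of` is a hypothetical helper. The ratio it prints depends on the machine and Python version and is not expected to match the figures above.

```python
import re
import timeit

# Assumed input: every character is in A (a-z) but also in B (aeiou),
# so neither pattern ever matches.
text = "aeiou" * 10_000

lowered = re.compile(r"[a-z](?<![aeiou])")   # charset match plus lookbehind
flat = re.compile(r"[b-df-hj-np-tv-z]")      # single charset test


def best_of(pattern, repeat=7):
    """Best wall-clock time of `repeat` single findall() runs, in seconds."""
    return min(timeit.repeat(lambda: pattern.findall(text),
                             repeat=repeat, number=1))


t_lowered = best_of(lowered)
t_flat = best_of(flat)
print(f"lowered: {t_lowered * 1e3:.3f} ms  flat: {t_flat * 1e3:.3f} ms")
```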

🤖 Generated with Claude Code

…harset

Add a compile-time optimization pass (Lib/re/_optimizer.py) that rewrites
set-operation character classes into a single character set where the
engine's charset() representation allows it.  charset() treats every NEGATE
as a polarity toggle, so a mid-list NEGATE expresses set difference and a
flat run expresses union.

Set difference -- [A--B], emitted by the parser as A(?<![B]) -- fuses into
the charset [NEGATE] B [NEGATE] A, matching A minus B in one test instead of
a charset match plus a lookbehind rescan.  _optimize_charset is made
segment-aware so the interior NEGATE compiles correctly.

A union with a non-flat operand, such as [0-9||[a-z--b]], is emitted by the
parser as a BRANCH that it cannot merge.  Once its alternatives are all
one-character matchers, their item lists are concatenated into a single IN.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>