gh-95555: Support Unicode property escapes \p{...} in regular expressions #151969

Open
serhiy-storchaka wants to merge 2 commits into python:main from serhiy-storchaka:re-properties

@serhiy-storchaka serhiy-storchaka commented Jun 23, 2026

Member

Add support for \p{property} and \P{property} escapes in Unicode (str) regular expressions, for the properties the engine can resolve without the unicodedata database. They are matched either as CATEGORY opcodes (character predicates and combinations of them) or as fixed sets of character ranges, so neither the matcher nor the compiler gains a unicodedata dependency.

Supported in this change:

  • many General_Category values — the groups L, N, Z, C and the values Lu, Lt, Lm, Nd, Nl, No, Zs, Zl, Zp, Cc, Cf, Cs, Co and Cn;
  • the binary properties Alphabetic, Lowercase, Uppercase, Numeric, Printable, XID_Start, XID_Continue, Cased and Case_Ignorable;
  • the POSIX compatibility classes alpha, alnum, blank, cntrl, digit, graph, lower, print, space, upper, word and xdigit;
  • the code-point classes ASCII, Any, Assigned, Noncharacter_Code_Point, Join_Control and the immutable Pattern_Syntax and Pattern_White_Space.
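
Code that has to run both with and without this change could probe for support before relying on the new escapes. The following is a minimal sketch, not part of the PR: the helper name `supports_property_escapes` is hypothetical, and it assumes that an interpreter without the feature rejects `\p` with `re.error`, as current releases do for unknown letter escapes.

```python
import re


def supports_property_escapes():
    """Return True if this interpreter's re module accepts \\p{...}.

    Hypothetical helper: it simply tries to compile a pattern that uses
    one of the General_Category values listed above.
    """
    try:
        re.compile(r"\p{Lu}")
    except re.error:
        # Interpreters without the feature report a bad escape here.
        return False
    return True


if supports_property_escapes():
    # \p{Lu} matches an uppercase letter, \P{Lu} anything else.
    capitalized = re.compile(r"\p{Lu}\P{Lu}*")
else:
    capitalized = None
```

The probe compiles a throwaway pattern once, so callers can choose between a `\p{...}` pattern and their existing fallback at import time.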

Property and value names use loose matching (rule UAX44-LM3 of UAX #44), and a property may be spelled \p{Lu}, \p{gc=Lu} or \p{name=yes}.
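
To illustrate the loose matching and the accepted spellings, here is a minimal sketch of how the text between the braces could be normalized. It is an assumption about the approach, not the code in `Lib/re/_properties.py`: the names `loose_key` and `split_property` are hypothetical, and the rule is implemented as UAX44-LM3 is commonly summarized (ignore case, whitespace, underscores, hyphens and an initial "is").

```python
import re


def loose_key(name):
    """Normalize a property or value name roughly per UAX44-LM3."""
    # Drop whitespace, underscores and hyphens, then fold case.
    key = re.sub(r"[\s_-]+", "", name).lower()
    # Drop an initial "is" prefix. Lookup tables must be keyed with the
    # same function so that both sides are normalized identically.
    return key[2:] if key.startswith("is") else key


def split_property(body):
    """Split the text between the braces into (name, value) loose keys.

    The value is None for the short form, such as 'Lu'.
    """
    name, sep, value = body.partition("=")
    if not sep:
        return loose_key(name), None
    return loose_key(name), loose_key(value)
```

With this sketch, `split_property("gc=Lu")` gives `("gc", "lu")` and `split_property("XID_Start")` gives `("xidstart", None)`; resolving the short form to a property or a General_Category value would be a separate table lookup.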

The remaining table-based properties (the General_Category values Ll/Lo and the M/P/S families, Block, and the other enumerated properties) require the unicodedata tables and are intentionally left out of this first change, to be added separately.
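
Until the table-based categories are available as escapes, they can be approximated outside the regex engine. The sketch below is only an illustration under that assumption: `category_runs` is a hypothetical helper that uses `unicodedata.category` directly, which is exactly the dependency the matcher in this PR avoids.

```python
from itertools import groupby
import unicodedata


def category_runs(text, category):
    """Return maximal runs of characters in a General_Category.

    `category` may be a full value such as "Ll" or a one-letter group
    such as "P"; the comparison is a simple prefix test.
    """
    def matches(ch):
        return unicodedata.category(ch).startswith(category)

    # groupby splits the text into consecutive runs with the same result.
    return ["".join(group) for hit, group in groupby(text, key=matches) if hit]
```

For example, `category_runs("abcDEF ghi", "Ll")` returns the lowercase runs, which is roughly what a pattern like `\p{Ll}+` would find once Ll is supported.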

…xpressions

Add support for \p{property} and \P{property} in Unicode (str) regular
expressions, for the properties the engine can resolve without the
unicodedata database.  They are matched either as CATEGORY opcodes
(character predicates and combinations of them, see sre.c) or as fixed
sets of character ranges.

Supported properties:

* many General_Category values -- the groups L, N, Z, C and the values Lu,
  Lt, Lm, Nd, Nl, No, Zs, Zl, Zp, Cc, Cf, Cs, Co and Cn;
* the binary properties Alphabetic, Lowercase, Uppercase, Numeric,
  Printable, XID_Start, XID_Continue, Cased and Case_Ignorable;
* the POSIX compatibility classes alpha, alnum, blank, cntrl, digit, graph,
  lower, print, space, upper, word and xdigit;
* the code-point classes ASCII, Any, Assigned, Noncharacter_Code_Point,
  Join_Control and the immutable Pattern_Syntax and Pattern_White_Space.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

read-the-docs-community Bot commented Jun 23, 2026

Documentation build overview

📚 cpython-previews | 🛠️ Build #33267275 | 📁 Comparing 388a3b6 against main (868d9a8)

  🔍 Preview build  

4 files changed
± library/dialog.html
± library/re.html
± whatsnew/3.16.html
± whatsnew/changelog.html

Comment thread Doc/library/re.rst
Matches ``[0-9]`` if the :py:const:`~re.ASCII` flag is used.

-__ https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf#G134153
+__ https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-4/#G124142

Member

This should be added to:

# The Unicode Database
# --------------------
# When changing UCD version please update
# * Doc/library/stdtypes.rst, and
# * Doc/library/unicodedata.rst
# * Doc/reference/lexical_analysis.rst (three occurrences)
UNIDATA_VERSION = "17.0.0"

Member Author

This is unrelated?

Member

I don't think it is, if you're updating the link in this PR.

Comment thread Lib/re/_properties.py
@@ -0,0 +1,267 @@
#
# Secret Labs' Regular Expression Engine

Member

This wasn't written by the company, nor is it licensed to them?

Member Author

I think (but I am not sure) that the Secret Labs credit is an internal joke. I can drop it if nobody can confirm this.

Member

I don't think it's a joke? It was a real company, founded by Fredrik Lundh. See his bio here, for reference.

Member Author

It seems I was wrong: it was a real Swedish company belonging to Fredrik Lundh (a.k.a. "the effbot"). References to it are everywhere in the code, even in the module name _sre. So I'll leave it. We're extending their engine, not re-attributing it.

… properties

They are complete fixed sets, matched as fixed ranges: Regional_Indicator
(the 26 symbols A..Z), ASCII_Hex_Digit (the ASCII hex digits, = POSIX
xdigit) and Hex_Digit (which adds the fullwidth forms).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
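
For reference, the three fixed sets described in this commit can be written out as explicit code-point ranges. This is a sketch on top of the existing `re` syntax, not the PR's internal representation: the range values are taken from the Unicode Character Database, and `char_class` is a hypothetical helper.

```python
import re

# Inclusive (first, last) code-point ranges.
REGIONAL_INDICATOR = [(0x1F1E6, 0x1F1FF)]  # the 26 symbols A..Z
ASCII_HEX_DIGIT = [(0x30, 0x39), (0x41, 0x46), (0x61, 0x66)]
# Hex_Digit adds the fullwidth digits and the fullwidth letters A-F, a-f.
HEX_DIGIT = ASCII_HEX_DIGIT + [(0xFF10, 0xFF19), (0xFF21, 0xFF26), (0xFF41, 0xFF46)]


def char_class(ranges, negate=False):
    """Build a character class for the ranges; negate=True mimics \\P{...}."""
    body = "".join(f"\\U{lo:08X}-\\U{hi:08X}" for lo, hi in ranges)
    return f"[{'^' if negate else ''}{body}]"


# A flag emoji is a pair of Regional_Indicator symbols.
flag = re.compile(char_class(REGIONAL_INDICATOR) + "{2}")
```

Writing the sets as ranges keeps them independent of any database lookup, which is the property the commit message relies on for matching them as fixed ranges.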
2 participants