Optimize more regex cases with IndexOfAny #125586
stephentoub wants to merge 1 commit into dotnet:main
Extend TryEmitIndexOf (source generator) and EmitIndexOf/EmitIndexOfWithCharClassFallback (RegexCompiler) to handle complex character classes (Unicode categories like \w, \d, \s, and sets with subtraction) that previously fell through to scalar character-by-character loops. The optimization uses SearchValues<char> for vectorized ASCII scanning with a per-character MatchCharacterClass/CharInClass fallback for non-ASCII.

Key changes:
- EmitIndexOfAnyCustomHelper gains negate/last parameters, applying negation at the boolean level rather than toggling the set string flag (which breaks subtraction sets).
- TryEmitIndexOf always succeeds for set families, falling back to the custom helper.
- CanEmitIndexOf simplified to return true for all set families.
- EmitFixedSet_LeftToRight's inline IL extracted into reusable EmitIndexOfWithCharClassFallback.
- Repeater path changed from fragile Replace(IndexOf,Contains) to >= 0.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@MihuBot regexdiff
@MihuBot benchmark Regex
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
Pull request overview
This PR updates the regex compiler and source generator to better optimize searches involving complex character classes (e.g., \w, \d, \s, and non-enumerable / subtractive sets) by using an ASCII-vectorized SearchValues-based scan with a non-ASCII fallback, and refreshes the generator output baselines accordingly.
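The scan described above can be sketched in plain C#. This is a minimal illustration, assuming .NET 8 or later; the class and method names are hypothetical and are not the generator's actual output. It pairs the vectorized helper for \d with the scalar loop it replaces, so the two can be compared directly.

```csharp
using System;
using System.Buffers;

// Hypothetical sketch; names are illustrative only.
static class AsciiFastPathSketch
{
    // Only the ASCII members of \d; non-ASCII digits are handled by the fallback.
    private static readonly SearchValues<char> s_asciiDigits = SearchValues.Create("0123456789");

    // Returns the index of the first char that is not a Unicode digit, or -1 if every char is a digit.
    public static int IndexOfAnyExceptDigit(ReadOnlySpan<char> span)
    {
        // Vectorized step: skip the leading run of ASCII digits.
        int i = span.IndexOfAnyExcept(s_asciiDigits);
        if ((uint)i < (uint)span.Length)
        {
            // An ASCII char outside the set is a definite non-digit.
            if (char.IsAscii(span[i]))
            {
                return i;
            }

            // A non-ASCII char might still be a digit, so finish with a scalar walk.
            do
            {
                if (!char.IsDigit(span[i]))
                {
                    return i;
                }
                i++;
            }
            while ((uint)i < (uint)span.Length);
        }

        return -1;
    }

    // Loop length using the helper: -1 means the whole span matched.
    public static int CountLeadingDigits(ReadOnlySpan<char> span)
    {
        int i = IndexOfAnyExceptDigit(span);
        return i < 0 ? span.Length : i;
    }

    // The scalar loop that was previously emitted, kept as a reference.
    public static int CountLeadingDigitsScalar(ReadOnlySpan<char> span)
    {
        int n = 0;
        while ((uint)n < (uint)span.Length && char.IsDigit(span[n]))
        {
            n++;
        }
        return n;
    }
}
```

Once the scan meets a non-ASCII character, the sketch stays in the scalar loop for the rest of the span, so the speedup applies mainly to ASCII-dominated input.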
Changes:
- Refactors RegexCompiler to use a shared EmitIndexOfWithCharClassFallback implementation for complex set searches, and broadens CanEmitIndexOf to include all set nodes.
- Extends the generator's custom IndexOf helper emission to support negated searches and last-index variants, and adjusts generator logic to use index comparisons rather than Contains* string rewrites.
- Updates RegexGeneratorOutputTests expected outputs to reflect the new helper methods and emitted code patterns.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/libraries/System.Text.RegularExpressions/tests/FunctionalTests/RegexGeneratorOutputTests.cs | Updates expected generated source baselines to match new IndexOf helper patterns. |
| src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs | Adds a reusable vectorized ASCII + non-ASCII fallback path for complex character classes and uses it in more places. |
| src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs | Enhances emitted helper generation for complex char-class searches (negation + last-index support) and updates IndexOf-based checks. |
// i += direction;
Ldloc(iLocal);
Ldc(1);
if (useLast) { Sub(); } else { Add(); }
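The i += direction step in the snippet above suggests a single fallback loop that serves both the forward and the last-index search. A minimal sketch of that idea in plain C#, assuming .NET 8 or later: the class name, method name, and the runtime last flag are hypothetical. The PR chooses the direction when the code is emitted, not at run time.

```csharp
using System;
using System.Buffers;

// Hypothetical sketch; the real code picks the direction at emit time.
static class DirectionalSearchSketch
{
    private static readonly SearchValues<char> s_asciiDigits = SearchValues.Create("0123456789");

    // last == false: index of the first non-digit; last == true: index of the last non-digit.
    // Returns -1 when every char is a Unicode digit.
    public static int IndexOfAnyExceptDigit(ReadOnlySpan<char> span, bool last)
    {
        // Vectorized step from the chosen end of the span.
        int i = last
            ? span.LastIndexOfAnyExcept(s_asciiDigits)
            : span.IndexOfAnyExcept(s_asciiDigits);

        int direction = last ? -1 : 1;

        // The unsigned compare rejects both -1 and span.Length, so one condition covers both directions.
        while ((uint)i < (uint)span.Length)
        {
            // char.IsDigit is also correct for ASCII, so an ASCII non-digit returns on the first pass.
            if (!char.IsDigit(span[i]))
            {
                return i;
            }
            i += direction;
        }

        return -1;
    }
}
```

The unsigned bounds check is what lets the same loop terminate whether the index walks off the front or the back of the span.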
7681 out of 18857 patterns have generated source code changes.
Examples of GeneratedRegex source diffs
"\\s+" (24455 uses)
[GeneratedRegex("\\s+")]
// Match a whitespace character atomically at least once.
{
- int iteration = 0;
- while ((uint)iteration < (uint)slice.Length && char.IsWhiteSpace(slice[iteration]))
+ int iteration = slice.IndexOfAnyExceptWhiteSpace();
+ if (iteration < 0)
{
- iteration++;
+ iteration = slice.Length;
}
if (iteration == 0)
/// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
+ /// <summary>Finds the next index of any character that does not match a whitespace character.</summary>
+ [MethodImpl(MethodImplOptions.AggressiveInlining)]
+ internal static int IndexOfAnyExceptWhiteSpace(this ReadOnlySpan<char> span)
+ {
+ int i = span.IndexOfAnyExcept(Utilities.s_asciiWhiteSpace);
+ if ((uint)i < (uint)span.Length)
+ {
+ if (char.IsAscii(span[i]))
+ {
+ return i;
+ }
+
+ do
+ {
+ if (!char.IsWhiteSpace(span[i]))
+ {
+ return i;
+ }
+ i++;
+ }
+ while ((uint)i < (uint)span.Length);
+ }
+
+ return -1;
+ }
+
+ /// <summary>Supports searching for characters in or not in "\t\n\v\f\r ".</summary>
+ internal static readonly SearchValues<char> s_asciiWhiteSpace = SearchValues.Create("\t\n\v\f\r ");
+
/// <summary>Supports searching for characters in or not in "\t\n\v\f\r \u0085 \u2028\u2029 ".</summary>
internal static readonly SearchValues<char> s_whitespace = SearchValues.Create("\t\n\v\f\r \u0085 \u2028\u2029 ");
}
"^-?([^-+/*\\(\\)\\^\\s]+)" (17707 uses)
[GeneratedRegex("^-?([^-+/*\\(\\)\\^\\s]+)")]
{
int pos = base.runtextpos;
int matchStart = pos;
- char ch;
int capture_starting_pos = 0;
ReadOnlySpan<char> slice = inputSpan.Slice(pos);
// Match a character in the set [^(-+\-/^\s] atomically at least once.
{
- int iteration = 0;
- while ((uint)iteration < (uint)slice.Length && ((ch = slice[iteration]) < 128 ? ("쇿\uffff僾\uffff\uffff뿿\uffff\uffff"[ch >> 4] & (1 << (ch & 0xF))) != 0 : RegexRunner.CharInClass((char)ch, "\u0001\b\u0001(,-./0^_d")))
+ int iteration = slice.IndexOfAnyExcept_1C7FD78BD41F1B11FB5695FDEB937D92852BB6BB96E8336434F81435C7D874DE();
+ if (iteration < 0)
{
- iteration++;
+ iteration = slice.Length;
}
if (iteration == 0)
/// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
+
+ /// <summary>Finds the next index of any character that does not match a character in the set [^(-+\-/^\s].</summary>
+ [MethodImpl(MethodImplOptions.AggressiveInlining)]
+ internal static int IndexOfAnyExcept_1C7FD78BD41F1B11FB5695FDEB937D92852BB6BB96E8336434F81435C7D874DE(this ReadOnlySpan<char> span)
+ {
+ int i = span.IndexOfAnyExcept(Utilities.s_ascii_FFC1FFFFFE50FFFFFFFFFFBFFFFFFFFF);
+ if ((uint)i < (uint)span.Length)
+ {
+ if (char.IsAscii(span[i]))
+ {
+ return i;
+ }
+
+ char ch;
+ do
+ {
+ if (((ch = span[i]) < 128 ? ("쇿\uffff僾\uffff\uffff뿿\uffff\uffff"[ch >> 4] & (1 << (ch & 0xF))) == 0 : !RegexRunner.CharInClass((char)ch, "\u0001\b\u0001(,-./0^_d")))
+ {
+ return i;
+ }
+ i++;
+ }
+ while ((uint)i < (uint)span.Length);
+ }
+
+ return -1;
+ }
+
+ /// <summary>Supports searching for characters in or not in "\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f!\"#$%&',.0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]_`abcdefghijklmnopqrstuvwxyz{|}~\u007f".</summary>
+ internal static readonly SearchValues<char> s_ascii_FFC1FFFFFE50FFFFFFFFFFBFFFFFFFFF = SearchValues.Create("\0\u0001\u0002\u0003\u0004\u0005\u0006\a\b\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f!\"#$%&',.0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]_`abcdefghijklmnopqrstuvwxyz{|}~\u007f");
}
}
"\\%(\\d+)!.*?!" (17653 uses)
[GeneratedRegex("\\%(\\d+)!.*?!", RegexOptions.Singleline)]
// Match a Unicode digit atomically at least once.
{
- int iteration = 0;
- while ((uint)iteration < (uint)slice.Length && char.IsDigit(slice[iteration]))
+ int iteration = slice.IndexOfAnyExceptDigit();
+ if (iteration < 0)
{
- iteration++;
+ iteration = slice.Length;
}
if (iteration == 0)
/// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
+
+ /// <summary>Finds the next index of any character that does not match a Unicode digit.</summary>
+ [MethodImpl(MethodImplOptions.AggressiveInlining)]
+ internal static int IndexOfAnyExceptDigit(this ReadOnlySpan<char> span)
+ {
+ int i = span.IndexOfAnyExcept(Utilities.s_asciiDigits);
+ if ((uint)i < (uint)span.Length)
+ {
+ if (char.IsAscii(span[i]))
+ {
+ return i;
+ }
+
+ do
+ {
+ if (!char.IsDigit(span[i]))
+ {
+ return i;
+ }
+ i++;
+ }
+ while ((uint)i < (uint)span.Length);
+ }
+
+ return -1;
+ }
+
+ /// <summary>Supports searching for characters in or not in "0123456789".</summary>
+ internal static readonly SearchValues<char> s_asciiDigits = SearchValues.Create("0123456789");
}
}
"{\\s*(?<P>\\D\\w*)\\s*\\:\\s*var\\(\\s*(?<B> ..." (9881 uses)
[GeneratedRegex("{\\s*(?<P>\\D\\w*)\\s*\\:\\s*var\\(\\s*(?<B>\\D\\w*)\\s*\\)\\s*(;\\s*(?<P>\\D\\w*)\\s*\\:\\s*var\\(\\s*(?<B>\\D\\w*)\\s*\\)\\s*\\s*)*}")]
slice = inputSpan.Slice(pos);
charloop_starting_pos = pos;
- int iteration = 0;
- while ((uint)iteration < (uint)slice.Length && char.IsWhiteSpace(slice[iteration]))
+ int iteration = slice.IndexOfAnyExceptWhiteSpace();
+ if (iteration < 0)
{
- iteration++;
+ iteration = slice.Length;
}
slice = slice.Slice(iteration);
base.CheckTimeout();
}
- if (charloop_starting_pos >= charloop_ending_pos)
+ if (charloop_starting_pos >= charloop_ending_pos ||
+ (charloop_ending_pos = inputSpan.Slice(charloop_starting_pos, charloop_ending_pos - charloop_starting_pos).LastIndexOfAnyExceptDigit()) < 0)
{
UncaptureUntil(0);
return false; // The input didn't match.
}
- pos = --charloop_ending_pos;
+ charloop_ending_pos += charloop_starting_pos;
+ pos = charloop_ending_pos;
slice = inputSpan.Slice(pos);
CharLoopEnd:
// Match a word character atomically any number of times.
{
- int iteration1 = 1;
- while ((uint)iteration1 < (uint)slice.Length && Utilities.IsWordChar(slice[iteration1]))
+ int iteration1 = slice.Slice(1).IndexOfAnyExceptWordChar();
+ if (iteration1 < 0)
{
- iteration1++;
+ iteration1 = slice.Length - 1;
}
slice = slice.Slice(iteration1);
pos += iteration1;
}
+ pos++;
+ slice = inputSpan.Slice(pos);
base.Capture(2, capture_starting_pos, pos);
}
// Match a whitespace character atomically any number of times.
{
- int iteration2 = 0;
- while ((uint)iteration2 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration2]))
+ int iteration2 = slice.IndexOfAnyExceptWhiteSpace();
+ if (iteration2 < 0)
{
- iteration2++;
+ iteration2 = slice.Length;
}
slice = slice.Slice(iteration2);
// Match a whitespace character atomically any number of times.
{
- int iteration3 = 1;
- while ((uint)iteration3 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration3]))
+ int iteration3 = slice.Slice(1).IndexOfAnyExceptWhiteSpace();
+ if (iteration3 < 0)
{
- iteration3++;
+ iteration3 = slice.Length - 1;
}
slice = slice.Slice(iteration3);
}
// Match the string "var(".
- if (!slice.StartsWith("var("))
+ if (!slice.Slice(1).StartsWith("var("))
{
goto CharLoopBacktrack;
}
// Match a whitespace character greedily any number of times.
//{
- pos += 4;
+ pos += 5;
slice = inputSpan.Slice(pos);
charloop_starting_pos1 = pos;
- int iteration4 = 0;
- while ((uint)iteration4 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration4]))
+ int iteration4 = slice.IndexOfAnyExceptWhiteSpace();
+ if (iteration4 < 0)
{
- iteration4++;
+ iteration4 = slice.Length;
}
slice = slice.Slice(iteration4);
base.CheckTimeout();
}
- if (charloop_starting_pos1 >= charloop_ending_pos1)
+ if (charloop_starting_pos1 >= charloop_ending_pos1 ||
+ (charloop_ending_pos1 = inputSpan.Slice(charloop_starting_pos1, charloop_ending_pos1 - charloop_starting_pos1).LastIndexOfAnyExceptDigit()) < 0)
{
goto CharLoopBacktrack;
}
- pos = --charloop_ending_pos1;
+ charloop_ending_pos1 += charloop_starting_pos1;
+ pos = charloop_ending_pos1;
slice = inputSpan.Slice(pos);
CharLoopEnd1:
// Match a word character atomically any number of times.
{
- int iteration5 = 1;
- while ((uint)iteration5 < (uint)slice.Length && Utilities.IsWordChar(slice[iteration5]))
+ int iteration5 = slice.Slice(1).IndexOfAnyExceptWordChar();
+ if (iteration5 < 0)
{
- iteration5++;
+ iteration5 = slice.Length - 1;
}
slice = slice.Slice(iteration5);
pos += iteration5;
}
+ pos++;
+ slice = inputSpan.Slice(pos);
base.Capture(3, capture_starting_pos1, pos);
}
// Match a whitespace character atomically any number of times.
{
- int iteration6 = 0;
- while ((uint)iteration6 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration6]))
+ int iteration6 = slice.IndexOfAnyExceptWhiteSpace();
+ if (iteration6 < 0)
{
- iteration6++;
+ iteration6 = slice.Length;
}
slice = slice.Slice(iteration6);
slice = inputSpan.Slice(pos);
charloop_starting_pos2 = pos;
- int iteration7 = 0;
- while ((uint)iteration7 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration7]))
+ int iteration7 = slice.IndexOfAnyExceptWhiteSpace();
+ if (iteration7 < 0)
{
- iteration7++;
+ iteration7 = slice.Length;
}
slice = slice.Slice(iteration7);
slice = inputSpan.Slice(pos);
charloop_starting_pos3 = pos;
- int iteration8 = 0;
- while ((uint)iteration8 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration8]))
+ int iteration8 = slice.IndexOfAnyExceptWhiteSpace();
+ if (iteration8 < 0)
{
- iteration8++;
+ iteration8 = slice.Length;
}
slice = slice.Slice(iteration8);
base.CheckTimeout();
}
- if (charloop_starting_pos3 >= charloop_ending_pos3)
+ if (charloop_starting_pos3 >= charloop_ending_pos3 ||
+ (charloop_ending_pos3 = inputSpan.Slice(charloop_starting_pos3, charloop_ending_pos3 - charloop_starting_pos3).LastIndexOfAnyExceptDigit()) < 0)
{
goto LoopIterationNoMatch;
}
- pos = --charloop_ending_pos3;
+ charloop_ending_pos3 += charloop_starting_pos3;
+ pos = charloop_ending_pos3;
slice = inputSpan.Slice(pos);
CharLoopEnd3:
// Match a word character atomically any number of times.
{
- int iteration9 = 1;
- while ((uint)iteration9 < (uint)slice.Length && Utilities.IsWordChar(slice[iteration9]))
+ int iteration9 = slice.Slice(1).IndexOfAnyExceptWordChar();
+ if (iteration9 < 0)
{
- iteration9++;
+ iteration9 = slice.Length - 1;
}
slice = slice.Slice(iteration9);
pos += iteration9;
}
+ pos++;
+ slice = inputSpan.Slice(pos);
base.Capture(2, capture_starting_pos3, pos);
}
// Match a whitespace character atomically any number of times.
{
- int iteration10 = 0;
- while ((uint)iteration10 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration10]))
+ int iteration10 = slice.IndexOfAnyExceptWhiteSpace();
+ if (iteration10 < 0)
{
- iteration10++;
+ iteration10 = slice.Length;
}
slice = slice.Slice(iteration10);
// Match a whitespace character atomically any number of times.
{
- int iteration11 = 1;
- while ((uint)iteration11 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration11]))
+ int iteration11 = slice.Slice(1).IndexOfAnyExceptWhiteSpace();
+ if (iteration11 < 0)
{
- iteration11++;
+ iteration11 = slice.Length - 1;
}
slice = slice.Slice(iteration11);
}
// Match the string "var(".
- if (!slice.StartsWith("var("))
+ if (!slice.Slice(1).StartsWith("var("))
{
goto CharLoopBacktrack3;
}
// Match a whitespace character greedily any number of times.
//{
- pos += 4;
+ pos += 5;
slice = inputSpan.Slice(pos);
charloop_starting_pos4 = pos;
- int iteration12 = 0;
- while ((uint)iteration12 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration12]))
+ int iteration12 = slice.IndexOfAnyExceptWhiteSpace();
+ if (iteration12 < 0)
{
- iteration12++;
+ iteration12 = slice.Length;
}
slice = slice.Slice(iteration12);
base.CheckTimeout();
}
- if (charloop_starting_pos4 >= charloop_ending_pos4)
+ if (charloop_starting_pos4 >= charloop_ending_pos4 ||
+ (charloop_ending_pos4 = inputSpan.Slice(charloop_starting_pos4, charloop_ending_pos4 - charloop_starting_pos4).LastIndexOfAnyExceptDigit()) < 0)
{
goto CharLoopBacktrack3;
}
- pos = --charloop_ending_pos4;
+ charloop_ending_pos4 += charloop_starting_pos4;
+ pos = charloop_ending_pos4;
slice = inputSpan.Slice(pos);
CharLoopEnd4:
// Match a word character atomically any number of times.
{
- int iteration13 = 1;
- while ((uint)iteration13 < (uint)slice.Length && Utilities.IsWordChar(slice[iteration13]))
+ int iteration13 = slice.Slice(1).IndexOfAnyExceptWordChar();
+ if (iteration13 < 0)
{
- iteration13++;
+ iteration13 = slice.Length - 1;
}
slice = slice.Slice(iteration13);
pos += iteration13;
}
+ pos++;
+ slice = inputSpan.Slice(pos);
base.Capture(3, capture_starting_pos4, pos);
}
// Match a whitespace character atomically any number of times.
{
- int iteration14 = 0;
- while ((uint)iteration14 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration14]))
+ int iteration14 = slice.IndexOfAnyExceptWhiteSpace();
+ if (iteration14 < 0)
{
- iteration14++;
+ iteration14 = slice.Length;
}
slice = slice.Slice(iteration14);
// Match a whitespace character atomically any number of times.
{
- int iteration15 = 1;
- while ((uint)iteration15 < (uint)slice.Length && char.IsWhiteSpace(slice[iteration15]))
+ int iteration15 = slice.Slice(1).IndexOfAnyExceptWhiteSpace();
+ if (iteration15 < 0)
{
- iteration15++;
+ iteration15 = slice.Length - 1;
}
slice = slice.Slice(iteration15);
pos += iteration15;
}
+ pos++;
+ slice = inputSpan.Slice(pos);
base.Capture(1, capture_starting_pos2, pos);
Utilities.StackPush(ref base.runstack!, ref stackpos, capture_starting_pos2);
/// <summary>Whether <see cref="s_defaultTimeout"/> is non-infinite.</summary>
internal static readonly bool s_hasTimeout = s_defaultTimeout != Regex.InfiniteMatchTimeout;
+ /// <summary>Finds the next index of any character that does not match a whitespace character.</summary>
+ [MethodImpl(MethodImplOptions.AggressiveInlining)]
+ internal static int IndexOfAnyExceptWhiteSpace(this ReadOnlySpan<char> span)
+ {
+ int i = span.IndexOfAnyExcept(Utilities.s_asciiWhiteSpace);
+ if ((uint)i < (uint)span.Length)
+ {
+ if (char.IsAscii(span[i]))
+ {
+ return i;
+ }
+
+ do
+ {
+ if (!char.IsWhiteSpace(span[i]))
+ {
+ return i;
+ }
+ i++;
+ }
+ while ((uint)i < (uint)span.Length);
+ }
+
+ return -1;
+ }
+
+ /// <summary>Finds the next index of any character that does not match a word character.</summary>
+ [MethodImpl(MethodImplOptions.AggressiveInlining)]
+ internal static int IndexOfAnyExceptWordChar(this ReadOnlySpan<char> span)
+ {
+ int i = span.IndexOfAnyExcept(Utilities.s_asciiWordChars);
+ if ((uint)i < (uint)span.Length)
+ {
+ if (char.IsAscii(span[i]))
+ {
+ return i;
+ }
+
+ do
+ {
+ if (!Utilities.IsWordChar(span[i]))
+ {
+ return i;
+ }
+ i++;
+ }
+ while ((uint)i < (uint)span.Length);
+ }
+
+ return -1;
+ }
+
/// <summary>Determines whether the character is part of the [\w] set.</summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static bool IsWordChar(char ch)
(WordCategoriesMask & (1 << (int)CharUnicodeInfo.GetUnicodeCategory(ch))) != 0;
}
+ /// <summary>Finds the last index of any character that matches any character other than a Unicode digit.</summary>
+ [MethodImpl(MethodImplOptions.AggressiveInlining)]
+ internal static int LastIndexOfAnyExceptDigit(this ReadOnlySpan<char> span)
+ {
+ int i = span.LastIndexOfAnyExcept(Utilities.s_asciiDigits);
+ if (i >= 0)
+ {
+ if (char.IsAscii(span[i]))
+ {
+ return i;
+ }
+
+ do
+ {
+ if (!char.IsDigit(span[i]))
+ {
+ return i;
+ }
+ i--;
+ }
+ while (i >= 0);
+ }
+
+ return -1;
+ }
+
/// <summary>Pops 2 values from the backtracking stack.</summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static void StackPop(int[] stack, ref int pos, out int arg0, out int arg1)
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x03,
0xFE, 0xFF, 0xFF, 0x87, 0xFE, 0xFF, 0xFF, 0x07
};
+
+ /// <summary>Supports searching for characters in or not in "0123456789".</summary>
+ internal static readonly SearchValues<char> s_asciiDigits = SearchValues.Create("0123456789");
+
+ /// <summary>Supports searching for characters in or not in "\t\n\v\f\r ".</summary>
+ internal static readonly SearchValues<char> s_asciiWhiteSpace = SearchValues.Create("\t\n\v\f\r ");
+
+ /// <summary>Supports searching for characters in or not in "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz".</summary>
+ internal static readonly SearchValues<char> s_asciiWordChars = SearchValues.Create("0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz");
}
}
"^[a-f0-9]{32}$" (4920 uses)
[GeneratedRegex("^[a-f0-9]{32}$")]
// Match a lowercase hexadecimal digit exactly 32 times.
{
- if ((uint)slice.Length < 32 || slice.Slice(0, 32).ContainsAnyExcept(Utilities.s_asciiHexDigitsLower))
+ if ((uint)slice.Length < 32 || slice.Slice(0, 32).IndexOfAnyExcept(Utilities.s_asciiHexDigitsLower) >= 0)
{
return false; // The input didn't match.
}
For more diff examples, see https://gist.github.com/MihuBot/d53c23335fba4c7efa18feb532b317b3
JIT assembly changes
For a list of JIT diff regressions, see Regressions.md
Sample source code for further analysis
const string JsonPath = "RegexResults-1816.json";
if (!File.Exists(JsonPath))
{
await using var archiveStream = await new HttpClient().GetStreamAsync("https://mihubot.xyz/r/FJQdSsTA");
using var archive = new ZipArchive(archiveStream, ZipArchiveMode.Read);
archive.Entries.First(e => e.Name == "Results.json").ExtractToFile(JsonPath);
}
using FileStream jsonFileStream = File.OpenRead(JsonPath);
RegexEntry[] entries = JsonSerializer.Deserialize<RegexEntry[]>(jsonFileStream, new JsonSerializerOptions { IncludeFields = true })!;
Console.WriteLine($"Working with {entries.Length} patterns");
record KnownPattern(string Pattern, RegexOptions Options, int Count);
sealed class RegexEntry
{
public required KnownPattern Regex { get; set; }
public required string MainSource { get; set; }
public required string PrSource { get; set; }
public string? FullDiff { get; set; }
public string? ShortDiff { get; set; }
public (string Name, string Values)[]? SearchValuesOfChar { get; set; }
public (string[] Values, StringComparison ComparisonType)[]? SearchValuesOfString { get; set; }
}
See benchmark results at https://gist.github.com/MihuBot/fc3e1b56f57e271bcf91679c16b76299
Extend both the source generator (RegexGenerator.Emitter.cs) and RegexCompiler to vectorize regex loops for complex character classes that previously fell through to scalar character-by-character matching. This primarily benefits the extremely common \w+, \d+, \s+ patterns (and their negations), as well as Unicode category sets and sets with subtraction.

What changed
- TryEmitIndexOf (source generator): now always succeeds for set families by falling back to EmitIndexOfAnyCustomHelper, which generates a helper using vectorized SearchValues<char> for ASCII with a scalar MatchCharacterClass fallback for non-ASCII.
- EmitIndexOf / EmitIndexOfWithCharClassFallback (RegexCompiler): parallel change that emits inline IL for the same pattern: IndexOfAnyExcept(SearchValues) for fast ASCII scanning, then EmitMatchCharacterClass for non-ASCII fallback. The existing inline code from EmitFixedSet_LeftToRight was extracted into the reusable EmitIndexOfWithCharClassFallback method.
- CanEmitIndexOf: simplified to return true for all IsSetFamily nodes since EmitIndexOf can now handle them all.
- EmitIndexOfAnyCustomHelper: extended with negate and last parameters. Negation is applied at the boolean level (via CharInClass/MatchCharacterClass) rather than by toggling the set string's internal negation flag, which doesn't correctly negate subtraction sets due to how CharInClassIterative processes subtraction chains.
- Repeater path: changed from the fragile indexOfExpr.Replace("IndexOf", "Contains") pattern to the equivalent {indexOfExpr} >= 0.

Impact
Patterns like \w+, \d+, \s+, [\p{L}]+, [\w-[abc]]+ in atomic loops (which most become via auto-atomicity) and fixed-count repeaters now use vectorized search instead of scalar loops, for both RegexOptions.Compiled and source-generated regexes.
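The repeater change relies on ContainsAnyExcept(values) and IndexOfAnyExcept(values) >= 0 giving the same answer. A minimal sketch of that equivalence, assuming .NET 8 or later: the class and method names are hypothetical, and the contents of the lowercase hex set are an assumption based on the field name in the diff.

```csharp
using System;
using System.Buffers;

// Hypothetical sketch comparing the old and new forms of the repeater check.
static class RepeaterCheckSketch
{
    // Assumed contents for a lowercase hexadecimal digit set.
    private static readonly SearchValues<char> s_asciiHexDigitsLower = SearchValues.Create("0123456789abcdef");

    // Old form: true when some char falls outside the set.
    public static bool FailsViaContains(ReadOnlySpan<char> slice) =>
        slice.ContainsAnyExcept(s_asciiHexDigitsLower);

    // New form: the same condition written as an index comparison.
    public static bool FailsViaIndexOf(ReadOnlySpan<char> slice) =>
        slice.IndexOfAnyExcept(s_asciiHexDigitsLower) >= 0;
}
```

One plausible reason the string rewrite was fragile, stated here as an assumption rather than something the PR says: rewriting a custom helper call such as IndexOfAnyExceptWhiteSpace() by text replacement would yield a ContainsAnyExceptWhiteSpace() call for which no helper is emitted, whereas appending >= 0 works for any expression that returns an index.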