@AlisdairM
Contributor

We do not recognize universal-character-names when lexing character sequences for literals, and that should include d-char-sequences for raw string literals.

Note that d-char-sequences are limited to a subset of the basic character set that excludes the \ that would start a universal-character-name. If we did recognize UCNs here, however, that \ would be consumed as part of the universal-character-name, and the resulting character would still be ill-formed: either it is not a member of the basic character set, or it is a UCN denoting an element of the basic character set. Rather than create such an obscure error condition, it is simpler not to recognize UCNs in d-char-sequences, just as we do not in any other character sequence, and to diagnose a more consistent error.
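
For readers less familiar with the terms, here is a minimal sketch of the behaviour being discussed, assuming a conforming C++11-or-later compiler with a UTF-8 or ASCII-compatible literal encoding. The names `ordinary_text`, `raw_text`, `delimited_text`, and `check_raw_examples` are made up for illustration and do not come from the standard or from this pull request.

```cpp
#include <cassert>
#include <cstring>

// Ordinary string literal: \u0041 is a universal-character-name and
// denotes the single character 'A'.
constexpr char ordinary_text[] = "\u0041";

// Raw string literal: the r-char-sequence is taken verbatim, so the
// array holds the six characters \ u 0 0 4 1.
constexpr char raw_text[] = R"(\u0041)";

// The same raw literal with the d-char-sequence "delim" as delimiter.
// The delimiter is not part of the string's value.
constexpr char delimited_text[] = R"delim(\u0041)delim";

// A delimiter spelled with a backslash, such as R"\u0041(x)\u0041",
// is ill-formed because \ is not a permitted d-char, so it is left
// out of the compiled code here.

static_assert(sizeof(ordinary_text) - 1 == 1, "UCN becomes one character");
static_assert(ordinary_text[0] == 'A', "U+0041 is LATIN CAPITAL LETTER A");
static_assert(sizeof(raw_text) - 1 == 6, "raw literal keeps the backslash");
static_assert(sizeof(delimited_text) == sizeof(raw_text),
              "delimiter contributes no characters");

// Runtime spelling of the same checks, for use from a test driver.
inline void check_raw_examples() {
    assert(std::strcmp(raw_text, "\\u0041") == 0);
    assert(std::strcmp(delimited_text, raw_text) == 0);
    assert(std::strcmp(ordinary_text, "A") == 0);
}
```

The compile-time checks pass or fail when the translation unit is built; `check_raw_examples()` repeats them at run time.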

@jensmaurer
Member

Since the lexing grammar of neither a d-char-sequence nor an r-char-sequence recognizes UCNs (same for h-char-sequence and q-char-sequence), I think the proper approach is to strike all those redundant mentions instead of adding more to the list. (Having a note that highlights lexing constructs that are oblivious to UCNs is fine, though.) I thought I saw a request in that direction fly by somewhere recently, but I can't find it right now.

@AlisdairM
Contributor Author

I believe that if we struck these mentions, then raw string literals would transform UCNs, as this is the wording that exempts them from the UCN transformation. The reason we have universal-character-name in the c-char and s-char grammar is to revert this rule so that the transform is not performed.

In principle, we could remove c-char-sequence and s-char-sequence from this list, and also strike universal-character-name from their respective grammar, as the phase 3 rules would transform the UCNs to elements of the translation character set before the token grammar is addressed. I am not recommending that change, as we have a consistent treatment of character sequences here, with only d-char-sequence and n-char-sequence omitted.

As noted in my commit message, there is no normative impact from adding d-char-sequence here, as we are just simplifying the nature of the rule that makes such usage ill-formed. In the n-char-sequence case, I believe the current rules would allow embedding universal-character-names inside an n-char-sequence, and that could be valid if said UCN is one that is lexed as a c-char or s-char, e.g., "\N{LATIN SMALL LETTER \N{LATIN SMALL LETTER A}}". Hence, addressing the n-char-sequence case would strictly demand a Core issue.

@jensmaurer
Member

The reason we have universal-character-name separately in c-char and s-char is that we want to delay their interpretation until we initialize the string literal object in [lex.string] p10, even though p8 neuters the most obvious case where that could be exploited. Ah, it seems we can use a UCN to encode a new-line, but we can't have a literal new-line character in a string (according to the grammar for basic-s-char). If we replaced a suitable UCN with new-line in phase 3 for an s-char, we would produce an ill-formed s-char.
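
A small sketch of the new-line case mentioned above, assuming a C++11-or-later compiler with an ASCII-compatible literal encoding. The names `newline_via_ucn`, `newline_via_escape`, and `check_newline_ucn` are illustrative only.

```cpp
#include <cassert>
#include <cstring>

// The only s-char here is a universal-character-name designating
// U+000A (line feed). It stays a UCN while the literal is lexed and
// only becomes a new-line when the string object is initialized.
constexpr char newline_via_ucn[] = "\u000A";

// The conventional spelling, for comparison.
constexpr char newline_via_escape[] = "\n";

// A physical line break between the quotes of an ordinary string
// literal would not match basic-s-char, so that spelling is not
// shown as compiled code.

static_assert(sizeof(newline_via_ucn) == sizeof(newline_via_escape),
              "both literals hold one character plus the terminator");
static_assert(newline_via_ucn[0] == '\n', "U+000A is the new-line character");

// Runtime spelling of the same comparison, for use from a test driver.
inline void check_newline_ucn() {
    assert(std::strcmp(newline_via_ucn, newline_via_escape) == 0);
}
```

If the UCN were replaced by an actual new-line before the token was formed, the literal above would be lexed like the physical-line-break form in the comment, which is the problem described.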

The rule in lex.phases p3 is needed, though, because nothing otherwise matches and replaces UCNs in plain source code (outside of literals). That said, maybe we should expressly admit UCNs in the (lexing) grammar for identifier and use a simpler "not s-char, not c-char" rule in lex.phases p3, akin to lex.universal.char p1.
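
To illustrate UCNs in plain source code outside of literals, here is a minimal sketch assuming a compiler that accepts universal-character-names in identifiers (current GCC, Clang, and MSVC are believed to). The identifier and the function `read_cafe` are hypothetical.

```cpp
#include <cassert>

// \u00E9 designates LATIN SMALL LETTER E WITH ACUTE, so this declares
// a variable whose name is the four characters c, a, f, e-acute.
constexpr int caf\u00E9 = 42;

// Both spellings below refer to the same entity, because the UCN is
// replaced by the character it designates before identifiers are formed.
inline int read_cafe() {
    return caf\u00E9;
}

static_assert(caf\u00E9 == 42, "the UCN-spelled identifier names the variable");
```

Nothing in the identifier grammar itself mentions the backslash form here; the example relies on the UCN being handled before the identifier token is matched.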
