Open
Labels
decision needed (We need to choose something before working on this), question (Further information is requested), regex (meow is a substring of homeowner)
Description
The title states this as one question, but it's really two.
First is an actual choice: Should the portable character names in the POSIX standard be recognized and translated by `lookup_collatename()`?
Cons:
- These aren't actually names for locale-specific collating elements (which are defined in Chapter 7); they are symbolic names of characters for internal use in the POSIX standard.
- TR1 was deliberately changed to not mandate that these are recognized by `lookup_collatename()` (see the end of Section 2 of N1623).
Pro:
- Boost.Regex and the other standard libraries recognize these names as an extension.
- After `<regex>`: Properly parse and match collating symbols and equivalences #5392, these names are the final puzzle piece to make a few more libcxx tests pass.
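To see what a given implementation currently does with a portable character name, a minimal sketch like the following could be used. The helper names `resolve_collating_name` and `is_recognized` are hypothetical; they only wrap the standard `std::regex_traits<char>::lookup_collatename()`. Whether a name such as "hyphen" is translated or rejected depends on the standard library in use, which is exactly the choice under discussion.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not part of any library): asks the standard regex
// traits to resolve a collating element name such as "hyphen" or "a".
// The returned string is empty if the implementation rejects the name.
std::string resolve_collating_name(const std::string& name) {
    std::regex_traits<char> traits;
    return traits.lookup_collatename(name.begin(), name.end());
}

// Hypothetical helper: true if the implementation recognizes the name at all.
bool is_recognized(const std::string& name) {
    return !resolve_collating_name(name).empty();
}
```

Calling `is_recognized("hyphen")` on different standard libraries would show whether the POSIX portable character names are accepted there as an extension.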
The second question is a technical issue that I don't have an answer to yet: Can we actually access the set of locale-specific (multi-character) collating elements or recognize them in a reasonable way using some Windows API? Or is there some other reasonable approach to recognize locale-specific collating elements like "ch" in Czech or "dzs" in Hungarian?
Activity
`<regex>`: Properly parse and match collating symbols and equivalences #5392
`<regex>`: `regex_traits::transform_primary` should yield primary sort keys appropriate for the imbued locale #5444
muellerj2 commented on Jun 7, 2025
I have been looking a bit into the recognition of locale-specific collating elements:
It seems possible to tell from the sort key produced by `collate::transform()` (which is based on `LCMapStringEx`) whether a character sequence is a collating element: because collating elements behave like single characters, sort keys of the same length are produced for simple alphabetic characters and collating elements (on top of a few more structural similarities between the sort keys). The problem is that the structure of this sort key doesn't seem to be documented, so I'm not sure to what extent the stability of the sort key structure is guaranteed for `LCMapStringEx`, or whether this observation relies on an implementation detail subject to change. (This "sort key sniffing" approach already drives some locale-specific regex behavior in Boost.Regex and libc++, including recognition of collating elements in libc++; Boost.Regex instead hardcodes a few collating elements independent of the actual imbued locale.)
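The length comparison described above can be sketched roughly as follows. This is only an illustration of the idea, not the code used by libc++, Boost.Regex, or the STL: the function names `sort_key` and `looks_like_collating_element` and the choice of `'a'` as the reference character are assumptions, and the sketch checks only the key length, none of the other structural similarities. It also assumes a single-byte narrow encoding.

```cpp
#include <locale>
#include <string>

// Sort key that the locale's collate facet produces for `text`.
std::string sort_key(const std::locale& loc, const std::string& text) {
    const auto& coll = std::use_facet<std::collate<char>>(loc);
    return coll.transform(text.data(), text.data() + text.size());
}

// Hypothetical heuristic ("sort key sniffing"): treat `seq` as a collating
// element if its sort key is exactly as long as the sort key of a single
// simple alphabetic character. This relies on undocumented sort key
// structure and may therefore break if that structure changes.
bool looks_like_collating_element(const std::locale& loc,
                                  const std::string& seq,
                                  char reference = 'a') {
    if (seq.empty()) {
        return false;
    }
    const std::string ref_key = sort_key(loc, std::string(1, reference));
    return sort_key(loc, seq).size() == ref_key.size();
}
```

In the classic "C" locale a two-character sequence such as "ch" should not pass this check, whereas under a Czech locale (if one is installed and the sort keys behave as described) it would be expected to.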