<regex>: What names can and should regex_traits::lookup_collatename() recognize? #5393

Description

@muellerj2
Contributor

The title states this as one question, but it's really two.

First is an actual choice: Should the portable character names in the POSIX standard be recognized and translated by lookup_collatename()?

Cons:

  • These aren't actually names for locale-specific collating elements (which are defined in Chapter 7); they are symbolic names of characters for internal use in the POSIX standard.
  • TR1 was deliberately changed to not mandate that these are recognized by lookup_collatename() (see the end of Section 2 of N1623).

Pro:

The second question is a technical issue that I don't have an answer to yet: Can we actually access the set of locale-specific (multi-character) collating elements or recognize them in a reasonable way using some Windows API? Or is there some other reasonable approach to recognize locale-specific collating elements like "ch" in Czech or "dzs" in Hungarian?
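For the first question, it may help to check what a given standard library does today. The following is a minimal sketch, not a statement of what any implementation must do: `probe_collatename` and `bracket_symbol_compiles` are hypothetical helper names, and the results for portable names such as "tilde" or "hyphen" are implementation-specific.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns whatever regex_traits<char> yields for `name`.
// An empty result means the implementation did not recognize the name as a
// collating element.
std::string probe_collatename(const std::string& name) {
    std::regex_traits<char> traits;  // default-constructed traits use the global locale
    return traits.lookup_collatename(name.begin(), name.end());
}

// Hypothetical helper: reports whether the bracket expression `[[.name.]]`
// is accepted by std::regex on this implementation.
bool bracket_symbol_compiles(const std::string& name) {
    try {
        std::regex re("[[." + name + ".]]");
        (void)re;
        return true;
    } catch (const std::regex_error&) {
        return false;  // typically error_collate when the name is rejected
    }
}
```

Calling `probe_collatename("tilde")` yields "~" on implementations that translate the POSIX portable character names. On other implementations the result may be empty or something else, which is exactly the choice under discussion.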

Activity

added the labels decision needed (We need to choose something before working on this) and regex (meow is a substring of homeowner) on Apr 9, 2025

muellerj2 commented on Jun 7, 2025

@muellerj2
Contributor, Author

I have been looking a bit into the recognition of locale-specific collating elements:

  • I failed to find a simple Windows API to query collating elements of a locale.
  • In ICU, there is the concept of contractions, which is similar to POSIX's collating elements. ICU supports querying the list of contractions for a specific locale, so maybe we could solve this by querying ICU. But do we want to make regex dependent on ICU?
  • It's possible to deduce indirectly from collate::transform() (which is based on LCMapStringEx) whether a character sequence is a collating element: Because collating elements behave like single characters, a collating element and a simple alphabetic character produce sort keys of the same length (on top of a few more structural similarities between the sort keys). The problem is that the structure of this sort key doesn't seem to be documented, so I'm not sure to what extent the stability of the sort key structure is guaranteed for LCMapStringEx, or whether this observation relies on an implementation detail that is subject to change. (This "sort key sniffing" approach already drives some locale-specific regex behavior in Boost.Regex and libc++, including recognition of collating elements in libc++; Boost.Regex instead hardcodes a few collating elements independent of the actual imbued locale.)
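To make the "sort key sniffing" idea concrete, here is a rough sketch of the length comparison only. It is an illustration under assumptions, not the logic used by libc++ or Boost.Regex: the helper names `sort_key_length` and `looks_like_collating_element` are hypothetical, the reference character 'a' is an arbitrary choice, and the sketch ignores the additional structural similarities mentioned above. It also uses `char` for brevity, which is a simplification for multi-byte encodings.

```cpp
#include <cstddef>
#include <locale>
#include <string>

// Hypothetical helper: length of the sort key that the locale's collate facet
// produces for `text`.
std::size_t sort_key_length(const std::locale& loc, const std::string& text) {
    const auto& coll = std::use_facet<std::collate<char>>(loc);
    return coll.transform(text.data(), text.data() + text.size()).size();
}

// Hypothetical heuristic: a multi-character sequence is treated as a candidate
// collating element if its sort key is as long as the sort key of a single
// reference letter. This relies on undocumented sort key structure.
bool looks_like_collating_element(const std::locale& loc,
                                  const std::string& candidate,
                                  char reference = 'a') {
    if (candidate.size() < 2) {
        return false;  // single characters are not multi-character elements
    }
    return sort_key_length(loc, candidate) ==
           sort_key_length(loc, std::string(1, reference));
}
```

With the classic "C" locale, "ch" is not reported as a collating element because its sort key is longer than that of a single letter. Whether a Czech locale reports "ch" this way depends on the platform's collation data and on the sort key layout staying stable, which is the open concern.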
Metadata

Assignees

No one assigned

    Labels

    decision needed (We need to choose something before working on this), question (Further information is requested), regex (meow is a substring of homeowner)

        Participants

        @StephanTLavavej, @muellerj2

          `<regex>`: What names can and should `regex_traits::lookup_collatename()` recognize? · Issue #5393 · microsoft/STL