
Configuring the locale for language and encoding‐aware operations

IS4 edited this page Jul 20, 2024 · 14 revisions

PawnPlus can make use of the system's cultural settings (the "locale") through mechanisms exposed by std::locale in C++, which it uses for formatting and for character conversion and comparison.

Overview

When loaded, the plugin sets the global locale (via std::locale::global) to the invariant one (std::locale::classic, commonly identified as "C" or "POSIX"), so any locale previously set through the server or environment variables is ignored. The global locale can then be modified through pp_locale. Note that other C++ modules may share the same global locale, so this setting affects them as well. The C locale (used by modules in C and set by std::setlocale) is not affected.
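As a rough illustration of the C++ mechanism mentioned above, the following sketch resets the global C++ locale to the invariant one and reads it back. The function names are hypothetical and this is not the plugin's source; it only uses std::locale::global and std::locale::classic from the standard library.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: replaces the process-wide C++ locale with the
// invariant "C" locale and returns the name of the locale that was
// active before the call.
std::string reset_global_locale_to_classic()
{
    // std::locale::global returns the previous global locale.
    std::locale previous = std::locale::global(std::locale::classic());
    return previous.name();
}

// A default-constructed std::locale is a copy of the current global locale.
std::string current_global_locale_name()
{
    return std::locale().name();
}
```

Any C++ module in the same process that default-constructs a std::locale after such a call would observe the same global locale, which is why the setting is shared.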

The locale can be applied or changed for any number of distinct categories, represented by locale_category:

enum locale_category (<<= 1)
{
    locale_none = 0,
    locale_collate = 1,
    locale_ctype,
    locale_monetary,
    locale_numeric,
    locale_time,
    locale_messages,
    locale_all = -1,
}
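Because the Pawn enum uses the (<<= 1) increment, each category after locale_collate is the previous value shifted left by one bit, so the categories behave as flags. The following C++ mirror of the declaration is only a sketch to make the resulting values explicit; has_category is a hypothetical helper, not a PawnPlus function.

```cpp
#include <cstdint>

// C++ mirror of the Pawn declaration, with the values that follow from
// the (<<= 1) increment written out explicitly.
enum locale_category : std::int32_t
{
    locale_none     = 0,
    locale_collate  = 1 << 0, // 1
    locale_ctype    = 1 << 1, // 2
    locale_monetary = 1 << 2, // 4
    locale_numeric  = 1 << 3, // 8
    locale_time     = 1 << 4, // 16
    locale_messages = 1 << 5, // 32
    locale_all      = -1,     // all bits set
};

// Hypothetical helper: tests whether a combined mask includes a category.
constexpr bool has_category(std::int32_t mask, locale_category category)
{
    return (mask & category) != 0;
}
```

Under this reading, a mask such as locale_ctype | locale_numeric selects exactly those two categories, and locale_all matches every category.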

These categories affect the following areas of the plugin:

  • locale_collate controls character equivalence and comparisons. It is used for regular expressions when regex_collate is set (in which case character ranges such as [a-z] use the order of characters imposed by the locale), and by str_collation_key/str_set_collation_key.
  • locale_ctype specifies character categories (letter, digit, etc.) as well as lowercase and uppercase conversions. It is used for str_to_lower/str_set_to_lower, str_to_upper/str_set_to_upper, and regular expressions, either when character classes like \d or [[:alpha:]] are used, or with regex_icase.
  • locale_numeric defines how numbers are formatted, for example which character is used for the decimal point (e.g. . or ,). It is used by str_format and similar, including tag_op_string and tag_op_format.
  • locale_monetary and locale_time affect the d and t format selectors.
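To illustrate how a single category can change independently of the others at the std::locale level, the sketch below builds a locale that is identical to the classic one except for its numeric punctuation. The facet and function names are hypothetical; a real system locale would supply its own facets instead of this hand-written one.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical numeric facet that uses ',' as the decimal point.
struct comma_decimal_point : std::numpunct<char>
{
    char do_decimal_point() const override { return ','; }
};

// Builds the classic locale with only the numeric punctuation replaced.
// The locale takes ownership of the facet allocated with new.
std::locale classic_with_comma()
{
    return std::locale(std::locale::classic(), new comma_decimal_point);
}

// Formats a number through a stream imbued with that locale.
std::string format_with_comma(double value)
{
    std::ostringstream out;
    out.imbue(classic_with_comma());
    out << value;
    return out.str();
}
```

Character classification and case conversion in such a locale still come from the classic locale, which mirrors the idea that locale_numeric and locale_ctype can be set separately.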

To make the script encoding-aware, only locale_ctype is necessary. Some functions also allow entering the encoding manually, and it can be used as a part of a regular expression or format string.

Locale identifier format

Functions taking a locale or encoding use a unified format to identify it:

encoding+locale name;parameters;…|…

The whole identifier consists of |-separated alternatives, which are tried in order until an alternative's locale name identifies a valid system locale; the encoding and parameters of that alternative are then used. Any of the components may be omitted, along with their separators (when the parameters are omitted, the preceding ; must be omitted too).
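The structure of the identifier can be sketched as a small parser. This is not the plugin's parser: the type and function names are hypothetical, and the sketch assumes that a part without + is a locale name with the encoding omitted, which the text above does not spell out.

```cpp
#include <string>
#include <vector>

// One |-separated alternative of a locale identifier.
struct locale_alternative
{
    std::string encoding;                // empty when omitted
    std::string name;                    // locale name, may be empty
    std::vector<std::string> parameters; // ;-separated options
};

// Splits text on a separator; always returns at least one element.
static std::vector<std::string> split(const std::string &text, char separator)
{
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    for (;;) {
        std::string::size_type pos = text.find(separator, start);
        if (pos == std::string::npos) {
            parts.push_back(text.substr(start));
            return parts;
        }
        parts.push_back(text.substr(start, pos - start));
        start = pos + 1;
    }
}

// Parses "encoding+locale name;parameters;...|..." into alternatives.
// Assumption: without '+', the whole head is taken as the locale name.
std::vector<locale_alternative> parse_locale_identifier(const std::string &identifier)
{
    std::vector<locale_alternative> result;
    for (const std::string &alternative : split(identifier, '|')) {
        std::vector<std::string> fields = split(alternative, ';');
        locale_alternative item;
        const std::string &head = fields[0];
        std::string::size_type plus = head.find('+');
        if (plus == std::string::npos) {
            item.name = head;
        } else {
            item.encoding = head.substr(0, plus);
            item.name = head.substr(plus + 1);
        }
        item.parameters.assign(fields.begin() + 1, fields.end());
        result.push_back(item);
    }
    return result;
}
```

For instance, parsing utf8+cs_CZ.cp1250;trunc|cs-CZ with this sketch yields two alternatives, the first with encoding utf8 and one parameter, the second with only a locale name.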

Encoding

Encoding may be one of the following:

  • ansi ‒ strings use the narrow (8-bit) character set defined by the locale (commonly referred to as the ANSI encoding). Only codes 0‒255 are assigned. Multi-byte encodings (where a character may be stored in multiple cells) are permitted.
  • unicode ‒ strings use the wide character set defined by the locale. This is generally equivalent to utf16 on Windows, and utf32 on Linux.
  • utf8 ‒ strings use the UTF-8 encoding ‒ characters 0‒127 are encoded directly, while higher characters are broken into multiple cells taking the range 128‒255.
  • utf16 ‒ strings use the UCS-2 or UTF-16 encoding. Only code units 0‒0xFFFF are assigned.
  • utf32 ‒ strings use the UTF-32 encoding, in the range 0‒0x10FFFF.

If omitted, it defaults to ansi. "Encoding" in this case refers to the in-memory representation as cells, not bytes, so a single UTF-32 character still takes one cell and not four.
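The difference between cells and bytes can be made concrete with a sketch that encodes one code point as UTF-8 cells. The cell alias and the function name are hypothetical, and no validation of the input is attempted.

```cpp
#include <cstdint>
#include <vector>

using cell = std::int32_t; // stand-in for a Pawn cell

// Encodes one code point as UTF-8, storing each code unit in its own cell.
// Assumes the code point is at most 0x10FFFF; no validation is done.
std::vector<cell> utf8_cells(std::uint32_t cp)
{
    std::vector<cell> out;
    if (cp < 0x80) {
        out.push_back(static_cast<cell>(cp)); // ASCII is stored directly
    } else if (cp < 0x800) {
        out.push_back(static_cast<cell>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<cell>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<cell>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<cell>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<cell>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<cell>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<cell>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<cell>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<cell>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

With this sketch, U+010D (č) occupies two cells in UTF-8 (0xC4, 0x8D), whereas in UTF-32 the same character would be the single cell 0x10D.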

Note that using any encoding other than ansi or unicode outside of str_convert or str_set_convert does not bring any improvement to character manipulation. UTF is implemented only for compatibility in such cases, always falling back to the system's native unicode support.

Character conversions or comparisons are meaningful only for characters occupying a single cell. This has these implications:

  • UTF-8 has access only to ASCII characters (0‒127). Multi-byte characters are opaque to all operations (likewise for all general multi-byte encodings).
  • UTF-16 does not recognize surrogate pairs as single characters, being limited to the Basic Multilingual Plane.
  • UTF-32 on Windows does not recognize any character outside of the BMP either, as no facilities are provided to access such characters.
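The UTF-16 limitation follows from how characters above U+FFFF are stored. The sketch below (hypothetical names, no input validation) splits such a code point into its surrogate pair; since each half lands in a separate cell, an operation that looks at one cell at a time never sees the whole character.

```cpp
#include <cstdint>
#include <vector>

using cell = std::int32_t; // stand-in for a Pawn cell

// Encodes one code point as UTF-16, one code unit per cell.
// Assumes the code point is at most 0x10FFFF; no validation is done.
std::vector<cell> utf16_cells(std::uint32_t cp)
{
    std::vector<cell> out;
    if (cp < 0x10000) {
        out.push_back(static_cast<cell>(cp)); // BMP: a single cell
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<cell>(0xD800 | (cp >> 10)));   // high surrogate
        out.push_back(static_cast<cell>(0xDC00 | (cp & 0x3FF))); // low surrogate
    }
    return out;
}

// Tells whether a cell holds a surrogate code unit (U+D800 to U+DFFF).
bool is_surrogate(cell value)
{
    return value >= 0xD800 && value <= 0xDFFF;
}
```

For example, U+1F600 becomes the two cells 0xD83D and 0xDE00 in this sketch, neither of which is a letter or digit on its own.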

Parameters

Parameters are ;-separated options that affect the concrete behaviour of functions. They can be one of the following:

  • trunc ‒ this changes the semantics of cells storing characters outside of the range defined by the encoding. By default, such cells are treated as opaque by case conversion and comparison functions. When this parameter is set, they are truncated to the code unit bit size. For example, a cell value 0x8800 | 'A' is not treated as a letter by default in the ANSI encoding in the C locale, but with ;trunc, it is recognized as a letter, and may be converted to lowercase 0x8800 | 'a'.
  • ucs ‒ switches from UTF-16 to UCS-2-compatible behaviour. In practical terms, this means that surrogate characters are treated as regular characters: when converting from UTF-16 to UTF-8, surrogate pairs take up two characters; when using UTF-32, U+D800 to U+DFFF (the surrogates) are valid individually.
  • bom ‒ when converting from a Unicode encoding, the byte order mark (BOM) is recognized (for UTF-16, it may be used to specify the endianness); when converting to a Unicode encoding, the BOM is generated.
  • maxrange ‒ by default, only characters up to U+10FFFF, as defined by Unicode, are permitted. With this option set, UTF-32 accepts even characters outside this range.
  • native ‒ for conversion between UTF-8 and UTF-16/32, use the locale to perform the conversion instead of a unified implementation. There is generally no reason to use this parameter, since there should be no locale-based variations in the encoding, and other Unicode-affecting parameters may not be respected for the conversion.
  • fallback=X ‒ sets X as the fallback character, used when conversion fails. By default, ? is used. This character must be within 0‒255.

Only trunc and ucs make sense in pp_locale and can be globally set. The other parameters need to be specified explicitly during each conversion.
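The effect of trunc on case conversion, as described for the 0x8800 | 'A' example, can be sketched as follows. This is an assumption-laden model rather than the plugin's code: it hard-codes the 8-bit ANSI code unit size and the C locale, and to_lower_cell is a hypothetical name.

```cpp
#include <cstdint>
#include <locale>

using cell = std::int32_t; // stand-in for a Pawn cell

// Lowercases one cell under the ANSI encoding in the C locale.
// Without trunc, a cell with bits set above the 8-bit code unit is opaque
// and returned unchanged; with trunc, only the low 8 bits are converted
// and the upper bits are carried over.
cell to_lower_cell(cell value, bool trunc)
{
    const cell low_mask = 0xFF;
    const cell high = value & ~low_mask;
    if (high != 0 && !trunc) {
        return value; // out of range for the encoding: left as-is
    }
    char unit = static_cast<char>(static_cast<unsigned char>(value & low_mask));
    char lowered = std::tolower(unit, std::locale::classic());
    return high | static_cast<cell>(static_cast<unsigned char>(lowered));
}
```

In this model, to_lower_cell(0x8800 | 'A', false) leaves the cell untouched, while passing true produces 0x8800 | 'a'.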

Determining the locale name

The locale name used by locale-aware functions needs to be pre-defined. It may be empty ("") to use the default locale, "C" to use the invariant locale, "*" to use the system locale, or any other system-provided locale name, which can be found on POSIX systems by running locale -a.

What constitutes the "default" varies based on the function. For pp_locale, an empty name is resolved by the system (with the default encoding set to ansi) and is thus equivalent to "*", while all other functions use it to refer to the previously set global locale. To determine the name of the current locale or of a specific one, you can use pp_locale_name.

As the particular locale format is system specific, it is necessary to use at least two locale names in the form locale1|locale2 if the script is to be run on both Windows and Linux.
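The idea of trying several names until one is accepted can be sketched in standard C++, where constructing a std::locale from an unknown name throws std::runtime_error. The function name is hypothetical and this is only an illustration of the fallback principle, not how the plugin resolves alternatives.

```cpp
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Returns the index of the first name the C++ runtime accepts as a locale,
// or -1 when none of the names is available on this system.
int first_available_locale(const std::vector<std::string> &names)
{
    for (std::size_t i = 0; i < names.size(); ++i) {
        try {
            std::locale candidate(names[i].c_str()); // throws if unknown
            (void)candidate;
            return static_cast<int>(i);
        } catch (const std::runtime_error &) {
            // Not available here; try the next alternative.
        }
    }
    return -1;
}
```

Called with a Linux-style name followed by a Windows-style name, such a function would pick whichever the current platform recognizes, which is the same reasoning behind writing identifiers as locale1|locale2.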

Windows

On Windows, the locale name may take the form of a language tag (with components in the form language[-Script][-REGION[_sort order]]; see here for a list of supported languages), or the form language[_country/region[.code page]], which combines a language name, a country/region name, and the code page used when interpreting ANSI text. A locale can also be identified by the code page alone, in the form .code page, which uses the system default locale with that specific code page; for example, .1250 is sufficient to select the Windows-1250 code page when dealing only with encodings.

As an example, the locale names cs-CZ, Czech, Czech_Czech, Czech_Czech Republic, or Czech_Czech Republic.1250 all correspond to the Czech language and regional settings, and use the encoding Windows-1250.

Linux

On Linux, the set of supported locales can be extended by running the localedef command with pre-existing language and character mapping definitions.

For example, localedef -i cs_CZ -f CP1250 cs_CZ.CP1250 creates a new locale named cs_CZ.CP1250 using the Windows-1250 encoding.

Examples

cs_CZ.cp1250|cs-CZ
A locale identifier with two alternatives. If `cs_CZ.cp1250` is not found, `cs-CZ` is attempted next.