Configuring the locale for language and encoding‐aware operations
PawnPlus can make use of the system's cultural settings (the "locale") through the mechanisms exposed by `std::locale` in C++, which are used for formatting, character conversion, and comparison.
When loaded, the plugin sets the global locale (via `std::locale::global`) to the invariant one (`std::locale::classic`, commonly identified as `"C"` or `"POSIX"`), so any locale previously set through the server or environment variables is ignored. The global locale can then be modified through `pp_locale`. Note that other C++ modules may share the same global locale, so this setting affects them as well. The C locale (used by modules written in C and set by `std::setlocale`) is not affected.
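
The reset to the invariant locale described above can be illustrated with plain standard C++. This is only a sketch of the `std::locale` calls named in the text, not the plugin's own code; the helper names `reset_global_to_classic` and `global_locale_name` are hypothetical.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: install the invariant "C" locale as the C++ global
// locale. std::locale::global returns the locale that was global before.
std::locale reset_global_to_classic()
{
    return std::locale::global(std::locale::classic());
}

// Hypothetical helper: a default-constructed std::locale is a copy of the
// current global locale, so its name identifies what is currently set.
std::string global_locale_name()
{
    return std::locale().name();
}
```

After `reset_global_to_classic()` has run, `global_locale_name()` reports `"C"` regardless of what the environment requested beforehand.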
The locale can be applied or changed for any number of distinct categories, represented by `locale_category`:
enum locale_category (<<= 1)
{
    locale_none = 0,
    locale_collate = 1,
    locale_ctype,
    locale_monetary,
    locale_numeric,
    locale_time,
    locale_messages,
    locale_all = -1,
}
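
Because the enum uses the `(<<= 1)` increment, each named category after `locale_collate` doubles the previous value, so every category occupies its own bit. The sketch below mirrors those values in C++ to make them explicit; the C++ enum and the `has_category` helper are illustrative assumptions, not part of PawnPlus.

```cpp
#include <cstdint>

// Hypothetical C++ mirror of the Pawn enum, with the values produced by the
// (<<= 1) increment rule written out explicitly.
enum locale_category : std::int32_t
{
    locale_none     = 0,
    locale_collate  = 1,    // 1 << 0
    locale_ctype    = 2,    // 1 << 1
    locale_monetary = 4,    // 1 << 2
    locale_numeric  = 8,    // 1 << 3
    locale_time     = 16,   // 1 << 4
    locale_messages = 32,   // 1 << 5
    locale_all      = -1    // every bit set
};

// Hypothetical helper: true when the mask includes the given category.
constexpr bool has_category(std::int32_t mask, locale_category category)
{
    return (mask & category) != 0;
}
```

With these values, a mask such as `locale_ctype | locale_numeric` (which equals 10) selects exactly those two categories, while `locale_all` matches every category.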
These categories affect the following areas of the plugin:
- `locale_collate` controls character equivalence and comparisons. It is used for regular expressions when `regex_collate` is set (in which case character ranges such as `[a-z]` use the character order imposed by the locale), and by `str_collation_key`/`str_set_collation_key`.
- `locale_ctype` specifies character categories (letter, digit, etc.) as well as lowercase and uppercase conversions. It is used for `str_to_lower`/`str_set_to_lower`, `str_to_upper`/`str_set_to_upper`, and regular expressions, either when character classes like `\d` or `[[:alpha:]]` are used, or with `regex_icase`.
- `locale_numeric` defines how numbers are formatted, for example which character is used for the decimal point (e.g. `.` or `,`). It is used by `str_format` and similar functions, including `tag_op_string` and `tag_op_format`.
- `locale_monetary` and `locale_time` affect the `d` and `t` format selectors.
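
To illustrate what the numeric category changes, the following standard C++ sketch formats the same value under the invariant locale and under a locale whose decimal point is a comma. The `comma_decimal` facet is a hypothetical stand-in for a real system locale, used so that the example does not depend on which locales are installed; it does not show how `str_format` is implemented.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet standing in for a system locale that uses a comma as
// the decimal separator.
struct comma_decimal : std::numpunct<char>
{
protected:
    char do_decimal_point() const override { return ','; }
};

// Build a locale that is the invariant one except for the numeric facet.
// The locale object takes ownership of the facet.
std::locale comma_locale()
{
    return std::locale(std::locale::classic(), new comma_decimal);
}

// Format a number using the numeric rules of the given locale.
std::string format_with(const std::locale &loc, double value)
{
    std::ostringstream out;
    out.imbue(loc);
    out << value;
    return out.str();
}
```

Here `format_with(std::locale::classic(), 3.5)` yields `3.5`, while `format_with(comma_locale(), 3.5)` yields `3,5`.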
To make the script encoding-aware, only `locale_ctype` is necessary. Some functions also allow entering the encoding manually, and it can be used as a part of a regular expression or format string.
Functions taking a locale or encoding use a unified format to identify it:
encoding+locale name;parameters;…|…
The whole identifier consists of `|`-separated alternatives, which are tried in order until the current alternative's locale name identifies a valid system locale (in that case, the encoding and parameters of that alternative are used). Any of the components may be omitted, including their separators (in the case of parameters, the preceding `;` must be omitted too).
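
The structure of such an identifier can be sketched with a small parser. This is an illustration of the format as described above, not the plugin's actual parsing code: the `locale_alternative` type and the function names are hypothetical, an encoding is assumed to be present only when a `+` appears, and separators inside parameter values (such as a `fallback` character of `;` or `|`) are not handled.

```cpp
#include <string>
#include <vector>

// Hypothetical representation of one alternative; omitted parts stay empty.
struct locale_alternative
{
    std::string encoding;
    std::string name;
    std::vector<std::string> params;
};

// Split text on a separator, keeping empty fields.
std::vector<std::string> split(const std::string &text, char sep)
{
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    for (;;)
    {
        std::string::size_type end = text.find(sep, start);
        if (end == std::string::npos)
        {
            parts.push_back(text.substr(start));
            return parts;
        }
        parts.push_back(text.substr(start, end - start));
        start = end + 1;
    }
}

// Parse "encoding+locale name;parameters;...|..." into its alternatives.
std::vector<locale_alternative> parse_identifier(const std::string &spec)
{
    std::vector<locale_alternative> result;
    for (const std::string &alt : split(spec, '|'))
    {
        std::vector<std::string> fields = split(alt, ';');
        locale_alternative parsed;
        const std::string &head = fields[0];   // split never returns an empty vector
        std::string::size_type plus = head.find('+');
        if (plus != std::string::npos)
        {
            parsed.encoding = head.substr(0, plus);
            parsed.name = head.substr(plus + 1);
        }
        else
        {
            parsed.name = head;   // assumption: no '+' means no encoding given
        }
        parsed.params.assign(fields.begin() + 1, fields.end());
        result.push_back(parsed);
    }
    return result;
}
```

For an input such as `utf8+cs_CZ.cp1250;trunc|cs-CZ`, this produces two alternatives: the first with encoding `utf8`, name `cs_CZ.cp1250`, and one parameter, the second with only the name `cs-CZ`.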
Encoding may be one of the following:
- `ansi` ‒ strings use the narrow (8-bit) character set defined by the locale (commonly referred to as the ANSI encoding). Only codes 0‒255 are assigned. Multi-byte encodings (where a character may be stored in multiple cells) are permitted.
- `unicode` ‒ strings use the wide character set defined by the locale. This is generally equivalent to `utf16` on Windows, and `utf32` on Linux.
- `utf8` ‒ strings use the UTF-8 encoding ‒ characters 0‒127 are encoded directly, while higher characters are broken into multiple cells taking the range 128‒255.
- `utf16` ‒ strings use the UCS-2 or UTF-16 encoding. Only code units 0‒0xFFFF are assigned.
- `utf32` ‒ strings use the UTF-32 encoding, in the range 0‒0x10FFFF.
If omitted, it defaults to `ansi`. "Encoding" in this case refers to the in-memory representation as cells, not bytes, so a single UTF-32 character still takes one cell and not four.
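
The difference between cells and bytes can be made concrete with a short sketch of the standard UTF-8 encoding rules, storing one code unit per cell. The `cell` alias and both function names are assumptions made for illustration, and the sketch does not validate its input (surrogates and values above U+10FFFF are not rejected).

```cpp
#include <cstdint>
#include <vector>

// A Pawn cell is modelled here as a 32-bit integer (an assumption).
using cell = std::int32_t;

// Encode one code point as UTF-8, one code unit per cell.
std::vector<cell> utf8_cells(std::uint32_t cp)
{
    if (cp < 0x80)
    {
        return {cell(cp)};   // 0-127 are stored directly
    }
    if (cp < 0x800)
    {
        return {cell(0xC0 | (cp >> 6)), cell(0x80 | (cp & 0x3F))};
    }
    if (cp < 0x10000)
    {
        return {cell(0xE0 | (cp >> 12)), cell(0x80 | ((cp >> 6) & 0x3F)),
                cell(0x80 | (cp & 0x3F))};
    }
    return {cell(0xF0 | (cp >> 18)), cell(0x80 | ((cp >> 12) & 0x3F)),
            cell(0x80 | ((cp >> 6) & 0x3F)), cell(0x80 | (cp & 0x3F))};
}

// In UTF-32, every code point occupies exactly one cell.
std::vector<cell> utf32_cells(std::uint32_t cp)
{
    return {cell(cp)};
}
```

For example, U+010D (`č`) takes the two cells `0xC4` and `0x8D` in UTF-8, but a single cell in UTF-32.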
Note that using any encoding other than `ansi` or `unicode` outside of `str_convert` or `str_set_convert` does not bring any improvement to character manipulation. In such cases, UTF is implemented only for compatibility and always falls back to the system's native `unicode` support.
Character conversions or comparisons are meaningful only for characters occupying a single cell. This has these implications:
- UTF-8 has access only to ASCII characters (0‒127). Multi-byte characters are opaque to all operations (likewise for all general multi-byte encodings).
- UTF-16 does not recognize surrogate pairs as single characters, being limited to the Basic Multilingual Plane.
- UTF-32 on Windows does not recognize any character outside of the BMP either, as no facilities are provided to access such characters.
Parameters are `;`-separated options that affect the concrete behaviour of functions. They can be one of the following:
- `trunc` ‒ this changes the semantics of cells storing characters outside of the range defined by the encoding. By default, such cells are treated as opaque by case conversion and comparison functions. When this parameter is set, they are truncated to the code unit bit size. For example, a cell value `0x8800 | 'A'` is not treated as a letter by default in the ANSI encoding in the `C` locale, but with `;trunc`, it is recognized as a letter, and may be converted to lowercase `0x8800 | 'a'`.
- `ucs` ‒ switches from UTF-16 to UCS-2-compatible behaviour. In practical terms, this means that surrogate characters are treated as regular characters: when converting from UTF-16 to UTF-8, surrogate pairs take up two characters; when using UTF-32, `U+D800` to `U+DFFF` (surrogates) are valid individually.
- `bom` ‒ when converting from a Unicode encoding, the byte order mark (BOM) is recognized (for UTF-16, it may be used to specify the endianness); when converting to a Unicode encoding, the BOM is generated.
- `maxrange` ‒ by default, only characters up to U+10FFFF, as defined by Unicode, are permitted. With this option set, UTF-32 accepts even characters outside this range.
- `native` ‒ for conversion between UTF-8 and UTF-16/32, use the locale to perform the conversion instead of a unified implementation. There is generally no reason to use this parameter, since there should be no locale-based variations in the encoding, and other Unicode-affecting parameters may not be respected for the conversion.
- `fallback=X` ‒ sets X as the fallback character, used when conversion fails. By default, `?` is used. This character must be within 0‒255.
Only `trunc` and `ucs` make sense in `pp_locale` and can be set globally. The other parameters need to be specified explicitly during each conversion.
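
The effect of `trunc` on the `0x8800 | 'A'` example can be sketched as follows for an 8-bit encoding. This is a guess at the semantics based only on the description above, in particular on the upper bits being preserved in the result; the `cell` alias and the `to_lower_cell` function are hypothetical and do not reflect the plugin's implementation.

```cpp
#include <cstdint>
#include <locale>

// A Pawn cell is modelled here as a 32-bit integer (an assumption).
using cell = std::int32_t;

// Hypothetical lowercase conversion of one cell in an 8-bit encoding.
cell to_lower_cell(cell value, bool trunc,
                   const std::locale &loc = std::locale::classic())
{
    const std::uint32_t bits = static_cast<std::uint32_t>(value);
    const std::uint32_t high = bits & ~0xFFu;   // bits outside the code unit
    if (high != 0 && !trunc)
    {
        return value;   // out of range: left untouched (opaque) by default
    }
    // Convert only the low 8 bits, then restore the upper bits unchanged.
    const char narrow = static_cast<char>(bits & 0xFFu);
    const char lowered = std::tolower(narrow, loc);
    return static_cast<cell>(high | static_cast<unsigned char>(lowered));
}
```

Under these assumptions, `to_lower_cell(0x8800 | 'A', false)` returns the cell unchanged, while `to_lower_cell(0x8800 | 'A', true)` returns `0x8800 | 'a'`.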
The locale name used by locale-aware functions needs to be pre-defined. It may be empty (`""`) to use the default locale, `"C"` to use the invariant locale, `"*"` to use the system locale, or any other system-provided locale name, which can be found on POSIX systems by running `locale -a`.
What constitutes the "default" varies based on the function. For `pp_locale`, an empty name is resolved by the system (with the default encoding set to `ansi`) and is thus equivalent to `"*"`, while all other functions use it to refer to the previously set global locale. To determine the name of the current or a specific locale, you can use `pp_locale_name`.
As the particular locale format is system-specific, it is necessary to use at least two locale names in the form `locale1|locale2` if the script is to be run on both Windows and Linux.
On Windows, the locale name may take either the form of a language tag (with components in the form `language[-Script][-REGION[_sort order]]`; see here for a list of supported languages), or the form `language[_country/region[.code page]]`, using a particular language name, a country/region name, and the code page used when interpreting ANSI text. A locale can also be identified just by its code page, in the form `.code page`, which uses the system default locale with that specific code page; for example, `.1250` is sufficient when only the encoding (here Windows-1250) matters.
As an example, the locale names `cs-CZ`, `Czech`, `Czech_Czech`, `Czech_Czech Republic`, or `Czech_Czech Republic.1250` all correspond to the Czech language and regional settings, and use the encoding Windows-1250.
On Linux, the set of supported locales can be extended by running the `localedef` command with pre-existing language and character mapping definitions.
For example, `localedef -i cs_CZ -f CP1250 cs_CZ.CP1250` creates a new locale named `cs_CZ.CP1250` using the Windows-1250 encoding.
cs_CZ.cp1250|cs-CZ
- A locale identifier with two alternatives. If `cs_CZ.cp1250` is not found, `cs-CZ` is attempted next.
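
The fallback between alternatives can be approximated in standard C++ by trying each name in turn until one is accepted. This is a minimal sketch that assumes name lookup behaves like the `std::locale` constructor; `first_available_locale` is a hypothetical helper, not a PawnPlus function.

```cpp
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Return the first name that the C++ runtime accepts as a locale name, or an
// empty string when none of them is available on this system.
std::string first_available_locale(const std::vector<std::string> &names)
{
    for (const std::string &name : names)
    {
        try
        {
            // The constructor throws std::runtime_error for unknown names.
            std::locale probe(name.c_str());
            (void)probe;
            return name;
        }
        catch (const std::runtime_error &)
        {
            // Not available here; try the next alternative.
        }
    }
    return std::string();
}
```

Called with `{"cs_CZ.cp1250", "cs-CZ"}`, this returns whichever of the two names the current system recognizes first, or an empty string if neither is installed.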