Description
Based on the actual utf8proc_NFKC implementation, I managed to write an NFKC normalization C++ function that operates directly on UTF-32 code points:
#include <utf8proc.h>
#include <iterator> // std::size (C++17)
#include <vector>

bool tryNormalizeNFKC(const std::vector<char32_t>& codePoints, std::vector<char32_t>& normalized)
{
    normalized.clear();
    normalized.reserve(codePoints.size());
    char32_t buff[8];
    utf8proc_ssize_t rc;
    int lastBoundClass = 0; // only consulted when UTF8PROC_CHARBOUND is set
    for (size_t i = 0; i < codePoints.size(); i++)
    {
        // NOTE: UTF8PROC_DECOMPOSE is undocumented for utf8proc_decompose_char but it's necessary
        rc = utf8proc_decompose_char(codePoints[i], (utf8proc_int32_t*)buff,
            (utf8proc_ssize_t)std::size(buff),
            (utf8proc_option_t)(UTF8PROC_DECOMPOSE | UTF8PROC_COMPAT), &lastBoundClass);
        // rc < 0 is an error; rc > bufsize means the buffer was too small
        if (rc < 0 || rc > (utf8proc_ssize_t)std::size(buff))
            goto Fail;
        normalized.insert(normalized.end(), buff, buff + rc);
    }
    // Canonically reorder and compose the decomposed sequence in place
    rc = utf8proc_normalize_utf32((utf8proc_int32_t*)normalized.data(),
        (utf8proc_ssize_t)normalized.size(), (utf8proc_option_t)(UTF8PROC_COMPOSE | UTF8PROC_STABLE));
    if (rc < 0)
        goto Fail;
    normalized.resize((size_t)rc); // composition can only shrink the sequence
    return true;
Fail:
    normalized.clear();
    return false;
}
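For illustration, a minimal sanity check of the function above (assuming the includes already shown; U+FB01 is the 'fi' ligature, whose NFKC form is 'f' followed by 'i'):

#include <cassert>

int main()
{
    // U+FB01 LATIN SMALL LIGATURE FI followed by plain 'A'
    std::vector<char32_t> input = { char32_t(0xFB01), U'A' };
    std::vector<char32_t> output;
    bool ok = tryNormalizeNFKC(input, output);
    assert(ok);
    // NFKC applies the compatibility mapping: expect { 'f', 'i', 'A' }
    assert((output == std::vector<char32_t>{ U'f', U'i', U'A' }));
    return 0;
}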
This is more convenient for me than utf8proc_NFKC, since I already have the vector of char32_t code points, which I also need to postprocess further after normalization. The only problem I found is that UTF8PROC_DECOMPOSE and UTF8PROC_COMPOSE are not documented as accepted flags for utf8proc_decompose_char, yet one of the two is necessary to perform the desired transformation. Given that the function already has 'decompose' in its name, this is even more confusing (I got it working through trial and error and a bit of luck).
Could you also clarify a couple of other things:
- What's the maximum size utf8proc_decompose_char needs for the dst buffer? (I guess a static maximum value exists.)
- I noticed UTF8PROC_STABLE may currently be unused in the utf8proc code, correct?