Missing documentation of UTF8PROC_DECOMPOSE, UTF8PROC_COMPOSE flags in utf8proc_decompose_char #290

Open
@ceztko

Description

Based on the actual utf8proc_NFKC implementation, I successfully wrote an NFKC normalization C++ function that operates directly on UTF-32 code points:

bool tryNormalizeNFKC(const vector<char32_t>& codePoints, vector<char32_t>& normalized)
{
    normalized.clear();
    normalized.reserve(codePoints.size());

    char32_t buff[8];
    utf8proc_ssize_t rc;
    int lastBoundClass = 0; // only consulted by utf8proc when UTF8PROC_CHARBOUND is set
    for (size_t i = 0; i < codePoints.size(); i++)
    {
        // NOTE: UTF8PROC_DECOMPOSE is undocumented for utf8proc_decompose_char but it's necessary
        rc = utf8proc_decompose_char(codePoints[i], (utf8proc_int32_t*)buff, std::size(buff),
            (utf8proc_option_t)(UTF8PROC_DECOMPOSE | UTF8PROC_COMPAT), &lastBoundClass);
        // A result larger than the buffer means the buffer was too small
        if (rc < 0 || rc > (utf8proc_ssize_t)std::size(buff))
            goto Fail;

        normalized.insert(normalized.end(), buff, buff + rc);
    }

    rc = utf8proc_normalize_utf32((utf8proc_int32_t*)normalized.data(),
        (utf8proc_ssize_t)normalized.size(), (utf8proc_option_t)(UTF8PROC_COMPOSE | UTF8PROC_STABLE));

    if (rc < 0)
        goto Fail;

    normalized.resize((size_t)rc);
    return true;

Fail:
    normalized.clear();
    return false;
}

This is more convenient for me than utf8proc_NFKC, since I already have the vector of char32_t code points, which I also need to post-process after normalization. The only problem I found is that neither UTF8PROC_DECOMPOSE nor UTF8PROC_COMPOSE is documented as an accepted flag for utf8proc_decompose_char, yet one of the two is necessary to perform the desired transformation. Since the function already has 'decompose' in its name, this is even more confusing (I only got it working through trial and error and a bit of luck).

If you don't mind, could you also clarify a couple of other things:

  • What is the maximum size the dst buffer needs to be for utf8proc_decompose_char (I assume there is a static maximum)?
  • I noticed that UTF8PROC_STABLE may currently be unused in the utf8proc code. Is that correct?
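
Until the maximum is documented, one way to avoid hard-coding a buffer size is to rely on the return value. As I read the comments in utf8proc.h, utf8proc_decompose_char returns the required buffer size when the result would not fit in bufsize, so the call can be retried with a larger buffer. The sketch below illustrates that retry pattern only. decomposeAll and the DecomposeFn callable are hypothetical names that are not part of utf8proc, and the callable is assumed to follow the same return convention (count written, required size if too small, negative on error).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of utf8proc. DecomposeFn is any callable
//   std::ptrdiff_t(char32_t cp, std::int32_t* dst, std::ptrdiff_t bufsize)
// assumed to return the number of code points written, the required size
// when bufsize is too small, or a negative value on error.
template <class DecomposeFn>
bool decomposeAll(const std::vector<char32_t>& in,
                  std::vector<char32_t>& out,
                  DecomposeFn decompose)
{
    out.clear();
    out.reserve(in.size());

    std::vector<std::int32_t> buf(4); // initial guess, grown on demand
    for (char32_t cp : in)
    {
        std::ptrdiff_t rc =
            decompose(cp, buf.data(), static_cast<std::ptrdiff_t>(buf.size()));

        // A result larger than the buffer is the required size: grow and retry once.
        if (rc > static_cast<std::ptrdiff_t>(buf.size()))
        {
            buf.resize(static_cast<std::size_t>(rc));
            rc = decompose(cp, buf.data(), static_cast<std::ptrdiff_t>(buf.size()));
        }

        // Negative means error; still too large means the callable is inconsistent.
        if (rc < 0 || rc > static_cast<std::ptrdiff_t>(buf.size()))
        {
            out.clear();
            return false;
        }

        for (std::ptrdiff_t i = 0; i < rc; ++i)
            out.push_back(static_cast<char32_t>(buf[static_cast<std::size_t>(i)]));
    }
    return true;
}
```

With utf8proc, the callable could be a lambda that forwards to utf8proc_decompose_char(cp, dst, bufsize, options, &lastBoundClass) with the same flags as in the function above, assuming utf8proc_int32_t and utf8proc_ssize_t match std::int32_t and std::ptrdiff_t on the target platform. The result can then be passed to utf8proc_normalize_utf32 exactly as before.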
