
Missing documentation of UTF8PROC_DECOMPOSE, UTF8PROC_COMPOSE flags in utf8proc_decompose_char #290

Open · ceztko opened this issue Mar 16, 2025 · 4 comments

ceztko commented Mar 16, 2025

Based on the actual utf8proc_NFKC implementation, I successfully wrote an NFKC normalization C++ function that operates directly on UTF-32 code points:

#include <vector>
#include <iterator> // for std::size
#include "utf8proc.h"

using std::vector;

bool tryNormalizeNFKC(const vector<char32_t>& codePoints, vector<char32_t>& normalized)
{
    normalized.clear();
    normalized.reserve(codePoints.size());

    char32_t buff[8];
    utf8proc_ssize_t rc;
    int lastBoundClass = 0; // must be zero-initialized before the first call
    for (size_t i = 0; i < codePoints.size(); i++)
    {
        // NOTE: UTF8PROC_DECOMPOSE is undocumented for utf8proc_decompose_char but it's necessary
        rc = utf8proc_decompose_char(codePoints[i], (utf8proc_int32_t*)buff, std::size(buff),
            (utf8proc_option_t)(UTF8PROC_DECOMPOSE | UTF8PROC_COMPAT), &lastBoundClass);
        if (rc < 0 || rc > (utf8proc_ssize_t)std::size(buff))
            goto Fail;

        normalized.insert(normalized.end(), buff, buff + rc);
    }

    // Canonically recompose the fully decomposed sequence in place
    rc = utf8proc_normalize_utf32((utf8proc_int32_t*)normalized.data(),
        (utf8proc_ssize_t)normalized.size(), (utf8proc_option_t)(UTF8PROC_COMPOSE | UTF8PROC_STABLE));
    if (rc < 0)
        goto Fail;

    normalized.resize((size_t)rc);
    return true;

Fail:
    normalized.clear();
    return false;
}

This is more convenient for me than utf8proc_NFKC, since I already have the vector of char32_t code points, which I also need to post-process further after the normalization. The only problem I found is that UTF8PROC_DECOMPOSE and UTF8PROC_COMPOSE are not documented as accepted flags for utf8proc_decompose_char, yet one of the two is necessary to perform the desired transformation. Considering that the function already has 'decompose' in its name, this is even more confusing (I got it working only through trial and error and a bit of luck).
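To illustrate, here is a minimal standalone sketch of the behavior I observed (the ligature example and the printout are just for the demo):

#include <cstdio>
#include "utf8proc.h"

int main()
{
    utf8proc_int32_t buff[8];
    int lastBoundClass = 0;

    // Without UTF8PROC_DECOMPOSE (or UTF8PROC_COMPOSE), the small ligature fl
    // (U+FB01) passes through unchanged as a single code point
    utf8proc_ssize_t rc = utf8proc_decompose_char(0xFB01, buff, 8,
        UTF8PROC_COMPAT, &lastBoundClass);
    std::printf("COMPAT only:      %ld code point(s), first = U+%04X\n",
        (long)rc, (unsigned)buff[0]);

    // With UTF8PROC_DECOMPOSE, the compatibility decomposition 'f' + 'l' is produced
    lastBoundClass = 0;
    rc = utf8proc_decompose_char(0xFB01, buff, 8,
        (utf8proc_option_t)(UTF8PROC_DECOMPOSE | UTF8PROC_COMPAT), &lastBoundClass);
    std::printf("DECOMPOSE|COMPAT: %ld code point(s): U+%04X U+%04X\n",
        (long)rc, (unsigned)buff[0], (unsigned)buff[1]);
    return 0;
}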

While you're at it, could you also clarify a couple of other things:

  • What's the maximum size I need for the dst buffer of utf8proc_decompose_char (I assume there exists a static maximum value)?
  • I noticed UTF8PROC_STABLE currently appears to be unused in the utf8proc code, correct?
stevengj (Member) commented Mar 19, 2025

What's the maximum size I need for the dst buffer of utf8proc_decompose_char

4 bytes. This should really be documented explicitly, but it's intrinsic to the UTF-8 encoding. (Edit: sorry, I was thinking of the encoding.)

Usually, to be safe, we call it twice: once to get the buffer size and once to do the decomposition. I agree that in principle there should be an upper bound, but unfortunately it may depend on the Unicode version. I'm not sure what the current upper bound is, but it could be computed easily and put in the docs (with a test to make sure that it doesn't need to be changed in future Unicode versions).

The problem with documenting the current upper bound, however, is that updating the Unicode version may then potentially break binary compatibility, even if the API doesn't otherwise change.

So the safest thing is to set some reasonable upper bound on the buffer size, but always explicitly check for an error return to see if you need a bigger buffer.
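For example, here is a rough sketch of that pattern (my own illustration, relying on utf8proc_decompose_char reporting the required size via a non-negative return value larger than bufsize):

#include <vector>
#include "utf8proc.h"

// Appends the compatibility decomposition of cp to out, growing the buffer on demand
bool decomposeChar(utf8proc_int32_t cp, std::vector<utf8proc_int32_t>& out, int& lastBoundClass)
{
    utf8proc_int32_t small[8]; // reasonable upper bound for the common case
    utf8proc_ssize_t rc = utf8proc_decompose_char(cp, small, 8,
        (utf8proc_option_t)(UTF8PROC_DECOMPOSE | UTF8PROC_COMPAT), &lastBoundClass);
    if (rc < 0)
        return false; // hard error
    if (rc <= 8)
    {
        out.insert(out.end(), small, small + rc);
        return true;
    }

    // The fixed buffer was too small: rc is the required size, so retry once
    std::vector<utf8proc_int32_t> big((size_t)rc);
    rc = utf8proc_decompose_char(cp, big.data(), (utf8proc_ssize_t)big.size(),
        (utf8proc_option_t)(UTF8PROC_DECOMPOSE | UTF8PROC_COMPAT), &lastBoundClass);
    if (rc < 0 || (size_t)rc > big.size())
        return false;
    out.insert(out.end(), big.begin(), big.begin() + rc);
    return true;
}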

I noticed UTF8PROC_STABLE currently appears to be unused in the utf8proc code, correct?

I think so, yes — all of the decompositions are already backwards compatible.

ceztko (Author) commented Mar 19, 2025

4 bytes. This should really be documented explicitly, but it's intrinsic to the UTF-8 encoding.

Hmmm... UTF-8 should not be relevant at all in this transformation. It's the decomposition that, for example, converts the small ligature fl (the single code point \ufb01) into the two code points f and l. This is intrinsic to Unicode, not UTF-8, and it depends on actual natural-language scripts. Do you happen to remember the maximum number of code points a single code point can be decomposed into?

ceztko (Author) commented Mar 19, 2025

OK, I read the edited answer. If you compute the value, please update the docs/this issue. Sorry, I'm a newbie utf8proc user, but I'm happy I could use it for the task and integrate it nicely into C++, without spurious heap allocations.

stevengj (Member) commented:
I've computed the value: it's currently 4 chars. I have a PR that adds a note to the documentation (while noting that the value may increase in future Unicode versions), along with a test to make sure that the hint remains current: #291
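For reference, here is a brute-force sketch along the lines of what I computed (my own illustration, not the actual test in #291; the result depends on the Unicode version utf8proc was built with):

#include <cstdio>
#include "utf8proc.h"

// Largest single-code-point decomposition for the given options. Note that even
// if the scratch buffer were too small, the return value is still the required size.
static long maxDecomposition(utf8proc_option_t options)
{
    utf8proc_int32_t buff[64];
    utf8proc_ssize_t maxLen = 0;
    for (utf8proc_int32_t cp = 0; cp <= 0x10FFFF; cp++)
    {
        if (!utf8proc_codepoint_valid(cp))
            continue; // skip surrogates and other invalid values
        int lastBoundClass = 0;
        utf8proc_ssize_t rc = utf8proc_decompose_char(cp, buff, 64, options, &lastBoundClass);
        if (rc > maxLen)
            maxLen = rc;
    }
    return (long)maxLen;
}

int main()
{
    std::printf("canonical: %ld, compatibility: %ld\n",
        maxDecomposition(UTF8PROC_DECOMPOSE),
        maxDecomposition((utf8proc_option_t)(UTF8PROC_DECOMPOSE | UTF8PROC_COMPAT)));
    return 0;
}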
