Bug: Hybrid sorting algorithm is incorrect #208

ashbob999 · 2025-01-03T16:41:21Z

Describe the bug

The hybrid sorting algorithm in bench_sort.cpp has some issues leading to an incorrect performance improvement of 3x (#173).

Issue 1

The 4 characters are inserted into the upper 32-bits of the order variable in the wrong order. So str[3] is the most significant byte of the order, leading to incorrect initial sorting order.

// What if we take up-to 4 first characters and the index
    for (size_t i = 0; i != strings.size(); ++i)
        std::memcpy((char *)&order[i] + offset_in_word, strings[order[i]].c_str(),
                    std::min<std::size_t>(strings[order[i]].size(), 4ul));

Issue 2

Currently offset_in_word is set to 0, which means that the order array which holds the 32-bit string order indexes, gets overridden by the first 4 chars of the string, then subsequently cleared. So the final sort does nothing as all values in order are zero.

Although setting offset_in_word to 4 works for hybrid_sort_cpp it breaks hybrid_stable_sort_cpp.

Fixes

After temporarily fixing the above issues it becomes about 1.1x slower than sz_sort. In which the bottle neck seems to be the final sort.

I managed to improve this by calling std::sort on each of the partially sorted chunks instead of the all strings, which makes it ~2.25x faster than sz_sort.

Fixed version (with `offset_in_word=4`)

static idx_t hybrid_sort_cpp_v2(strings_t const &strings, sz_u64_t *order) {

    // What if we take up-to 4 first characters and the index
    for (size_t i = 0; i != strings.size(); ++i) {
        size_t index = order[i];

        for (size_t j = 0; j < std::min<std::size_t>(strings[(sz_size_t)index].size(), 4ul); ++j) {
            std::memcpy((char *)&order[i] + offset_in_word + 3 - j, strings[(sz_size_t)index].c_str() + j, 1ul);
        }
    }

    std::sort(order, order + strings.size(), [&](sz_u64_t i, sz_u64_t j) {
        char *i_bytes = (char *)&i;
        char *j_bytes = (char *)&j;
        return *(uint32_t *)(i_bytes + offset_in_word) < *(uint32_t *)(j_bytes + offset_in_word);
    });

    const auto extractBytes = [](sz_u64_t v) -> uint32_t {
        char *bytes = (char *)&v;
        return *(uint32_t *)(bytes + offset_in_word);
    };

    if (strings.size() >= 2) {
        size_t prevIndex = 0;
        uint64_t prevBytes = extractBytes(order[0]);

        for (size_t i = 1; i < strings.size(); ++i) {
            //
            uint32_t bytes = extractBytes(order[i]);
            if (bytes != prevBytes) {
                std::sort(order + prevIndex, order + i, [&](sz_u64_t i, sz_u64_t j) {
                   // Assumes: offset_in_word==4
                    sz_size_t i_index = i & 0xFFFF'FFFF;
                    sz_size_t j_index = j & 0xFFFF'FFFF;
                    return strings[i_index] < strings[j_index];
                });
                prevIndex = i;
                prevBytes = bytes;
            }
        }

        std::sort(order + prevIndex, order + strings.size(), [&](sz_u64_t i, sz_u64_t j) {
            sz_size_t i_index = i & 0xFFFF'FFFF;
            sz_size_t j_index = j & 0xFFFF'FFFF;
            return strings[i_index] < strings[j_index];
        });
    }

    for (size_t i = 0; i != strings.size(); ++i) std::memset((char *)&order[i] + offset_in_word, 0, 4ul);

    return strings.size();
}

I did not create a PR, because these fixes break the stable sort.

Steps to reproduce

See Above.

Expected behavior

See Above.

StringZilla version

3.11.3

Operating System

Windows 10

Hardware architecture

x86

Which interface are you using?

Other bindings

Contact Details

No response

Are you open to being tagged as a contributor?

I am open to being mentioned in the project .git history as a contributor

Is there an existing issue for this?

I have searched the existing issues

Code of Conduct

I agree to follow this project's Code of Conduct

The text was updated successfully, but these errors were encountered:

ashvardanian · 2025-01-03T17:27:31Z

Amazing! I will look into ways to fix this.

ashbob999 added the bug Something isn't working label Jan 3, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: Hybrid sorting algorithm is incorrect #208

Bug: Hybrid sorting algorithm is incorrect #208

ashbob999 commented Jan 3, 2025

ashvardanian commented Jan 3, 2025 •

edited

Loading

Bug: Hybrid sorting algorithm is incorrect #208

Bug: Hybrid sorting algorithm is incorrect #208

Comments

ashbob999 commented Jan 3, 2025

Describe the bug

Issue 1

Issue 2

Fixes

Steps to reproduce

Expected behavior

StringZilla version

Operating System

Hardware architecture

Which interface are you using?

Contact Details

Are you open to being tagged as a contributor?

Is there an existing issue for this?

Code of Conduct

ashvardanian commented Jan 3, 2025 • edited Loading

ashvardanian commented Jan 3, 2025 •

edited

Loading