Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Last elements in basic_charset initialization are discarded #200

Open
3 tasks done
alexbarev opened this issue Dec 6, 2024 · 2 comments
Open
3 tasks done

Bug: Last elements in basic_charset initialization are discarded #200

alexbarev opened this issue Dec 6, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@alexbarev
Copy link
Collaborator

alexbarev commented Dec 6, 2024

Describe the bug

When initializing a basic_charset with character arrays such as those returned by digits() and others, the last element of the array is incorrectly discarded.

inline carray<10> const &digits() noexcept {
static carray<10> const all = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'};
return all;
}

This is due to the use of count_characters - 1 in the loop that populates the basic_charset bitset.

template <std::size_t count_characters>
explicit basic_charset(char_type const (&chars)[count_characters]) noexcept : basic_charset() {
static_assert(count_characters > 0, "Character array cannot be empty");
for (std::size_t i = 0; i < count_characters - 1; ++i) { // count_characters - 1 to exclude the null terminator
char_type c = chars[i];
bitset_._u64s[sz_bitcast(sz_u8_t, c) >> 6] |= (1ull << (sz_bitcast(sz_u8_t, c) & 63u));
}

While this prevents including a null terminator for string literals like sz::char_set("x"), it causes incorrect behavior when handling character arrays that do not have a null terminator, resulting in the exclusion of the final character.

I believe this local block is the full extent of the affected code.

inline char_set ascii_letters_set() { return char_set {ascii_letters()}; }
inline char_set ascii_lowercase_set() { return char_set {ascii_lowercase()}; }
inline char_set ascii_uppercase_set() { return char_set {ascii_uppercase()}; }
inline char_set ascii_printables_set() { return char_set {ascii_printables()}; }
inline char_set ascii_controls_set() { return char_set {ascii_controls()}; }
inline char_set digits_set() { return char_set {digits()}; }
inline char_set hexdigits_set() { return char_set {hexdigits()}; }
inline char_set octdigits_set() { return char_set {octdigits()}; }
inline char_set punctuation_set() { return char_set {punctuation()}; }
inline char_set whitespaces_set() { return char_set {whitespaces()}; }
inline char_set newlines_set() { return char_set {newlines()}; }
inline char_set base64_set() { return char_set {base64()}; }

Steps to reproduce

#include "stringzilla/stringzilla.hpp"

namespace sz = ashvardanian::stringzilla;

int main() {
        sz::string haystack = "239";

        // Test with null-terminated string
        assert(haystack.contains_only(sz::char_set("0123456789")));
        // Passes: null terminator is correctly discarded

        // Test with initializer list
        static std::initializer_list all = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'};
        assert (haystack.contains_only(sz::char_set {all}));
        // Passes: constructor for initializer list is called

        // Test with carray
        sz::carray<10> all_digits = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'};
        assert(haystack.contains_only(sz::char_set {all_digits}));
        assert(haystack.is_digit());
        // Fails: '9' is incorrectly discarded
}

Expected behavior

No asserts

StringZilla version

v3.11.0

Operating System

Ubuntu 22.04.5

Hardware architecture

x86

Which interface are you using?

C++ bindings

Contact Details

No response

Are you open to being tagged as a contributor?

  • I am open to being mentioned in the project .git history as a contributor

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
@alexbarev alexbarev added the bug Something isn't working label Dec 6, 2024
@alexbarev
Copy link
Collaborator Author

alexbarev commented Dec 7, 2024

I will be able to write fix and add tests tomorrow

Proposed Fix

  1. Using carray with Null Terminator
inline carray<11> const &digits() noexcept { 
    static carray<11> const all = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '\0'}; 
    return all;
}
  1. Using std::initializer_list<char>
inline auto const &digits() noexcept {
    static std::initializer_list all = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'};
    return all;
}

Then the constructor for the initializer list will be called here:

inline char_set digits_set() { return char_set {digits()}; }

@ashvardanian
Copy link
Owner

Thank you @alexbarev! Let's start with tests, and later merge a patch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants