Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

llama : support raw NUL bytes in tokens #8992

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

compilade
Copy link
Collaborator

There are some models with NUL (\0) in their vocab, notably InternLM2 which has both <0x00> and a literal \0 token.

The GGUF format uses counted strings, so using NUL would not be a problem if gguf_get_val_str returned something more than a NUL-terminated const char *.

This adds gguf_get_val_str_n and gguf_get_arr_str_n to get the length of a string stored in GGUF metadata. The constructor for std::string supports using a length, so it's used in src/llama.cpp.

Keeping the NUL token in InternLM2 no longer makes src/llama.cpp crash, and so it should now correctly tokenize strings containing NUL bytes when using that tokenizer.

Technically, gguf_get_val_str_n might not strictly be needed because metadata usually never contains NUL bytes, but I'm including it for consistency with gguf_get_arr_str_n.

Alternative implementations

A separate function to get the length of a string is not ideal, because each string accessed in GGUF needs 2 calls to be safe with NUL bytes.

Some other ideas:

  • gguf_get_arr_str could return a gguf_str
    • would be a breaking change
    • That type would need to get exported in ggml.h
    • No implicit conversion to std::string
  • gguf_get_arr_str could return the length by modifying a value by reference
    • would be a breaking change
    • not usable inline when creating a std::string

I could be missing something, in which case I'd welcome a cleaner solution.


TODO

  • Should the return type of gguf_get_val_str_n and gguf_get_arr_str_n be something else than int, like size_t or uint64_t?
  • Should gguf_set_val_str also support NUL bytes in strings?
    • Maybe not?
  • Test InternLM2's tokenizer with test-tokenizer-random.py

Note that I consider this to be somewhat low priority.

@compilade compilade added need feedback Testing and feedback with results are needed python python script changes ggml changes relating to the ggml tensor library for machine learning labels Aug 12, 2024
@ggerganov
Copy link
Owner

Should the return type of gguf_get_val_str_n and gguf_get_arr_str_n be something else than int, like size_t or uint64_t?

The internal n members in gguf are uint64_t but the API returns int, so there is a bit of inconsistency there. It's OK to return int here and at some point we can resolve the inconsistency all together

Should gguf_set_val_str also support NUL bytes in strings?

gguf_str technically supports to have NUL bytes in the string, so I guess we can add this. But maybe later, if there is need to do so

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
examples ggml changes relating to the ggml tensor library for machine learning need feedback Testing and feedback with results are needed python python script changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants