Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use std::byte for strings #7

Open
tleydxdy opened this issue Feb 20, 2024 · 8 comments
Open

use std::byte for strings #7

tleydxdy opened this issue Feb 20, 2024 · 8 comments

Comments

@tleydxdy
Copy link

Hi I'm new to the whole bencode thing so thanks a lot for this easy to use library! I was wondering, can a bencoded string include any arbitrary bytes, if so would std::byte be a better container for it?

@jimporter
Copy link
Owner

jimporter commented Feb 20, 2024

The specification doesn't define the encoding, but in practice, most cases of bencoded data just use ASCII strings. You should be able to use any string encoding you like though, including a string of bytes. Just use bencode::basic_data with your favorite types, like:

using bdata = bencode::basic_data<std::variant, long long, my_byte_string, std::vector, bencode::map_proxy>;

(Well, UTF-16 might not work. It probably wouldn't be hard to make it work, but that'd make for a very weird payload - ASCII for the "keywords" and UTF-16 for the string values...)

@tleydxdy
Copy link
Author

the "node_id" for the mainline DHT is what's troubling me. it is just a random 160 bit value, it could contain any byte value even null bytes, so using a string seems dangerous for me here.

The specification doesn't define the encoding

Yes, so imo using an interface that can "interpret" the content is dangerous. The user knows the semantic of the bencoded data they are dealing with best, and they can parse the raw bytes into the right encoding with proper validation.

@jimporter
Copy link
Owner

jimporter commented Feb 21, 2024

the "node_id" for the mainline DHT is what's troubling me. it is just a random 160 bit value, it could contain any byte value even null bytes, so using a string seems dangerous for me here.

Dangerous how? std::string doesn't require a null-terminated string and is perfectly fine with embedded nulls. std::string also doesn't define a character encoding, which I think makes it a good choice to use for a sequence of bytes with undefined encoding.

EDIT: Of course, if you want to be extra-careful in your project to avoid passing around a std::string that you might misinterpret, I don't think there should be any issue with getting this library to use a different type for string values that avoid this issue (though note that the keys of dictionaries are also byte strings, so it could complicate key lookups).

@jimporter
Copy link
Owner

jimporter commented Feb 21, 2024

All that said, maybe it would be useful for this library to preserve the underlying byte type when decoding. That is, if you pass it a std::string to decode, you get a variant containing std::strings. If instead you pass a std::vector<std::byte>, then you get a variant whose string type is likewise std::vector<std::byte>. (Ditto std::basic_string<std::byte>.) That shouldn't be too hard.

@tleydxdy
Copy link
Author

I agree, using std::string "works fine" because of what you mentioned, however the std::string also provides opportunity for foot-guns, especially arithmetic op on signed chars and char size (char can be wider than 8 bit). I think my feeling of unease comes from the fact that "char" is meant to hold a character from the machine-native character set, but bencode is a wire format that is machine independent.

@jimporter
Copy link
Owner

char can be wider than 8 bit

The same applies to std::byte though. The standard defines it as enum class byte : unsigned char {};. Maybe bencode.hpp should just static_assert that CHAR_BIT is 8; I doubt things would work without that.

I think my feeling of unease comes from the fact that "char" is meant to hold a character from the machine-native character set, but bencode is a wire format that is machine independent.

I'm not sure that's how I'd interpret the standard. char is certainly meant to be able to hold a character from the execution character set, and that there's an encoding associated with character literals (the "ordinary literal encoding"). However, the C++ standard also calls out support for char holding multibyte code units (e.g. individual bytes from UTF-8 or Shift JIS), so char is already specced to hold multiple encodings. You could argue that that's due to its C legacy, but I'm not sure using std::byte for Shift JIS text is really in the spirit of std::byte either, as its paper says it's not a character type. So I think char makes sense as the underlying type for "string of character bytes with some arbitrary encoding".

In any case, this is somewhat different from the more-specific concern that some "strings" in a bencoded message are pure binary data, rather than a sequence of characters with some particular encoding. In that case, there's probably some practical value in working with std::bytes, as the bytes aren't characters in the first place.

@tleydxdy
Copy link
Author

hmm you are right, std::byte is also variable width. what a mess. I guess the only sensible choice would be uint8_t then.

yes, I agree with you that any sort of text encoding should probably be accessed via some sort of character type. but in bencode a "string" is really just a byte vector, it is not nessesarily used to store text, even in the bep3 spec it is used to store raw sha1 hashes rather than say a hex coded hash.

I think makes the most sense is to by default provide raw bytes, and the user can cast to the right type/encoding base on external knowledge they have about what the bencoded thing "is". this is what bep3 does as well, every time a bencoded string is meant to be text it mentions it explicitly (also the encoding), e.g.:

The name key maps to a UTF-8 encoded string which is the suggested name to save the file (or directory) as.

@jimporter
Copy link
Owner

hmm you are right, std::byte is also variable width. what a mess. I guess the only sensible choice would be uint8_t then.

std::uint8_t is an optional type, so it's not available on all platforms. If it is defined, then char and std::byte would have the required properties on that platform too.

yes, I agree with you that any sort of text encoding should probably be accessed via some sort of character type. but in bencode a "string" is really just a byte vector, it is not nessesarily used to store text, even in the bep3 spec it is used to store raw sha1 hashes rather than say a hex coded hash.

Once I implement it, you'll be able to provide a string (or stream) of std::bytes to bencode::decode, and you'll get back a variant whose string type is std::basic_string<std::byte>. However, I don't intend to change the behavior if you provide a string/stream of chars: you'll get back a variant whose string type is std::string, since that's the type compatible with the input.

This way it's up to callers to specify what they want, since bencode.hpp won't do any type conversion on the underlying element type whatsoever (aside from checking elements with syntactic meaning of course).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants